1 Learning goals

Sequence alignments should be a familiar topic from introductory bioinformatics. Most people have used BLAST to search with some sequence against some database. In this module we will have a look at various forms of alignment methods related to microbial genomics, and how we can use them on our HPC.

In this module we first focus on the mapping of reads to some reference genome. The aim is to

Understand the idea behind the Burrows-Wheeler transform for super-fast matching of short reads to some reference.
How to build an index from a genome and map short reads to this using a software tool.
How to process the .sam-files we get from such read-mapping, to compute read coverages.
Start to get some grips on how to read the help documentation we get from command line tools.

We also find place in this module to look at how we can search for specific features (genes) in our assembled contigs or by using the reads directly. This is useful whenever you are looking for something specific in your sequenced genomes. Further aims:

Learn how to make scripts to perform sequence similarity searches with BLAST
Learn how to do similar searches directly among the reads, to make the results independent of the assembly
We will also see how to include R code in our shell scripts, and pass information from shell to R

1.1 Software used in this module

2 Mapping short reads

Mapping reads to some reference means we take each sequenced read and try to find a place on some reference sequence where it aligns. Typically, the reference sequence is a genome already assembled from some other data, while the reads have been sequenced from a new genome, being very similar to the reference. We may then find

Deletions. If some positions on the reference has no reads mapping, this region is probably deleted in the new genome.
Insertions. This is more indirectly established, but if a number of reads never map to the reference, and these overlap each other severely, it indicates some new genomic region in the new genome that was never seen in the reference.
Duplications. If a region has much larger coverage than the reference had, the new genome must have more copies of this region.
The SNV’s, i.e. Single Nucleotide Variants. These are single bases where this new genome has a different base than the reference.

Another example of the use of read mapping is when we Use our assembled contigs as the reference, and we map the same reads to these to get the exact read coverage for each contig. This may be used to evaluate contigs. If we have sequenced a single genome, we expect all contigs to have more or less the same read coverage.

Thus, all kinds of read mapping requires some reference. Note that a reference does not have to be a full genome. It could be we are only interested in some smaller part of the genome(s) here, even a single gene.

Read mapping to some reference genome is perhaps not as commonly used for microbes as it is in eukaryotic genomics. The reason is the difference in genomes. When sequencing the genome of a fish, plant or animal, one individuals genome is extremely similar to all the other individuals of the same species. For humans we have The human reference genome, and all persons’ genomes are extremely similar to this. Using this as the reference, we expect that virtually all reads from the new person will match somewhere, and most of them matches perfectly. Some places we may find Single Nucleotide Variants (SNVs) or other small genotype deviations from the reference that may be potentially important for the phenotype of the new person. In eukaryotes ‘all’ individuals from the same species share the same genes, and the differences in phenotype are largely due to small deviations in the regulatory parts of the genome. This is true even across closely related species. Read mapping makes sense since the ‘average genome’ will always be a very good reference for any single genome.

In the bacterial world, however, the strains are the ‘individuals’, and even between strains of the same species there may be huge genotype differences. As an example, between two strains of E. coli there can be as much as 20% of the genes present in one strain lacking completely in the other! Thus, finding a proper reference to map against is not as straightforward, since other genomes of the same species may be quite different from the one we have sequenced. Also, one may question the relevance of small SNV variants we may find if there are at the same time huge differences involving entire genes.

Anyway, we should be able to do some readmapping, i.e. aligning reads to (pieces of) genomes, and understand some of the basics behind it.

2.1 The Burrows-Wheeler transform

Assume we have done some Illumina sequencing, and we have a set of many short and precise reads. We would like to align these to some reference, i.e. finding where they match.

You should be familar with alignments from a basic course in bioinformatics, e.g. by the use of dynamic programming or BLAST. Aligning hundreds of thousands of short reads to a long reference sequence sounds like an enormous computer job even for BLAST. It is, and if you try to use BLAST for this, it would take quite some time. Note that BLAST will work, but it is slow. Also, with short and precise reads we would be stricter than BLAST. BLAST looks for local similarities, but here we expect to see that

The entire short read aligns against…
…a short piece of the long reference

BLAST is not optimized to handle this special situation. BLAST is an excellent tool for searching with a protein sequence in a database of proteins (we will see later in thsi module). The problem we are faced with here is different:

The query sequences (reads) are all short.
The ‘database’ consist of one (or a few) very long sequences (genomes).
Most of the queries match the reference perfectly.
We are only interested in where on the reference a query matches, not the alignment itself.

In the following we will focus on exact matching. The Burrows-Wheeler and FM-indexing is best explained in this setting, but in practice it has been extended to cope with a small number of mismatches as well. If a read has many mismatches, it will not align with this approach, but this is fine actually. Such reads are probably full of errors, and cannot be trusted anyway.

Ben Langmead, that we saw in some previous videos, has also written some texts for his online course.

Here is one such text, which is in my eyes a very good introduction to Burrows-Wheeler and the FM-index. You should take some time reading through this on your own.

Here is my video lecture over this topics building on what Ben Langmead writes.

2.2 The `bowtie2` software

Let us use the bowtie2 software to do some read mapping.

In module 4 we assembled reads, and used quast to compare the contigs to a reference genome. This reference genome should in this particular case actually be identical to the genome that was sequenced, not just some near relative. Let us now map the reads directly to this genome. We will then count how many reads cover each position in the genome, i.e. the read coverage at each single position. Remember from a previous module how read coverage should, in an ideal situation with no sequencing errors etc, be a Poisson variable. Let us now see to what extent this is the case here.

First, download the newest container for bowtie2 from galaxy, and save in your module5 folder. NB! It is not the bioconductor-container (that may pop up on the galaxy website) you need, but one named only bowtie2 plus some version numbering.

We need to create the database or index containing the reference. Let us start building the shell script:

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=30G                        # The amount of memory reserved
#SBATCH --time=01:00:00                  # Runs for maximum this time
#SBATCH --job-name=bowtie2               # Sensible name for the job
#SBATCH --output=bowtie2_%j.log          # Logfile output here

##############
### Settings
###
threads=10
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=$SCRATCH/bowtie2
index=$out_dir/bowtie2_index

####################################
### Testing if $out_dir exists
### If yes, delete all files in it
### If no, create it
###
if [ ! -d $out_dir ]
then
    mkdir $out_dir
fi

######################
### Building index
###
apptainer exec $HOME/module5/bowtie2:2.5.1--py39h6fed5c7_2.sif bowtie2-build \
--threads $threads $ref $index

Copy this into a shell script, edit and save in your module5 folder.

Again we define an $out_dir under our $SCRATCH, but unlike the software we have used before, bowtie2 will not create this folder if it doesn’t exist. We have to create it ourselves. In the script above we use some more shell coding, an if-statement. Let us have a short look at this, for more details see LinuxCommand.org - Flow Control part 1 .

You should be familiar with the if-else concept from coding in R, python or other languages. Above is the syntax for bash. We start with if and an expression, and the code to (potentially) be executed is between then and fi. The statement [ ! -d $out_dir ] is an expression that tests for the non-existence of the folder $out_dir, resulting in a logical TRUE or FALSE. If TRUE, i.e. $out_dir does not exist, we create it by mkdir $out_dir. In general, the expression [ -d $out_dir ] test if $out_dir exists, and with the exclamation mark [ ! -d $out_dir ] we negate the expression, testing if it does not exist. We could have used an else-branch as well, like this:

if [ -d $out_dir ]
then
  rm $out_dir/*.*
else 
  mkdir $out_dir
fi

This should read: If the folder $out_dir exists, delete all files in it, if not create it. We will probably see similar if-statements again later, and

The command for building the index is bowtie2-build, and we need of course to supply a reference (fasta-file) and give the index some name ($index_name). If you run this now, you will see that in $out_dir several files pop up, but they all have the same prefix. When referring to the index later, we simply use this prefix, i.e. the text in $index.

Let us add some more code to the script to also do the read mapping:

##############
### Settings
###
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
sam_file=$out_dir/illumina_genome.sam

###########################
### Mapping reads to index
###
apptainer exec $HOME/module5/bowtie2:2.5.1--py39h6fed5c7_2.sif bowtie2 \
-q --sensitive --threads $threads --minins 0 --maxins 1000 -x $index -1 $r1 -2 $r2 -S $sam_file

In the command line we use the option --sensitive. You may adjust bowtie2 to be fast or sensitive, see the Help text for more details on this. The options --minins and --maxins sets the outer limits of the inset size or fragment length. Since we have paired-end reads, the reads in a pair will map within these limits if correctly mapped. By default the limits are 0 and 500 bases, but we set the upper limit to 1000 bases here because these data have quite long insert sizes. This mainly affects how bowtie2 reports the aligning, not so much the alignment rate. The output from this is a sam-file, which is a text file on the sam-format, and should have file ending .sam. We will have a look at this below.

Adding this to the script, and running, it should finish within a minute or even less.

Let us briefly inspect the output from this.

2.3 The `samtools` software

The sam-format is a common format for the output from read mapping software. We need a brief insight into this format. A .sam file produced by an aligner software (like bowtie2) is a text file with

A header-section, where all lines starts with a @.
An alignment section consisting of at least 11 tab-separated columns of information. There may be more than 11 columns, but the first 11 columns must always be the same

Have a look at this document describing the sam format. We focus only on the alignment section, described in section 1.4 of that document.

There is a family of software tools made to process such files, and many of them are found in the samtools software package. Let us see how we can use this to compute the read coverage for every single position along the reference genome, based on the output in the $sam_file we specified above. Instead of making a new script for this, we add the following code to the bowtie2 script from above.

First, we need a container for samtools, download and save this as usual. In the Terminal, we run

apptainer exec samtools:1.17--hd87286a_1.sif samtools --help

just to see if samtools is a valid command, and indeed it is. We get listed a set of sub-commands. To compute the read coverage e need to run:

samtools view in order to convert our sam-file to a bam-file. The latter is a compressed version of the sam-file, and required for the next sub-command.
samtools sort of the bam-file. This will re-arrange its content such that aligned reads are sorted by where they align on the reference. This is required by the next sub-command.
samtools depth of the sorted bam-file, to produce a three-columns text file with the read coverage for every single position of the genome. This is our final output.

When running several commands like this, we typically use piping in UNIX. Instead of storing the output from one command in some file, and then feed this as input to the next (we could do that), we pipe the commands together. Here is an example:

##################
### Settings
###
tbl_file=coverage_illumina.txt

###########################
### Processing .sam output
###
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools view -@ $threads -b $sam_file | \
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools sort -@ $threads | \
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools depth -aa > $tbl_file /dev/stdin

Make certain you understand what we do here:

We start by samtools view -b $sam_file in order to take as input the $sam_file and convert it to bam-format. This is then piped into…
…samtools sort, that will sort everything according to position on reference. This is then piped into…
…samtools depth, where we compute the depth (=read coverage) on all positions (option -aa), also those where no reads may have mapped. This is then finally re-directed (>) to a file (named in $tbl_file). The very last /dev/stdin is needed for the last pipe to work.

Add this code to the bowtie2 script, and re-run (sbatch). The file coverage_illumina.txt should pop up in the module5 folder.

2.3.1 Exercise - plot and inspect read coverage

The file coverage_illumina.txt should now contain 3 tab-separated columns, and one row for every position along the genome. Read this file into R (use read_delim()). The first column should be a text indicating the genome sequence, the second column is the position the the genome sequence, and the third column the read coverage. It is probably a good idea to name the columns by using col_names = c("Genome_sequence", "Position", "Coverage") as an argument in read_delim().

This genome has one chromosome and one plasmid, i.e. there are two different genome sequences! Then, for both the chromosome and the plasmid, count how many times we see the various coverages. Use group_by() to group the table rows by their genome sequence and coverage values, and then use summarise(count = n()) to compute the counts. This means you for each genome sequence count how often we observe coverage 0, 1, 2,…etc. Make a bar plot of this (geom_col()). Compare to the theoretical (Poisson) distribution we saw in module 2. What is the read coverage of the chromosome here (approximated from the figure)?

Next, inspect the coverage values again. Are there positions in this genome with read coverage below 10, and even below 1? The latter are parts of the genome that has never been sequenced. How does this compare to the theoretical results around read coverage we looked at in module 2?

2.3.2 Exercise solution

library(tidyverse)

coverage.tbl <- read_delim("coverage_illumina.txt", delim = "\t",
                           col_names = c("Genome_sequence", "Position", "Coverage"))
coverage.tbl %>%
  group_by(Genome_sequence, Coverage) %>%
  summarize(Count = n()) -> illumina_count.tbl

fig1 <- ggplot(illumina_count.tbl) +
  geom_col(aes(x = Coverage, y = Count)) +
  facet_wrap(~Genome_sequence, ncol = 1, scales = "free")
print(fig1)

coverage.tbl %>%
  filter(Coverage < 1) -> not.sequenced.tbl
cat("There are", nrow(not.sequenced.tbl), "positions with read coverage zero\n")
# Inspect the table not.sequenced.tbl to see which Positions these are.

3 Reading software documentation

Let us stop here for a few minutes and have a closer look at how you should start to learn how to use various software on your own. So far you have more or less just copied shell script codes from these documents. In real life you will need to build your scripts from scratch, using software we have yet not seen. To be able to do this, it is a good help in understanding the documentation given by the software.

Here is a short video on how to start reading help documentation for software tools

4 Mapping long reads

Let us now replace the paired-end short reads from Illumina sequencing with the long Nanopore reads. How does this change the problem of read mapping?

The long reference sequence(s) is still the same. However, the reads are now fewer, longer and with a lot more errors. This means the Burrows-Wheeler idea looses much of its power. Instead of looking for many and almost perfect matches, we must align long reads to long references, and expect quite many mismatches in these alignments.

In this situation a popular approach is seed-and-extend. This is actually quite similar to the original idea behind the BLAST algorithm, see BIN210. It means, when comparing two rather long sequences, we first look for shorter regions they both share, and then we try to build an alignment between them by extending around these regions.

One software using this idea is the minimap2 we will use below, and the paper behind it is found here: https://pubmed.ncbi.nlm.nih.gov/29750242/. Download the container for this into your module5 folder in the usual way.

4.0.1 Exercise - Nanopore read coverage

Use the minimap2 to map the reads in $COURSES/BIN310/module4/Nanopore.fastq.gz to the exact same reference genome as we used with bowtie2 above.

Use the help text (from apptainer exec <container> minimap2 -help) to try to create the required command line. Hints:

Copy the bowtie2 script and edit this.
The Usage: line tells you the structure of the command line
Unlike bowtie2 you do not need to build any index here.
We want to have a SAM formatted output file.
Use 10 threads if possible.

Use the same samtools code as in the bowtie2 script to compute read coverages from this sam-file, and output to a file named coverage_nanopore.txt.

Use the R script you made above to plot and inspect the coverages for these data as well, and compare to how it looked like for the Illumina data. It is perhaps best to compare only the coverages on the chromosome, since the plasmid is short and of unknown copy number really. What seem to be the coverage? How about very low coverages?

4.0.2 Exercise solution

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=30G                        # The amount of memory reserved
#SBATCH --partition=smallmem,hugemem             # For < 100GB use smallmem, for >100GB use hugemem
#SBATCH --time=01:00:00                  # Runs for maximum this time
#SBATCH --job-name=minimap2               # Sensible name for the job
#SBATCH --output=minimap2_%j.log          # Logfile output here
#SBATCH --exclude=cn-12

##############
### Settings
###
threads=10
reads=$COURSES/BIN310/module4/Nanopore.fastq.gz
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=$SCRATCH/minimap2
sam_file=$out_dir/nanopore_genome.sam
if [ ! -d $out_dir ]
then
    mkdir $out_dir
fi

######################
### Running software
###
apptainer exec $HOME/module5/minimap2:2.26--he4a0461_1.sif minimap2 \
-x map-ont -t $threads -a $ref $reads > $sam_file

##################
### Settings
###
tbl_file=coverage_nanopore.txt

###########################
### Processing .sam output
###
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools view -@ $threads -b $sam_file | \
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools sort -@ $threads | \
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools depth -aa > $tbl_file /dev/stdin

The -x map-ont indicates the type of input data and what to do with these. The map-ont means map Nanopore data (ONT=Oxford Nanopore Technology). You may also use minimap2 for other types of data, even the short read Illumina data (-x sr). See the Help text to learn more about other data types and what you may do to them.

The option -a produces as .sam file output, which is not default for this software.

Note there is no explicit step for building an index, like we saw for bowtie2. We simply feed both the reference and the reads as input. Note also the redirect > of the output. We could have used the option -o $sam_file instead, but this is an alternative you often see. With no output specified, most software will output to what is called standard out /dev/stdout, and this usually means to the screen. Thus, the redirect > $sam_file will send the output into the specified file instead.

5 Searching for specific genes

Assume we have sequenced and assembled one or several genomes, and we have a set of contigs. In some cases we are interested in finding specific features in these genome(s), e.g. does it contain some antibiotic resistance genes or some other features we may be interested in. Let us have a look at some ways to do this. At the same time we will also learn how to combine R code within our shell code.

5.1 BLAST

In basic bioinformatics we learn about sequence alignments, i.e. how we align two or more sequences in order to compare them. This typically results in some alignment-score indicating how similar or not two sequences are. Perhaps the most widely used tool for sequence alignment is BLAST (Basic Local Alignment Search Tool). We use BLAST to

Search with a sequence we have, the query sequence…
…against some database of (many) sequences…
…to obtain a list of database-sequences with a minimum similarity to our query

BLAST computes pairwise alignment, i.e. our query sequence is matched to one database sequence at a time, their similarity is computed, and this is repeated for many (all) database sequences. Let us again visit Rob Edwards on YouTube, and hear what he has to say about BLAST. Most of this should be familiar stuff from basic bioinformatics, but have a look, if nothing else for the sake of repetition:

5.2 The `blast` software

You have probably all used BLAST on the web (e.g. at https://blast.ncbi.nlm.nih.gov/Blast.cgi). But BLAST can also be installed on your computer, see BLAST+ executables at NCBI for downloading and installing. We have also done this in BIN210.

Why do we want a local version of BLAST? If you have a single sequence, and wonder what this is, you typically use the web-version of BLAST, as mentioned above. However, if you want to compare some (many) sequences to some (many) other sequences you have, the local blast software is handy. You may also download the huge databases from NCBI, and search in these locally. The problem is the size. It takes a long time to download, and requires a lot of memory.

Let us use as example the above mentioned antibiotic resistance genes. We have a selection of known antibiotic resistance genes in the fasta-file $COURSES/BIN310/module5/ABR_genes.fasta. This has been downloaded from the CARD database. In module 4 we assembled some reads into contigs. Let us use the contigs we got from using spades on the Illumina reads. You may use your own, or a version found in $COURSES/BIN310/module5/spades_contigs.fasta.

We want to use the ABR genes as query sequences in a search against the contigs. The genes are all coding genes. Then we should always translate them into protein sequences, and do the search using these. The reason is protein similarity is easier to detect than nucleotide similarity, due to the 20 amino acids and their more detailed scoring regime (recapitulate this from BIN210 if needed).

Since the database (contigs) are nucleotides we need to use the tblastn version of BLAST. This means BLAST will translate the contigs into all six reading frames, and compare to our query proteins. Thus, we look for regions in the contigs having good matches to our query proteins. If we find such regions of some length, it may be an ABR gene, even if some more verification is usually needed.

Let us now

Make a BLAST database from the contigs
Translate the ABR genes into proteins
Use the ABR proteins as queries
Search with all queries against the database, using tblastn
Display some results of this in R

You need two fasta-files for any blast session, one with the database sequences and one with queries. In both cases, sequences are identified by the first token in the Header-line of the fasta-files. The first token is the first ‘word’ in the Header-line, i.e. before the first blank. Thus, it is vitally important that all fasta-files used in any blasting always have Header-lines where the first token is unique to every sequence. Otherwise, you will not be able to distinguish between sequences in the output!

Download a blast container as usual. Note that there are many containers at galaxy with blast in its name! It may be a good idea to search for (CTRL-f) the text ‘blast:2.14’ to get to the newest containers.

Let us build a BLAST database from the contigs. One of the tools included in blast is makeblastdb, which is the one we use to build a BLAST-database from a fasta-file. Here is some shell code for this:

#############
### Settings
##
threads=1
contigs_file=$COURSES/BIN310/module5/spades_contigs.fasta
out_dir=$SCRATCH/blast
dbase=$out_dir/spades_contigs

##########################
### Building the database
###
apptainer exec $HOME/module5/blast:2.14.1--pl5321h6f7f691_0.sif makeblastdb \
-in $contigs_file -dbtype nucl -out $dbase

Copy and edit this code into a proper shell script in your module5 folder (you also need a blast container). Notice we only use 1 thread here. Reserve 10GB of memory and set the wall time to 1 hour.

Then sbatch and inspect the $out_dir. A BLAST database consists of a number of files with different extensions, and the text you gave to $dbase as prefix. As usual, if you want to learn more on how to use makeblastdb, open a terminal window and

apptainer exec <container> makeblastdb -help

The blast software is actually quite liberal with respect to the ordering of the input.

5.3 Running R code inside shell scripts

Our antibiotic resistance genes are in DNA, and we need to translate these into protein sequence before we use them as query in the BLAST search. There are several way of doing this, e.g. a nice tool is seqkit that you will find a container for at galaxy. However, let us use this opportunity to do the translation in R, and then see how we may run our own R code inside a shell script!

Inside a shell script, we run R code using the command Rscript, with this template

module load R/4.3.1
Rscript -e "<put R code here>" $input1 $input2...etc

First, we use the option -e to indicate that the code now follows inside the double quotes "...". Alternatively, you could just put the same code in an ordinary R script and then just replace the -e "<put R code here>" part by the name of that file. If the code is long, this is probably a good idea, since it easier to debug a standard R script, but in this example we will type the code directly in here.

Please note that the R code must be inside the double quotes. This means the R code itself cannot contain double quotes! This is no problem, since single quotes, e.g. 'this is a text', works just as fine as double quotes ("this is as text") in R. Remember to use single quotes in your R code here now!

We then extend our shell script from above with the following code:

#############
### Settings
###
ABR_dna=$COURSES/BIN310/module5/ABR_genes.fasta
ABR_protein=$out_dir/$(basename $ABR_dna)

#####################
### Running R code
###
module load R/4.3.1
Rscript -e "library(microseq)
args <- commandArgs(trailingOnly = T)
in.file <- args[1]
out.file <- args[2]
readFasta(in.file) %>% 
  mutate(Sequence = translate(Sequence)) %>% 
  writeFasta(out.file)" $ABR_dna $ABR_protein

Note how we pass as input the two file names (in correct ordering!). The function commandArgs() in R will allow you to input arguments to R programs. The arguments will here be stored in the object args. Then we copy the first element of this into in.file and the second into out.file. At the end of the code we can see what is delivered as these two input arguments, namely $ABR_dna and $ABR_protein. The content of these is specified previously in the same file, and we could of course have typed this directly into the R code as well. However, it is nice to set the content of such variable one place only. If you later change them, you only need to edit one place, and the rest of the code will still work.

Add this to your shell script from above, and sbatch. Verify that the file in $ABR_protein now indeed is a fasta file with protein sequences (read in into R with readFasta() from microseq).

5.4 The BLAST search and results

Let us do the search against the database. There are many options to all the blast tools, and we will only use some of them here. Here are some lines to be added to the shell script:

##############
### Settings
###
out_file=blast_spades.txt

################
### Searching
###
apptainer exec $HOME/module5/blast:2.14.1--pl5321h6f7f691_0.sif tblastn \
-num_threads $threads -query $ABR_protein -db $dbase -out $out_file -outfmt 6

Add to the script, and sbatch. The $out_file will grow as tblastn fills it with results, it takes some time to complete.

In the tblastn command line we used the option outfmt 6. This tells tblastn (and any other version of blast) to output a table of results. This table-format is short and concise, and very nice for further wrangling in R! There will be one row for each alignment, and 12 columns. These column have no names in the output text file, but if you run tblastn -help you can see them listed. Here is some code to read this into R, and where I have added the column names as specified by the Help text:

blast.tbl <- read_delim("blast_spades.txt", delim = "\t",
                        col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
                                      "gapopen", "qstart", "qend", "sstart", "send",
                                      "evalue", "bitscore"))

The column names are:

Column name	Explanation
qseqid	Query sequence ID, the first token in a fasta header
sseqid	Database (Subject) sequence ID, first token again
pident	Percent identity in alignment
length	Length of alignment
mismatch	Number of mismatches in alignment
gapopen	Number of gap-opens in alignment
qstart	Start of alignment on query sequence
qend	End of alignment on query sequence
sstart	Start of alignment on database sequence
send	End of alignment on database sequence
evalue	The E-value of the alignment
bitscore	The bit-score of the alignment

There are many options we could have used in the search. Here are some of the more important ones:

Options like -word_size, -gapopen, -gapextend, -penalty and -reward are used to change the scoring rules.
Options -evalue and -perc_identity are used to limit output alignment quality.
Option -num_alignment is the upper limit of how many alignments are reported per database sequence.

The option -outfmt can be additionally configured. If we used -outfmt '6 qseqid sseqid evalue bitscore' we would have only got 4 output columns instead of the 12 we get by default. Run tblastn -help (with the proper apptainer code in front) to see all details.

Make the shell script from the codes above (copy and paste, edit container names/paths) and sbatch. Verify the output appears where it should. The file blast_spades.txt will gradually be filled with results as blast is searching, just like we saw with bowtie2 above.

5.4.1 Exercise - ABR genes in genome

Make an R script, read in the results as shown above, and filter the results to keep only alignments with evalue below 1e-5. Add code to answer these questions:

How many alignments with E-value below 1e-5 did tblastn report?
How many unique contigs have matches against some ABR-protein? Hint: distinct() on the sseqid column
How many unique ABR-proteins produced a match against some contig?

In the example above we searched against the contigs from spades. We also have a similar file with contigs from canu, assembled from Nanopore data for the same genome (in $COURSES/BIN310/module5/contigs_canu.fasta). Repeat the tblastn search with these contigs instead (good idea to change the name of $out_file). Do the results change?

5.4.2 Exercise solution

We assume you ran the shell script above, and made output file blast_spades.txt and blast_canu.txt, edit otherwise:

library(tidyverse)

out.file <- "blast_spades.txt"  # results from blasting against spades/illumina contigs
#out.file <- "blast_canu.txt"    # results from blasting against canu/nanopore contigs
blast.tbl <- read_delim(out.file, delim = "\t",
                        col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
                                      "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
  filter(evalue < 1e-05)

cat("Number of unique contigs with matches:", length(unique(blast.tbl$sseqid)), "\n")
cat("Number of unique ABR genes with matches:", length(unique(blast.tbl$qseqid)), "\n")
sum.tbl <- blast.tbl %>%
  group_by(sseqid) %>%
  summarize(n.match = n()) %>%
  arrange(desc(n.match))
head(sum.tbl)

Searching the spades contigs produced in total 9953 alignments. There are hits in 31 different contigs, and in total 1715 distinct ABR genes produced some hit. When searching the canu contigs there are 5414 alignments, all in 1 (big) contig, and 1224 distinct ABR genes had hits this time.

We have too many hits to believe that all of these are real. It cannot be that many ABR genes in this genome! First, we must remember that many of these ABR genes are very similar to each other, and several different ABR genes will probably align at the same region in the contigs. A BLAST search like this will list them all. Thus, we should actually look for distinct regions in the contigs who have hits against some ABR genes. Second, it may also be the E-value threshold is too liberal here, and we get hits to many regions who are not really ABR genes, just something similar.

The difference in the number of hits between spades-contigs and canu-contigs is also something to reflect upon. Remember, when we used quast we found both sets of contigs gave excellent reconstructions of the reference genome, close to 100% identity. Still, this smallish differences result in a rather large difference in hits here. One possible explanation is that the canu contigs, being assembled from error prone Nanopore reads, have more errors in them and this results in fewer hits. On the other hand, we may also think that the spades contigs, being assembled from short Illumina reads, result in the same region being (wrongly) assembled into different contigs, producing more hits.

5.4.3 Exercise - refine the results

Let us dig a little deeper into the results above.

Read the results for both spades_contigs.fasta and canu_contigs.fasta into R as tables. Arrange all the alignments in each table by bitscore with the largest at the top. The bitscore is the best criterion for ranking the alignments since a larger score means a longer alignment with fewer mistmatches/gaps. Inspect the tables. How good, in terms of bitscore, are the best alignments? What might this tell us about Illumina versus Nanopore data?

Next, let us ‘prune’ each table as follows: Starting at the top, eliminate alignment number i if it overlaps with an alignment with a larger bitscore. The information you need is in the columns sseqid, sstart and send. The sseqid is the text identifying each contig. The sstartand send are the start and end positions of the alignment on this contig. Here we must be aware of this:

In all BLAST outputs, if a start value is smaller than the corresponding end value, the alignment is on the positive strand
If the start is larger than the corresponding end, the alignment is on the negative strand

It may be a good idea to first create a new column, strand, keeping this information.

It may also be a good idea to make another column keep which is TRUE for all alignment initially. This we change to FALSE for those alignment we want to discard due to overlap. Then, looping over alignment 2,3,…etc you can test if the alignment is in the same contig, same strand, and overlaps with any of the previously kept alignments. After having done this, filter away alignments where keep is FALSE. This is the ‘pruned’ table. How many alignments are you left with in the two cases?

5.4.4 Exercise solution

Here is some suggested code for the results from blasting against spades_contigs.fasta, results stored in blast_spades.txt:

library(tidyverse)

spades.tbl <- read_delim("blast_spades.txt", delim = "\t",
                         col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
                                       "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
  filter(evalue < 1e-05) %>%
  mutate(strand = if_else(sstart < send, "positive", "negative")) %>%
  mutate(keep = T) %>%
  arrange(desc(bitscore))
for(i in 2:nrow(spades.tbl)){
  tbb0 <- slice(spades.tbl, i)
  tbb <- spades.tbl %>%
    slice(1:(i-1)) %>%
    filter(keep) %>%
    filter(sseqid == tbb0$sseqid & strand == tbb0$strand)
  if(nrow(tbb) > 0){
    n1 <- sum((tbb0$strand == "positive") & (tbb0$sstart <= tbb$send) & (tbb0$send >= tbb$sstart))
    n2 <- sum((tbb0$strand == "negative") & (tbb0$sstart >= tbb$send) & (tbb0$send <= tbb$sstart))
    if(n1+n2 > 0){
      spades.tbl$keep[i] <- F
    }
  }
}
spades_pruned.tbl <- spades.tbl %>%
  filter(keep)

Identical code can be used for the canu results by replacing "spades" by "canu" (assuming the results are in blast_canu.txt). This may be solved in many ways, you may have better solutions.

This results in 449 alignments for spades and 527 for canu (number of rows in ‘pruned’ tables). The top ABR genes are more or less the same in both cases, but they get much larger bitscore values for the spades contigs. Even if Nanopore data give us good assemblies (with canu) in terms of few and long contigs, they seems to contain so much more (sequencing) errors that finding particular genes is not as easy as for the Illumina based contigs (with spades).

6 Searching with reads

From the exercise above we notice that the number of hits to ABR genes depends on the contigs we searched, i.e. contigs from Illumina + spades produce different results than contigs from Nanopore + canu. This is unfortunate. Could we simply avoid the entire assembly? The answer is yes, we may search for the ABR genes directly among the reads.

6.1 The `diamond` software

The diamond software is designed to do a search similar to BLAST, but coded in a way that makes it useful for searching directly with the reads. The search is reversed compared to what we did above:

We make a database from the protein sequences we are interested in
And then use every read as a query

Notice that diamond can only search against proteins, while BLAST can be used for all kinds of sequences. In principle, we could also do a search like this with BLAST, using blastx, but it will be slow. The diamond is an implementation that speeds up the search in such cases, but the results we get are in principle the same as if we used BLAST.

Let us now make the shell script where we

Make a diamond database from the protein sequences we translated above
Search with the Illumina reads
Inspect and clean the results in R

Here is some code you put into a shell script in your module5 folder:

##############
### Settings
###
threads=10
ABR_proteins=$SCRATCH/blast/ABR_genes.fasta
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
out_dir=$SCRATCH/diamond
dbase=$out_dir/ABR_proteins
out_file1=diamond_results_R1.txt
out_file2=diamond_results_R2.txt
if [ ! -d $out_dir ]
then
  mkdir $out_dir
fi

#####################
### Making database
###
apptainer exec $HOME/module5/diamond:2.1.8--h43eeafb_0.sif diamond makedb \
--threads $threads --in $ABR_proteins --db $dbase

###########################
### Searching with R1 file
###
apptainer exec $HOME/module5/diamond:2.1.8--h43eeafb_0.sif diamond blastx \
--threads $threads --db $dbase --query $r1 --out $out_file1 --outfmt 6

###########################
### Searching with R2 file
###
apptainer exec $HOME/module5/diamond:2.1.8--h43eeafb_0.sif diamond blastx \
--threads $threads --db $dbase --query $r2 --out $out_file2 --outfmt 6

Add the proper heading, use the same memory etc as for blast, but notice we now use 10 threads. You of course also need a diamond container file, download as usual.

We make use of the fasta file with proteins we made by our R script above. Note we use the command blastx for searching here, since the queries are DNA, but the database is protein. This is exactly the same we would have done with blast. Also note we have the same outfmt 6 as for blast. Look up the Help text for diamond to see all options and possibilities. We also make separate searches for the R1 and R2 files, and get separate result files.

Let us read the results into R:

library(tidyverse)

diamond.tbl <- read_delim("diamond_results_R1.txt", delim = "\t",
                          col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
                                        "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>% 
  filter(evalue < 1e-05)
diamond.tbl <- read_delim("diamond_results_R2.txt", delim = "\t",
                          col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
                                        "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>% 
  filter(evalue < 1e-05) %>% 
  bind_rows(diamond.tbl)

Here we again filtered by evalue, but still the diamond.tbl has many rows! Remember now we have searched with hundreds of thousands of reads, and many reads will match the same protein. It makes sense to immediately filter by E-value, and we could even be stricter than 1e-05, perhaps 1e-10?

6.1.1 Exercise - diamond results

Make an R script using the code above to read the results.

In these results you will find that one and the same read may match several ABR proteins. This is not unexpected, since we presume many of the ABR proteins are similar in some ways. Filter the table such that only the best alignment for each read is kept, i.e. the one with the largest bitscore. How many alignments (= reads) are left?

Next, for each ABR gene (sseqid) compute the readcount, i.e. how many reads aligns to it. Hint: Use group_by() and then summarise(readcount = n()) and store this as readcount.tbl.

You should now have a readcount value for each of the ABR genes. But, the genes are of different lengths and a longer gene will naturally be matched by more reads. We should actually compute the read coverage for each gene. We know all reads are 301 bases long. We need the length of each gene.

Read the fasta file with the ABR genes. Compute the length of each sequence. If you read the protein file, remember to multiply the length by 3 since we want length as the number of bases. Also, in order to join this to the readcount.tbl we need the first token in the Header for every gene. Use the word() function to find this and put it into a new column named sseqid. Then, join this table to the readcount.tbl by sseqid. Finally, compute the coverage for each gene. As a first approximation, use the same formula as we used for genomes, but where now each gene is the ‘genome’.

Arrange the table by coverage, and make a histogram of all coverage values. We know from earlier that the read coverages should be around 60-70, but with a rather large variation (see exercise above).

How many ABR genes have coverage in the range 60-100? How about those genes with a very high coverage, what could these be?

6.1.2 Exercise solution

library(tidyverse)
library(microseq)

diamond.tbl <- read_delim("diamond_results_R1.txt", delim = "\t",
                          col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
                                        "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
  filter(evalue < 1e-10)
diamond.tbl <- read_delim("diamond_results_R2.txt", delim = "\t",
                          col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
                                        "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
  filter(evalue < 1e-10) %>%
  bind_rows(diamond.tbl)
diamond.tbl <- diamond.tbl %>%   # only best match for every read
  group_by(qseqid) %>%
  slice_max(bitscore)
readcount.tbl <- diamond.tbl %>%  # the readcounts for each ABR gene
  group_by(sseqid) %>%
  summarise(readcount = n()) %>%
  arrange(desc(readcount))
abr.tbl <- readFasta("/mnt/courses/BIN310/module5/ABR_genes.fasta") %>%  # the actual genes, and their lengths
  mutate(sseqid = word(Header, 1)) %>%
  mutate(length = str_length(Sequence)) %>%
  select(sseqid, length)
coverage.tbl <- readcount.tbl %>%             # the coverages
  left_join(abr.tbl, by = "sseqid") %>%       # left_join from readcount.tbl
  mutate(coverage = readcount * 301 / length)
ggplot(coverage.tbl) +
  geom_histogram(aes(x = coverage))

How can some genes have a very high coverage here? We match reads to ABR genes, and for each read we only keep the best alignment. Still we observe coverages above 200 here. Could it be that these genes are on a plasmid? Plasmids tend to come in several copies, and thus more reads. Or it could be the gene occurs multiple times in the chromosome itself. Genes that are often transferred between bacteria may end up in more than one copy in a genome.

BIN310 module 5 - Alignments

Lars Snipen

2023

1 Learning goals

1.1 Software used in this module

2 Mapping short reads

2.1 The Burrows-Wheeler transform

2.2 The `bowtie2` software

2.3 The `samtools` software

2.3.1 Exercise - plot and inspect read coverage

2.3.2 Exercise solution

3 Reading software documentation

4 Mapping long reads

4.0.1 Exercise - Nanopore read coverage

4.0.2 Exercise solution

5 Searching for specific genes

5.1 BLAST

5.2 The `blast` software

5.3 Running R code inside shell scripts

5.4 The BLAST search and results

5.4.1 Exercise - ABR genes in genome

5.4.2 Exercise solution

5.4.3 Exercise - refine the results

5.4.4 Exercise solution

6 Searching with reads

6.1 The `diamond` software

6.1.1 Exercise - diamond results

6.1.2 Exercise solution

BIN310 module 5 - Alignments

Lars Snipen

2023

1 Learning goals

1.1 Software used in this module

2 Mapping short reads

2.1 The Burrows-Wheeler transform

2.2 The bowtie2 software

2.3 The samtools software

2.3.1 Exercise - plot and inspect read coverage

2.3.2 Exercise solution

3 Reading software documentation

4 Mapping long reads

4.0.1 Exercise - Nanopore read coverage

4.0.2 Exercise solution

5 Searching for specific genes

5.1 BLAST

5.2 The blast software

5.3 Running R code inside shell scripts

5.4 The BLAST search and results

5.4.1 Exercise - ABR genes in genome

5.4.2 Exercise solution

5.4.3 Exercise - refine the results

5.4.4 Exercise solution

6 Searching with reads

6.1 The diamond software

6.1.1 Exercise - diamond results

6.1.2 Exercise solution

2.2 The `bowtie2` software

2.3 The `samtools` software

5.2 The `blast` software

6.1 The `diamond` software