Sequence alignments should be a familiar topic from introductory bioinformatics. Most people have used BLAST to search with some sequence against some database. In this module we will have a look at various forms of alignment methods related to microbial genomics, and how we can use them on our HPC.
In this module we first focus on the mapping of reads to some reference genome. The aim is to understand the basics of such read mapping, and to process the .sam files we get from it in order to compute read coverages. We also find room in this module to look at how we can search for specific features (genes) in our assembled contigs, or by using the reads directly. This is useful whenever you are looking for something specific in your sequenced genomes.
Mapping reads to some reference means we take each sequenced read and try to find a place on some reference sequence where it aligns. Typically, the reference sequence is a genome already assembled from some other data, while the reads have been sequenced from a new genome, being very similar to the reference. We may then find small differences, like single nucleotide variants, between the new genome and the reference.
Another example of the use of read mapping is when we use our assembled contigs as the reference, and map the same reads to these to get the exact read coverage for each contig. This may be used to evaluate the contigs. If we have sequenced a single genome, we expect all contigs to have more or less the same read coverage.
Thus, all kinds of read mapping require some reference. Note that a reference does not have to be a full genome. We may only be interested in some smaller part of the genome(s), even a single gene.
Read mapping to some reference genome is perhaps not as commonly used for microbes as it is in eukaryotic genomics. The reason is the difference in genomes. When sequencing the genome of a fish, plant or animal, one individual's genome is extremely similar to those of all other individuals of the same species. For humans we have the human reference genome, and all persons' genomes are extremely similar to this. Using this as the reference, we expect that virtually all reads from a new person will match somewhere, and most of them match perfectly. In some places we may find Single Nucleotide Variants (SNVs) or other small genotype deviations from the reference that may be potentially important for the phenotype of the new person. In eukaryotes 'all' individuals from the same species share the same genes, and the differences in phenotype are largely due to small deviations in the regulatory parts of the genome. This is true even across closely related species. Read mapping makes sense since the 'average genome' will always be a very good reference for any single genome.
In the bacterial world, however, the strains are the 'individuals', and even between strains of the same species there may be huge genotype differences. As an example, between two strains of E. coli, as much as 20% of the genes present in one strain may be completely lacking in the other! Thus, finding a proper reference to map against is not as straightforward, since other genomes of the same species may be quite different from the one we have sequenced. Also, one may question the relevance of the small SNVs we may find if there are at the same time huge differences involving entire genes.
Anyway, we should be able to do some read mapping, i.e. aligning reads to (pieces of) genomes, and understand some of the basics behind it.
Assume we have done some Illumina sequencing, and we have a set of many short and precise reads. We would like to align these to some reference, i.e. find where they match.
You should be familiar with alignments from a basic course in bioinformatics, e.g. by the use of dynamic programming or BLAST. Aligning hundreds of thousands of short reads to a long reference sequence sounds like an enormous computing job even for BLAST. It is, and if you try to use BLAST for this, it will take quite some time. BLAST will work, but it is slow. Also, with short and precise reads we can be stricter than BLAST: BLAST looks for local similarities, but here we expect to see more or less the entire read align to the reference.
BLAST is not optimized to handle this special situation. BLAST is an excellent tool for searching with a protein sequence in a database of proteins (as we will see later in this module). The problem we are faced with here is different: a huge number of short queries against one long reference, where we expect near-perfect matches.
In the following we will focus on exact matching. The Burrows-Wheeler transform and FM-index are best explained in this setting, but in practice the approach has been extended to cope with a small number of mismatches as well. If a read has many mismatches it will not align with this approach, but this is actually fine. Such reads are probably full of errors, and cannot be trusted anyway.
Ben Langmead, whom we saw in some previous videos, has also written some texts for his online course. Here is one such text, which is in my eyes a very good introduction to the Burrows-Wheeler transform and the FM-index. You should take some time reading through this on your own. Here is my video lecture on this topic, building on what Ben Langmead writes.
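To make this slightly more concrete, here is a toy R sketch of the Burrows-Wheeler transform itself, computed naively by sorting all rotations of a string. This is only an illustration of the concept; real aligners compute and use the transform very differently, and the function name bwt is just made up here:

# Naive Burrows-Wheeler transform: form all rotations of the string,
# sort them, and keep the last character of each sorted rotation
bwt <- function(s) {
  s <- paste0(s, "$")   # unique end-of-string symbol, sorts before the letters
  n <- nchar(s)
  rotations <- sapply(1:n, function(i) paste0(substr(s, i, n), substr(s, 1, i - 1)))
  paste(substr(sort(rotations, method = "radix"), n, n), collapse = "")
}
bwt("GATTACA")   # should return "ACTGA$TA"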
The bowtie2 software

Let us use the bowtie2 software to do some read mapping.
In module 4 we assembled reads, and used quast
to
compare the contigs to a reference genome. This reference genome should
in this particular case actually be identical to the genome that was
sequenced, not just some near relative. Let us now map the reads
directly to this genome. We will then count how many reads cover each
position in the genome, i.e. the read coverage at each single position.
Remember from a previous module how read coverage should, in an ideal
situation with no sequencing errors etc, be a Poisson variable. Let us
now see to what extent this is the case here.
First, download the newest container for bowtie2
from
galaxy, and save in your module5
folder. NB! It is
not the bioconductor-container (that may pop up on the galaxy
website) you need, but one named only bowtie2
plus some
version numbering.
We need to create the database or index containing the reference. Let us start building the shell script:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=30G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=bowtie2 # Sensible name for the job
#SBATCH --output=bowtie2_%j.log # Logfile output here
##############
### Settings
###
threads=10
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=$SCRATCH/bowtie2
index=$out_dir/bowtie2_index
####################################
### Testing if $out_dir exists
### If no, create it
###
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
######################
### Building index
###
apptainer exec $HOME/module5/bowtie2:2.5.1--py39h6fed5c7_2.sif bowtie2-build \
--threads $threads $ref $index
Copy this into a shell script, edit and save in your
module5
folder.
Again we define an $out_dir
under our
$SCRATCH
, but unlike the software we have used before,
bowtie2
will not create this folder if it doesn’t exist. We
have to create it ourselves. In the script above we use some more shell
coding, an if-statement. Let us have a short look at this, for more
details see LinuxCommand.org - Flow Control part 1 .
You should be familiar with the if-else concept from coding in R,
python or other languages. Above is the syntax for bash
. We
start with if
and an expression, and the code to
(potentially) be executed is between then
and
fi
. The statement [ ! -d $out_dir ]
is an
expression that tests for the non-existence of the folder
$out_dir
, resulting in a logical TRUE or FALSE. If TRUE,
i.e. $out_dir
does not exist, we create it by
mkdir $out_dir
. In general, the expression
[ -d $out_dir ]
tests if $out_dir
exists, and
with the exclamation mark [ ! -d $out_dir ]
we negate the
expression, testing if it does not exist. We could have used an
else-branch as well, like this:
if [ -d $out_dir ]
then
rm $out_dir/*.*
else
mkdir $out_dir
fi
This should read: if the folder $out_dir exists, delete all files in it; if not, create it. We will probably see similar if-statements again later.
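You can test this behaviour directly in the Terminal, e.g. with a folder name that (presumably) does not exist:

if [ ! -d /tmp/no_such_folder ]
then
  echo "the folder does not exist"
fi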
The command for building the index is bowtie2-build, and we of course need to supply a reference (fasta-file) and give the index some name ($index). If you run this now, you will see that several files pop up in $out_dir, but they all have the same prefix. When referring to the index later, we simply use this prefix, i.e. the text in $index.
Let us add some more code to the script to also do the read mapping:
##############
### Settings
###
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
sam_file=$out_dir/illumina_genome.sam
###########################
### Mapping reads to index
###
apptainer exec $HOME/module5/bowtie2:2.5.1--py39h6fed5c7_2.sif bowtie2 \
-q --sensitive --threads $threads --minins 0 --maxins 1000 -x $index -1 $r1 -2 $r2 -S $sam_file
In the command line we use the option --sensitive. You may adjust bowtie2 to be fast or sensitive, see the Help text for more details on this. The options --minins and --maxins set the outer limits of the insert size, or fragment length. Since we have paired-end reads, the reads in a pair should map within these limits if correctly mapped. By default the limits are 0 and 500 bases, but we set the upper limit to 1000 bases here because these data have quite long insert sizes. This mainly affects how bowtie2 reports the alignments, not so much the alignment rate. The output from this is a sam-file, which is a text file in the sam format, and should have the file ending .sam. We will have a look at this below.
Add this to the script and run it; it should finish within a minute or even less.
The samtools software

The sam format is a common format for the output from read mapping software, and we need a brief insight into it. A .sam file produced by an aligner (like bowtie2) is a text file with a number of header lines, all starting with @, followed by one line for each aligned read. Have a look at this document describing the sam format. We focus only on the alignment section, described in section 1.4 of that document.
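If you are curious, you can also peek at the sam-file directly in the Terminal once it has been made, e.g. like this (assuming $sam_file holds the path used in the script above):

grep '^@' $sam_file | head            # some header lines, all starting with @
grep -v '^@' $sam_file | head -n 3    # the first few alignment lines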
There is a family of software tools made to process such files, and
many of them are found in the samtools
software package.
Let us see how we can use this to compute the read coverage for every
single position along the reference genome, based on the output in the
$sam_file
we specified above. Instead of making a new
script for this, we add the following code to the bowtie2
script from above.
First, we need a container for samtools
, download and
save this as usual. In the Terminal, we run
apptainer exec samtools:1.17--hd87286a_1.sif samtools --help
just to see if samtools
is a valid command, and indeed
it is. We get listed a set of sub-commands. To compute the read coverage we need to run:

- samtools view, in order to convert our sam-file to a bam-file. The latter is a compressed version of the sam-file, and required by the next sub-command.
- samtools sort of the bam-file. This will re-arrange its content such that aligned reads are sorted by where they align on the reference. This is required by the next sub-command.
- samtools depth of the sorted bam-file, to produce a three-column text file with the read coverage for every single position of the genome. This is our final output.

When running several commands like this, we typically use piping in UNIX. Instead of storing the output from one command in some file, and then feeding this as input to the next (we could do that), we pipe the commands together. Here is an example:
##################
### Settings
###
tbl_file=coverage_illumina.txt
###########################
### Processing .sam output
###
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools view -@ $threads -b $sam_file | \
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools sort -@ $threads | \
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools depth -aa > $tbl_file /dev/stdin
Make certain you understand what we do here:

- First, samtools view -b $sam_file takes the $sam_file as input and converts it to bam-format (option -b). This is then piped into…
- …samtools sort, which will sort everything according to position on the reference. This is then piped into…
- …samtools depth, where we compute the depth (= read coverage) at all positions (option -aa), also those where no reads have mapped. This is finally re-directed (>) to a file (named in $tbl_file). The very last /dev/stdin is needed for the last pipe to work, telling samtools depth to read from standard input.

Add this code to the bowtie2 script, and re-run (sbatch). The file coverage_illumina.txt should pop up in the module5 folder.
The file coverage_illumina.txt should now contain 3 tab-separated columns, and one row for every position along the genome. Read this file into R (use read_delim()). The first column should be a text indicating the genome sequence, the second column is the position in the genome sequence, and the third column the read coverage. It is probably a good idea to name the columns by using col_names = c("Genome_sequence", "Position", "Coverage") as an argument in read_delim().
This genome has one chromosome and one plasmid, i.e. there are two
different genome sequences! Then, for both the chromosome and the
plasmid, count how many times we see the various coverages. Use
group_by()
to group the table rows by their genome sequence
and coverage values, and then use summarise(count = n())
to
compute the counts. This means that for each genome sequence you count how often we observe coverage 0, 1, 2, …etc. Make a bar plot of this
(geom_col()
). Compare to the theoretical (Poisson)
distribution we saw in module 2. What is the read coverage of the
chromosome here (approximated from the figure)?
Next, inspect the coverage values again. Are there positions in this genome with read coverage below 10, and even below 1? The latter are parts of the genome that have never been sequenced. How does this compare to the theoretical results on read coverage we looked at in module 2?
library(tidyverse)
coverage.tbl <- read_delim("coverage_illumina.txt", delim = "\t",
col_names = c("Genome_sequence", "Position", "Coverage"))
coverage.tbl %>%
group_by(Genome_sequence, Coverage) %>%
summarize(Count = n()) -> illumina_count.tbl
fig1 <- ggplot(illumina_count.tbl) +
geom_col(aes(x = Coverage, y = Count)) +
facet_wrap(~Genome_sequence, ncol = 1, scales = "free")
print(fig1)
coverage.tbl %>%
filter(Coverage < 1) -> not.sequenced.tbl
cat("There are", nrow(not.sequenced.tbl), "positions with read coverage zero\n")
# Inspect the table not.sequenced.tbl to see which Positions these are.
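Here is also a sketch of how you might overlay the theoretical Poisson distribution on the chromosome counts. It assumes the chromosome is the genome sequence with the most positions, and simply uses the mean coverage as the Poisson parameter; adapt as needed:

# Identify the chromosome as the genome sequence with the most positions
chrom.id <- coverage.tbl %>%
  count(Genome_sequence) %>%
  slice_max(n) %>%
  pull(Genome_sequence)
chrom.tbl <- coverage.tbl %>%
  filter(Genome_sequence == chrom.id)
lambda.hat <- mean(chrom.tbl$Coverage)    # estimated Poisson parameter
pois.tbl <- tibble(Coverage = 0:max(chrom.tbl$Coverage),
                   Count = dpois(Coverage, lambda.hat) * nrow(chrom.tbl))
fig2 <- ggplot(count(chrom.tbl, Coverage, name = "Count")) +
  geom_col(aes(x = Coverage, y = Count)) +
  geom_line(data = pois.tbl, aes(x = Coverage, y = Count), color = "red")
print(fig2)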
Let us stop here for a few minutes and have a closer look at how you can start learning to use various software on your own. So far you have more or less just copied shell script code from these documents. In real life you will need to build your scripts from scratch, using software we have not yet seen. To be able to do this, it helps a lot to understand the documentation that comes with the software.
Let us now replace the paired-end short reads from Illumina sequencing with the long Nanopore reads. How does this change the problem of read mapping?
The long reference sequence(s) is still the same. However, the reads are now fewer, longer and with a lot more errors. This means the Burrows-Wheeler idea loses much of its power. Instead of looking for many and almost perfect matches, we must align long reads to long references, and expect quite a few mismatches in these alignments.
In this situation a popular approach is seed-and-extend. This is actually quite similar to the original idea behind the BLAST algorithm, see BIN210. It means, when comparing two rather long sequences, we first look for shorter regions they both share, and then we try to build an alignment between them by extending around these regions.
One software tool using this idea is minimap2, which we will use below, and the paper behind it is found here: https://pubmed.ncbi.nlm.nih.gov/29750242/. Download the
container for this into your module5
folder in the usual
way.
Use the minimap2
to map the reads in
$COURSES/BIN310/module4/Nanopore.fastq.gz
to the exact same
reference genome as we used with bowtie2
above.
Use the help text (from
apptainer exec <container> minimap2 -help
) to try to
create the required command line. Hints:

- Start with a copy of the bowtie2 script and edit this.
- The Usage: line tells you the structure of the command line.
- Unlike bowtie2, you do not need to build any index here.

Use the same samtools code as in the bowtie2 script to compute read coverages from this sam-file, and output to a file named coverage_nanopore.txt.
Use the R script you made above to plot and inspect the coverages for these data as well, and compare to how it looked for the Illumina data. It is perhaps best to compare only the coverages on the chromosome, since the plasmid is short and really of unknown copy number. What seems to be the coverage? How about very low coverages?
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=30G # The amount of memory reserved
#SBATCH --partition=smallmem,hugemem # For < 100GB use smallmem, for >100GB use hugemem
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=minimap2 # Sensible name for the job
#SBATCH --output=minimap2_%j.log # Logfile output here
#SBATCH --exclude=cn-12
##############
### Settings
###
threads=10
reads=$COURSES/BIN310/module4/Nanopore.fastq.gz
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=$SCRATCH/minimap2
sam_file=$out_dir/nanopore_genome.sam
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
######################
### Running software
###
apptainer exec $HOME/module5/minimap2:2.26--he4a0461_1.sif minimap2 \
-x map-ont -t $threads -a $ref $reads > $sam_file
##################
### Settings
###
tbl_file=coverage_nanopore.txt
###########################
### Processing .sam output
###
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools view -@ $threads -b $sam_file | \
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools sort -@ $threads | \
apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools depth -aa > $tbl_file /dev/stdin
The -x map-ont
indicates the type of input data and what
to do with these. The map-ont
means map Nanopore data
(ONT=Oxford Nanopore Technology). You may also use minimap2
for other types of data, even the short read Illumina data
(-x sr
). See the Help text to learn more about other data
types and what you may do to them.
The option -a produces .sam file output, which is not the default for this software.
Note there is no explicit step for building an index, like we saw for
bowtie2
. We simply feed both the reference and the reads as
input. Note also the redirect >
of the output. We could
have used the option -o $sam_file
instead, but this is an
alternative you often see. With no output specified, most software will
output to what is called standard out /dev/stdout
, and this
usually means to the screen. Thus, the redirect
> $sam_file
will send the output into the specified file
instead.
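As a small illustration, the following two command lines should give the same result (the file names are placeholders, and the apptainer part is omitted for brevity):

minimap2 -x map-ont -a ref.fasta reads.fastq.gz > out.sam     # redirecting standard out
minimap2 -x map-ont -a -o out.sam ref.fasta reads.fastq.gz    # using the output option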
Assume we have sequenced and assembled one or several genomes, and we have a set of contigs. In some cases we are interested in finding specific features in these genome(s), e.g. whether they contain some antibiotic resistance genes or other features we may be interested in. Let us have a look at some ways to do this. At the same time we will also learn how to combine R code with our shell code.
In basic bioinformatics we learn about sequence alignments, i.e. how we align two or more sequences in order to compare them. This typically results in some alignment score indicating how similar or not two sequences are. Perhaps the most widely used tool for sequence alignment is BLAST (Basic Local Alignment Search Tool). We use BLAST to search with one or more query sequences against a database of sequences, looking for good local alignments.
BLAST computes pairwise alignment, i.e. our query sequence is matched to one database sequence at a time, their similarity is computed, and this is repeated for many (all) database sequences. Let us again visit Rob Edwards on YouTube, and hear what he has to say about BLAST. Most of this should be familiar stuff from basic bioinformatics, but have a look, if nothing else for the sake of repetition:
The blast software

You have probably all used BLAST on the web (e.g. at https://blast.ncbi.nlm.nih.gov/Blast.cgi). But BLAST can also be installed on your computer, see BLAST+ executables at NCBI for downloading and installing. We have also done this in BIN210.
Why do we want a local version of BLAST? If you have a single
sequence, and wonder what this is, you typically use the web-version of
BLAST, as mentioned above. However, if you want to compare some (many)
sequences to some (many) other sequences you have, the local
blast
software is handy. You may also download the
huge databases from NCBI, and search in these locally. The problem is their size: they take a long time to download, and require a lot of disk space.
Let us use as example the above mentioned antibiotic resistance
genes. We have a selection of known antibiotic resistance genes in the
fasta-file $COURSES/BIN310/module5/ABR_genes.fasta
. This
has been downloaded from the
CARD database. In module 4 we assembled some reads into contigs. Let
us use the contigs we got from using spades
on the Illumina
reads. You may use your own, or a version found in
$COURSES/BIN310/module5/spades_contigs.fasta
.
We want to use the ABR genes as query sequences in a search against the contigs. The genes are all coding genes, and then we should always translate them into protein sequences, and do the search using these. The reason is that protein similarity is easier to detect than nucleotide similarity, due to the 20 amino acids and their more detailed scoring regime (recapitulate this from BIN210 if needed).
Since the database (the contigs) consists of nucleotides, we need to use the
tblastn
version of BLAST. This means BLAST will translate
the contigs into all six reading frames, and compare to our query
proteins. Thus, we look for regions in the contigs having good
matches to our query proteins. If we find such regions of some length,
it may be an ABR gene, even if some more verification is usually
needed.
Let us now run tblastn.
You need two fasta-files for any blast
session, one with
the database sequences and one with queries. In both cases, sequences
are identified by the first token in the Header-line of the
fasta-files. The first token is the first ‘word’ in the Header-line,
i.e. before the first blank. Thus, it is vitally important that
all fasta-files used in any blast search always have
Header-lines where the first token is unique to every sequence.
Otherwise, you will not be able to distinguish between sequences in the
output!
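A quick way to check this for a fasta file in the Terminal is to list any duplicated first tokens; if the command prints nothing, all first tokens are unique (the file name is just an example):

grep '^>' spades_contigs.fasta | awk '{print $1}' | sort | uniq -d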
Download a blast
container as usual. Note that there are
many containers at galaxy with blast
in its name! It may be
a good idea to search for (CTRL-f) the text ‘blast:2.14’ to get to the
newest containers.
Let us build a BLAST database from the contigs. One of the tools
included in blast
is makeblastdb
, which is the
one we use to build a BLAST-database from a fasta-file. Here is some
shell code for this:
#############
### Settings
##
threads=1
contigs_file=$COURSES/BIN310/module5/spades_contigs.fasta
out_dir=$SCRATCH/blast
dbase=$out_dir/spades_contigs
##########################
### Building the database
###
apptainer exec $HOME/module5/blast:2.14.1--pl5321h6f7f691_0.sif makeblastdb \
-in $contigs_file -dbtype nucl -out $dbase
Copy and edit this code into a proper shell script in your
module5
folder (you also need a blast
container). Notice we only use 1 thread here. Reserve 10GB of memory and
set the wall time to 1 hour.
Then sbatch
and inspect the $out_dir
. A
BLAST database consists of a number of files with different extensions,
and the text you gave to $dbase
as prefix. As usual, if you
want to learn more on how to use makeblastdb
, open a
terminal window and
apptainer exec <container> makeblastdb -help
The blast software is actually quite liberal with respect to the order in which you supply the inputs on the command line.
Our antibiotic resistance genes are in DNA, and we need to translate
these into protein sequence before we use them as query in the BLAST
search. There are several ways of doing this, e.g. a nice tool is seqkit, for which you will find a container at galaxy.
However, let us use this opportunity to do the translation in R, and
then see how we may run our own R code inside a shell script!
Inside a shell script, we run R code using the command
Rscript
, with this template
module load R/4.3.1
Rscript -e "<put R code here>" $input1 $input2...etc
First, we use the option -e
to indicate that the code
now follows inside the double quotes "..."
. Alternatively,
you could just put the same code in an ordinary R script and then just
replace the -e "<put R code here>"
part by the name
of that file. If the code is long, this is probably a good idea, since
it is easier to debug a standard R script, but in this example we will type
the code directly in here.
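As a minimal first test of this mechanism, you could try something like this in the Terminal:

module load R/4.3.1
Rscript -e "cat('Hello from R\n')"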
Please note that the R code must be inside the double quotes. This means the R code itself cannot contain double quotes! This is no problem, since single quotes, e.g. 'this is a text', work just as well as double quotes ("this is a text") in R. Remember to use single quotes in your R code here!
We then extend our shell script from above with the following code:
#############
### Settings
###
ABR_dna=$COURSES/BIN310/module5/ABR_genes.fasta
ABR_protein=$out_dir/$(basename $ABR_dna)
#####################
### Running R code
###
module load R/4.3.1
Rscript -e "library(microseq)
args <- commandArgs(trailingOnly = T)
in.file <- args[1]
out.file <- args[2]
readFasta(in.file) %>%
mutate(Sequence = translate(Sequence)) %>%
writeFasta(out.file)" $ABR_dna $ABR_protein
Note how we pass as input the two file names (in the correct order!).
The function commandArgs()
in R will allow you to input
arguments to R programs. The arguments will here be stored in the object
args
. Then we copy the first element of this into
in.file
and the second into out.file
. At the
end of the code we can see what is delivered as these two input
arguments, namely $ABR_dna
and $ABR_protein
.
The content of these is specified previously in the same file, and we
could of course have typed this directly into the R code as well.
However, it is nice to set the content of such variables in one place only.
If you later change them, you only need to edit one place, and the rest
of the code will still work.
Add this to your shell script from above, and sbatch
.
Verify that the file in $ABR_protein
now indeed is a fasta
file with protein sequences (read it into R with
readFasta()
from microseq
).
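A minimal sketch of such a check, assuming you have copied the translated file from $SCRATCH/blast to your current working directory:

library(microseq)
prot.tbl <- readFasta("ABR_genes.fasta")   # the translated file made by the script
head(prot.tbl$Sequence)                    # should now look like amino acid sequences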
Let us do the search against the database. There are many
options to all the blast
tools, and we will only use some
of them here. Here are some lines to be added to the shell script:
##############
### Settings
###
out_file=blast_spades.txt
################
### Searching
###
apptainer exec $HOME/module5/blast:2.14.1--pl5321h6f7f691_0.sif tblastn \
-num_threads $threads -query $ABR_protein -db $dbase -out $out_file -outfmt 6
Add to the script, and sbatch
. The
$out_file
will grow as tblastn
fills it with
results; it takes some time to complete.
In the tblastn
command line we used the option
-outfmt 6
. This tells tblastn
(and any other
version of blast
) to output a table of results. This
table-format is short and concise, and very nice for further wrangling
in R! There will be one row for each alignment, and 12 columns. These
columns have no names in the output text file, but if you run
tblastn -help
you can see them listed. Here is some code to
read this into R, and where I have added the column names as specified
by the Help text:
blast.tbl <- read_delim("blast_spades.txt", delim = "\t",
col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
"gapopen", "qstart", "qend", "sstart", "send",
"evalue", "bitscore"))
The column names are:
| Column name | Explanation |
|---|---|
| qseqid | Query sequence ID, the first token in a fasta header |
| sseqid | Database (Subject) sequence ID, first token again |
| pident | Percent identity in alignment |
| length | Length of alignment |
| mismatch | Number of mismatches in alignment |
| gapopen | Number of gap-opens in alignment |
| qstart | Start of alignment on query sequence |
| qend | End of alignment on query sequence |
| sstart | Start of alignment on database sequence |
| send | End of alignment on database sequence |
| evalue | The E-value of the alignment |
| bitscore | The bit-score of the alignment |
There are many options we could have used in the search. Here are some of the more important ones:
- -word_size, -gapopen, -gapextend, -penalty and -reward are used to change the scoring rules.
- -evalue and -perc_identity are used to limit the quality of the alignments that are output.
- -num_alignments is the upper limit of how many database sequences to report alignments for.

The option -outfmt can be additionally configured. If we used -outfmt '6 qseqid sseqid evalue bitscore' we would only get 4 output columns instead of the 12 we get by default. Run tblastn -help (with the proper apptainer code in front) to see all details.
Make the shell script from the codes above (copy and paste, edit
container names/paths) and sbatch
. Verify the output
appears where it should. The file blast_spades.txt
will
gradually be filled with results as blast
is searching,
just like we saw with bowtie2
above.
Make an R script, read in the results as shown above, and filter the
results to keep only alignments with evalue
below
1e-5
. Add code to answer these questions:
- How many alignments did tblastn report?
- In how many different contigs do we have hits? Hint: use distinct() on the sseqid column.

In the example above we searched against the contigs from
spades
. We also have a similar file with contigs from
canu
, assembled from Nanopore data for the same genome (in
$COURSES/BIN310/module5/contigs_canu.fasta
). Repeat the
tblastn
search with these contigs instead (good idea to
change the name of $out_file
). Do the results change?
We assume you ran the shell script above, and made the output files blast_spades.txt and blast_canu.txt; edit the file names below otherwise:
library(tidyverse)
out.file <- "blast_spades.txt" # results from blasting against spades/illumina contigs
#out.file <- "blast_canu.txt" # results from blasting against canu/nanopore contigs
blast.tbl <- read_delim(out.file, delim = "\t",
col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
"gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
filter(evalue < 1e-05)
cat("Number of unique contigs with matches:", length(unique(blast.tbl$sseqid)), "\n")
cat("Number of unique ABR genes with matches:", length(unique(blast.tbl$qseqid)), "\n")
sum.tbl <- blast.tbl %>%
group_by(sseqid) %>%
summarize(n.match = n()) %>%
arrange(desc(n.match))
head(sum.tbl)
Searching the spades
contigs produced in total 9953
alignments. There are hits in 31 different contigs, and in total 1715
distinct ABR genes produced some hit. When searching the
canu
contigs there are 5414 alignments, all in 1 (big)
contig, and 1224 distinct ABR genes had hits this time.
We have too many hits to believe that all of these are real. There cannot be that many ABR genes in this genome! First, we must remember that many of these ABR genes are very similar to each other, and several different ABR genes will probably align to the same region in the contigs. A BLAST search like this will list them all. Thus, we should actually look for distinct regions in the contigs that have hits against some ABR genes. Second, it may also be that the E-value threshold is too liberal here, and we get hits to many regions that are not really ABR genes, just something similar.
The difference in the number of hits between spades-contigs and
canu-contigs is also something to reflect upon. Remember, when we used
quast
we found both sets of contigs gave excellent
reconstructions of the reference genome, close to 100% identity. Still, these smallish differences result in a rather large difference in hits
here. One possible explanation is that the canu
contigs,
being assembled from error prone Nanopore reads, have more errors in
them and this results in fewer hits. On the other hand, we may also
think that the spades contigs, being assembled from short Illumina
reads, result in the same region being (wrongly) assembled into
different contigs, producing more hits.
Let us dig a little deeper into the results above.
Read the results for both spades_contigs.fasta
and
canu_contigs.fasta
into R as tables. Arrange all the
alignments in each table by bitscore
with the largest at
the top. The bitscore
is the best criterion for ranking the
alignments since a larger score means a longer alignment with fewer
mismatches/gaps. Inspect the tables. How good, in terms of bitscore,
are the best alignments? What might this tell us about Illumina versus
Nanopore data?
Next, let us ‘prune’ each table as follows: Starting at the top,
eliminate alignment number i
if it overlaps with an
alignment with a larger bitscore. The information you need is in the
columns sseqid
, sstart
and send
.
The sseqid
is the text identifying each contig. The
sstart
and send
are the start and end positions
of the alignment on this contig. Here we must be aware of this:
- If the sstart value is smaller than the corresponding send value, the alignment is on the positive strand.
- If sstart is larger than the corresponding send, the alignment is on the negative strand.

It may be a good idea to first create a new column,
strand
, keeping this information.
It may also be a good idea to make another column keep which is TRUE for all alignments initially. This we change to FALSE for those alignments we want to discard due to overlap. Then, looping over alignments 2, 3, …etc you can test if the
alignment is in the same contig, same strand, and overlaps with any of
the previously kept alignments. After having done this, filter away
alignments where keep
is FALSE
. This is the
‘pruned’ table. How many alignments are you left with in the two
cases?
Here is some suggested code for the results from blasting against
spades_contigs.fasta
, results stored in
blast_spades.txt
:
library(tidyverse)
spades.tbl <- read_delim("blast_spades.txt", delim = "\t",
col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
"gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
filter(evalue < 1e-05) %>%
mutate(strand = if_else(sstart < send, "positive", "negative")) %>%
mutate(keep = T) %>%
arrange(desc(bitscore))
for(i in 2:nrow(spades.tbl)){
  tbb0 <- slice(spades.tbl, i)                 # alignment number i
  tbb <- spades.tbl %>%                        # previously kept alignments...
    slice(1:(i-1)) %>%
    filter(keep) %>%
    filter(sseqid == tbb0$sseqid & strand == tbb0$strand)  # ...in same contig and strand
  if(nrow(tbb) > 0){
    n1 <- sum((tbb0$strand == "positive") & (tbb0$sstart <= tbb$send) & (tbb0$send >= tbb$sstart))
    n2 <- sum((tbb0$strand == "negative") & (tbb0$sstart >= tbb$send) & (tbb0$send <= tbb$sstart))
    if(n1+n2 > 0){                             # overlaps a better alignment, discard it
      spades.tbl$keep[i] <- F
    }
  }
}
spades_pruned.tbl <- spades.tbl %>%
filter(keep)
Identical code can be used for the canu
results by
replacing "spades"
by "canu"
(assuming the
results are in blast_canu.txt
). This may be solved in many ways; you may have better solutions.
This results in 449 alignments for spades
and 527 for
canu
(number of rows in ‘pruned’ tables). The top ABR genes
are more or less the same in both cases, but they get much larger
bitscore values for the spades
contigs. Even if Nanopore
data give us good assemblies (with canu
) in terms of few
and long contigs, they seems to contain so much more (sequencing) errors
that finding particular genes is not as easy as for the Illumina based
contigs (with spades
).
From the exercise above we notice that the number of hits to ABR
genes depends on the contigs we searched, i.e. contigs from Illumina +
spades
produce different results than contigs from Nanopore
+ canu
. This is unfortunate. Could we simply avoid the
entire assembly? The answer is yes, we may search for the ABR genes
directly among the reads.
The diamond software

The diamond software is designed to do a search similar to BLAST, but coded in a way that makes it useful for searching directly with the reads. The search is reversed compared to what we did above: the ABR proteins now make up the database, and the reads are the queries.
Notice that diamond
can only search against proteins,
while BLAST can be used for all kinds of sequences. In principle, we
could also do a search like this with BLAST, using blastx
,
but it will be slow. The diamond software is an implementation that
speeds up the search in such cases, but the results we get are in
principle the same as if we used BLAST.
Let us now make the shell script where we first build a diamond database from the protein sequences we translated above, and then search against it with the reads. Here is some code you put into a shell script in your
module5
folder:
##############
### Settings
###
threads=10
ABR_proteins=$SCRATCH/blast/ABR_genes.fasta
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
out_dir=$SCRATCH/diamond
dbase=$out_dir/ABR_proteins
out_file1=diamond_results_R1.txt
out_file2=diamond_results_R2.txt
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
#####################
### Making database
###
apptainer exec $HOME/module5/diamond:2.1.8--h43eeafb_0.sif diamond makedb \
--threads $threads --in $ABR_proteins --db $dbase
###########################
### Searching with R1 file
###
apptainer exec $HOME/module5/diamond:2.1.8--h43eeafb_0.sif diamond blastx \
--threads $threads --db $dbase --query $r1 --out $out_file1 --outfmt 6
###########################
### Searching with R2 file
###
apptainer exec $HOME/module5/diamond:2.1.8--h43eeafb_0.sif diamond blastx \
--threads $threads --db $dbase --query $r2 --out $out_file2 --outfmt 6
Add the proper heading, use the same memory etc as for
blast
, but notice we now use 10 threads. You of course also
need a diamond
container file, download as usual.
We make use of the fasta file with proteins we made by our R script
above. Note we use the command blastx
for searching here,
since the queries are DNA, but the database is protein. This is exactly the same as we would have done with blast. Also note we use the same --outfmt 6 as for blast. Look up the
Help text for diamond
to see all options and possibilities.
We also make separate searches for the R1 and R2 files, and get separate
result files.
Let us read the results into R:
library(tidyverse)
diamond.tbl <- read_delim("diamond_results_R1.txt", delim = "\t",
col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
"gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
filter(evalue < 1e-05)
diamond.tbl <- read_delim("diamond_results_R2.txt", delim = "\t",
col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
"gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
filter(evalue < 1e-05) %>%
bind_rows(diamond.tbl)
Here we again filtered by evalue
, but still the
diamond.tbl
has many rows! Remember now we have searched
with hundreds of thousands of reads, and many reads will match the same
protein. It makes sense to immediately filter by E-value, and we could
even be stricter than 1e-05
, perhaps
1e-10
?
Make an R script using the code above to read the results.
In these results you will find that one and the same read may match
several ABR proteins. This is not unexpected, since we presume many of
the ABR proteins are similar in some ways. Filter the table such that
only the best alignment for each read is kept, i.e. the one
with the largest bitscore
. How many alignments (= reads)
are left?
Next, for each ABR gene (sseqid
) compute the
readcount, i.e. how many reads align to it. Hint: Use
group_by()
and then summarise(readcount = n())
and store this as readcount.tbl
.
You should now have a readcount value for each of the ABR genes. But, the genes are of different lengths and a longer gene will naturally be matched by more reads. We should actually compute the read coverage for each gene. We know all reads are 301 bases long. We need the length of each gene.
Read the fasta file with the ABR genes. Compute the length of each
sequence. If you read the protein file, remember to multiply the length
by 3 since we want length as the number of bases. Also, in order to
join this to the readcount.tbl
we need the first
token in the Header
for every gene. Use the
word()
function to find this and put it into a new column
named sseqid
. Then, join this table to the
readcount.tbl
by sseqid
. Finally, compute the coverage for each gene. As a first approximation, use the same formula as we used for genomes, but where now each gene is the 'genome': coverage = (readcount * 301) / (gene length).
Arrange the table by coverage, and make a histogram of all coverage values. We know from earlier that the read coverages should be around 60-70, but with a rather large variation (see exercise above).
How many ABR genes have coverage in the range 60-100? How about those genes with a very high coverage, what could these be?
library(tidyverse)
library(microseq)
diamond.tbl <- read_delim("diamond_results_R1.txt", delim = "\t",
col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
"gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
filter(evalue < 1e-10)
diamond.tbl <- read_delim("diamond_results_R2.txt", delim = "\t",
col_names = c("qseqid", "sseqid", "pident", "length", "mismatch",
"gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore")) %>%
filter(evalue < 1e-10) %>%
bind_rows(diamond.tbl)
diamond.tbl <- diamond.tbl %>% # only best match for every read
group_by(qseqid) %>%
slice_max(bitscore)
readcount.tbl <- diamond.tbl %>% # the readcounts for each ABR gene
group_by(sseqid) %>%
summarise(readcount = n()) %>%
arrange(desc(readcount))
abr.tbl <- readFasta("/mnt/courses/BIN310/module5/ABR_genes.fasta") %>% # the actual genes, and their lengths
mutate(sseqid = word(Header, 1)) %>%
mutate(length = str_length(Sequence)) %>%
select(sseqid, length)
coverage.tbl <- readcount.tbl %>% # the coverages
left_join(abr.tbl, by = "sseqid") %>% # left_join from readcount.tbl
mutate(coverage = readcount * 301 / length)
ggplot(coverage.tbl) +
geom_histogram(aes(x = coverage))
How can some genes have a very high coverage here? We match reads to ABR genes, and for each read we only keep the best alignment. Still, we observe coverages above 200 here. Could it be that these genes are on a plasmid? Plasmids tend to come in several copies, and thus attract more reads. Or it could be that the gene occurs multiple times in the chromosome itself. Genes that are often transferred between bacteria may end up in more than one copy in a genome.