In this module we aim to assemble genomes from sequencing reads, both short and long, and to evaluate how good the resulting assemblies are.
An important step in many analyses of sequencing data is to try to
re-construct the genome(s) we sequenced by assembling the many (short)
reads into longer sequences. We will first have a look at some ideas
behind this, before we do some assembly on orion.
Let us again use YouTube, and have a look at what Rob Edwards has to say about genome assembly:
In the above videos, Rob Edwards mentions four different approaches to genome assembly, the naive, the greedy, the OLC and the De Bruijn graph ideas. What is actually the difference between them? Do you get an impression of why the De Bruijn graph is such a good idea, compared to the others?
Below we will have a look at another online course on YouTube, by Ben Langmead. It is always a good idea to get input from various sources! In the course ADS1, Ben Langmead talks about many different topics, but let us only focus on the videos about assembly here. The course is a few years old, but most of it is still valid. He uses the terms prefix and suffix, which we should know: these are parts of a sequence. The prefix of length \(n\) is simply the first \(n\) bases of a sequence, and the suffix of length \(n\) is the last \(n\) bases.
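As a small illustration in R (using the stringr package that we also use elsewhere in this course; the sequence here is just made up):
library(stringr)
sq <- "ATGGCGTACCTTGA"                # a made-up sequence of 14 bases
n <- 5
str_sub(sq, start = 1, end = n)       # the prefix of length 5: "ATGGC"
str_sub(sq, start = -n, end = -1)     # the suffix of length 5: "CTTGA"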
The next two videos show a method that is not actually used in practice, but it may still be a good entry point to the later ideas:
Here is the real problem we are faced with when trying to assemble short read data:
Here are the methods we use in real life, and why they work/do not work:
In addition to what the experts above had to tell us, I have also summarized some aspects of the use of the De Bruijn graph idea in a short lecture.
Let us start out by focusing on data as paired-end reads from some
Illumina sequencing platform. Let us try to assemble such reads into
longer sequences, called contigs.
spades
We have already used the software spades
in module 3,
but now we will use it for assembly, which is what it is really intended for. This
is in many ways the standard software for assembling short read data in
microbiology.
Let us make a shell script with the required code to do the assembly
of the same data (fastq files) we looked at in the previous modules. We
now make use of the same container we downloaded in module 3. In the
code below I assume this is found in the folder
$HOME/module3/
, but you must edit this to match how it looks under your $HOME
! Also, the container file
itself may have a slightly different name:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=99G # The amount of memory reserved
#SBATCH --time=05:00:00 # Runs for maximum this time
#SBATCH --job-name=spades # Sensible name for the job
#SBATCH --output=spades_%j.log # Logfile output here
##############
### Settings
###
threads=10
memory=99
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
out_dir=$SCRATCH/spades
######################
### Running software
###
apptainer exec $HOME/module3/spades:3.15.5--h95f258a_1.sif spades.py \
-o $out_dir --tmp-dir $TMPDIR --memory $memory --threads $threads --isolate \
--pe-1 1 $r1 --pe-2 1 $r2
#############################
### Copying contigs to $HOME
### and cleaning
cp $out_dir/contigs.fasta spades_contigs.fasta
# add command to delete unneeded stuff here
Notice that here I used the same apptainer
container
that I downloaded in module 3. This I saved as a .sif
file
under my $HOME/module3/
folder. You may have saved it in
another folder and/or with a different file name, please edit the code
accordingly. You could also copy the container file into your
module4
folder. It is not super big, and having multiple
copies is not a huge waste of disk space.
Make a new shell script (in a module4 folder), copy the code above, edit, save and sbatch it to SLURM. Assembly is a large task compared to what we have seen before, and will take longer. Notice we set the wall time (#SBATCH --time) to 5 hours here. The log file produced by spades is rich in information; you may inspect it as the job proceeds to get an idea of the progress.
Notice that in the command line we enter the read-files as
--pe-1 1 $r1 --pe-2 1 $r2
. The integer (1
)
between --pe-1
and $r1
indicates this is data
set 1. You may enter several data sets at once. This is more relevant
for metagenomes, where we have sequenced (almost) the same metagenome
several times, and want to use all reads for assembly (co-assembly). If
you look at the help text for spades
you will also find a
range of other input options. Notice also we set the memory limit to
99GB. Assembly is memory-intensive.
The spades
assembler is based on the De Bruijn graph
idea mentioned in the videos above. You may notice from the log file, and the output, that spades uses K-mers of multiple lengths. First, it chops all reads into 21-mers, i.e. \(K=21\), and builds a De Bruijn graph from
this. A smallish value for \(K\), like
this, is robust against sequencing errors, but will also be sensitive to
repeated regions, and this typically ends with a large number of rather
short contigs. Then, spades
re-starts the process, using
\(K=33\) and the contigs from the
previous step as ‘extra reads’ helping to build a new De Bruijn graph.
This typically ends in fewer, but longer, contigs. In the
$out_dir
you specified above you will see folders with names
K21
, K33
etc. with output from these
steps.
This process is then iterated for larger values of \(K\). Thus, spades
uses a whole
range of K-mers, and re-builds several De Bruijn graphs, in order to get
the most out of the reads. This has turned out to be a smart way of
proceeding, and spades
is considered the best choice of
assembler for short read microbial data. It is, however, slow and
requires a lot of memory. For these reasons, other tools are sometimes
used, producing almost as good results but much faster and with less
memory.
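To get a feel for what chopping reads into K-mers means, here is a tiny R sketch (with a made-up read, not actual spades output) listing all 21-mers of a read. Note how consecutive K-mers overlap by \(K-1 = 20\) bases; it is these heavy overlaps the De Bruijn graph exploits.
library(stringr)
K <- 21
read <- "ATGGCGTACCTTGACCGTAGGCTAAGTCCGT"   # a made-up read of 31 bases
starts <- 1:(str_length(read) - K + 1)      # 11 possible start positions
kmers <- str_sub(read, start = starts, end = starts + K - 1)
kmers                                       # 11 overlapping 21-mers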
The final result we are interested in is the file
contigs.fasta
in the $out_dir
. We may also be
interested in inspecting some of the other output, but we will not dig
into this here. It is mostly for solving problems that may occur.
Usually, the contigs file and the log file are what we look into after a
spades
assembly. In the script above we copy the file
contigs.fasta
to another folder. The reason is that we may then
delete the entire $out_dir
folder! Remember to do this,
since spades
produces a lot of output, and the disk will
soon be filled if you keep all this.
--careful
In the above command line we used the option --isolate
.
This indicates to spades
we have data from a single isolate
(genome), with high read coverage. If we have data with too low read
coverage, this may fail, and spades
crashes after some
time. Then, replace --isolate
with --careful
and you should at least get some results.
Edit the script above by replacing --isolate
with
--careful
, change the name of the file with the contigs,
and re-run.
Inspect the log files for both runs. What is the difference with respect to the use of BayesHammer?
It looks like with the --careful
option the BayesHammer
is used before the assembly, but with the --isolate
option
it is not.
In some cases it could be wise to do some pre-processing of
our raw data prior to using spades
or other assembly
software. We should, however, always be a little careful since each time
we do something to our data, there is also a chance we are destroying
something. We have seen how spades
has its own built-in
error-correction (BayesHammer), and it is not obvious that
pre-processing is a benefit.
We saw already in module 2 how we could use the software
fastp
for filtering data based on the fastq quality scores.
Here we will do something similar, using a different software tool.
There is no reason we could not have used fastp
here as
well, but there are multiple tools out there for doing this, and this is simply another example. In this section we will use the BBmap software, and the apptainer container for it from galaxy.
The Illumina sequencing technology relies on attaching some adapter sequences to the DNA fragments prior to sequencing. These are technical sequences, and have nothing to do with the organism we sequenced. They should largely be removed by the Illumina machine, but remnants of them may still be present in our reads. It makes sense to remove these, since they are a contamination of our data.
We will use the BBduk
part of the BBmap
software to do this. This software needs a file with known adapter
sequences, and will then match K-mers from the reads against these in order to locate any adapter remnants. In the command line below we also do some trimming and filtering of the reads, much like we did with
fastp
in module 2.
First, download the newest bbmap
container from galaxy,
and store it in your module4
folder. The procedure is the
same as you used for the other containers. In the code below I have pasted in a container file name; you need to edit this if it is not the same as the one you downloaded.
Here is some code to make a script running BBduk
on the
data we have used before, copy and edit:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=preproc # Sensible name for the job
#SBATCH --output=preproc_%j.log # Logfile output here
#############
### Settings
###
adapters=$COURSES/BIN310/module4/adapters.fa
threads=10
out_dir=$SCRATCH/bbduk
r1_in=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2_in=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
r1_out=$out_dir/$(basename $r1_in)
r2_out=$out_dir/$(basename $r2_in)
unpaired_out=$out_dir/unpaired.fastq.gz
###############################
### The trimming and filtering
###
apptainer exec $HOME/module4/bbmap:39.01--h92535d8_1.sif bbduk.sh \
t=$threads in1=$r1_in in2=$r2_in ref=$adapters out1=$r1_out out2=$r2_out outs=$unpaired_out \
ktrim=r k=23 mink=11 hdist=1 tpe tbo
First, notice that we use a bbmap container from galaxy, but the command for running the trimming and filtering is
bbduk.sh
, which is part of this container. We will use
another command from the same container below.
The adapters are in a fasta file specified in $adapters
.
This is something you may also copy from the BBmap website (see below).
This file contains a number of known Illumina adapters, and most likely
some of these have been used when sequencing our data.
We output to a folder under $SCRATCH; it will be created if it does not already exist. We do not copy the output files from
this folder to our $HOME
, which is something we could have
done.
The out-file names must also be specified. Again we use the
basename
UNIX command, and re-use the input file names, but
we also need a third output file, for the unpaired reads. Just as we saw
for BayesHammer (module 3) or fastp
(module 2) this step
will also result in some reads losing their mate, and we get a bunch of
unpaired reads in addition to those still in pairs.
The command line itself contains a lot, and I will not dig into it here. The complete guide for BBmap is found in the BBTools documentation. You may study this to sort out what the options in the last line of the command line mean. They are basically about how the adapter K-mer matching and the trimming of the reads is done.
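Just to illustrate the K-mer matching idea that bbduk builds on, here is a toy R sketch (a made-up read and a much simplified procedure, not what bbduk actually runs internally) that locates an adapter K-mer in a read and trims from there and rightwards:
library(stringr)
k <- 11
adapter <- "AGATCGGAAGAGC"                           # start of a common Illumina adapter, for illustration
read <- "TTGACCGTAGGCTAAGTAGATCGGAAGAGCACACG"        # made-up read with adapter remains near the 3' end
starts <- 1:(str_length(adapter) - k + 1)
adapter.kmers <- str_sub(adapter, starts, starts + k - 1)              # all k-mers of the adapter
hit <- min(str_locate(read, adapter.kmers)[, "start"], na.rm = TRUE)   # first k-mer hit in the read
str_sub(read, 1, hit - 1)                            # keep only what is left of the hit (the ktrim=r idea)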
Before we start the spades
assembly, it could be an
advantage to merge reads. Remember, with paired-end sequencing
we sequence both ends of a fragment of DNA. If the sequenced
DNA fragment is short, the paired-end reads will overlap each other.
Assume the fragment is 500 basepairs long, and the reads are both 300
bases. Then the reads will overlap each other by 100 bases, i.e. the
last 100 bases of the R1
read are the reverse-complement of
the last 100 bases of the R2
read. In such cases it could
be a benefit to actually merge these into one read, i.e. instead of
having two reads both 300 bases long, we get one read which is 500
bases. Not all reads overlap like this, but let us run the paired reads
through BBmerge
to see what happens. Here we now simply
extend the BBduk
script from above, having both steps in
the same script:
#...see script above...
#...then add the following below that...
#############
### Settings
###
out_dir=$SCRATCH/bbmerge # new out_dir
r1=$r1_out # from above
r2=$r2_out # from above
r1_out=$out_dir/$(basename $r1_in)
r2_out=$out_dir/$(basename $r2_in)
merged_out=$out_dir/merged.fastq.gz
#####################
### The merging
###
apptainer exec $HOME/module4/bbmap:39.01--h92535d8_1.sif bbmerge.sh \
t=$threads in1=$r1 in2=$r2 out=$merged_out outu1=$r1_out outu2=$r2_out
Make a new shell script in your module4
folder, add
the code above, and sbatch
the script. You should get three
compressed fastq files in both of the $out_dir
folders we
specified above, and a log file in your module4
folder.
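Before we look at the actual numbers, here is a toy R sketch of what merging means for a single overlapping read pair (made-up 20 base reads from a 30 base fragment, so the overlap is \(2 \cdot 20 - 30 = 10\) bases; bbmerge of course does this for millions of read pairs, and in a much more careful way):
library(microseq)                                    # for reverseComplement()
library(stringr)
R1 <- "ATGGCGTACCTTGACCGTAG"                         # made-up forward read
R2 <- "TCAGGCATTACTACGGTCAA"                         # made-up reverse read
R2.rc <- reverseComplement(R2)                       # "TTGACCGTAGTAATGCCTGA"
overlap <- 10
str_sub(R1, -overlap) == str_sub(R2.rc, 1, overlap)  # TRUE, the reads agree in the overlap
str_c(R1, str_sub(R2.rc, overlap + 1))               # the merged 30 base read, i.e. the whole fragment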
Let us have a look at the results we got above, and discuss if pre-processing is a good idea or not.
Trimming of adapters seems like a good idea, since adapters are artificial sequences we do not want in our reads. The only potential problem may be if there are many adapters to match against, and these are rather short. Then some reads may randomly contain a segment that matches an adapter, but is still ‘real’ DNA. Thus, we may remove something which is actually a valid genomic sequence.
Inspecting the log file from the above run, I find this section:
Input: 1191850 reads 358746850 bases.
KTrimmed: 16038 reads (1.35%) 1131824 bases (0.32%)
Trimmed by overlap: 2476 reads (0.21%) 20532 bases (0.01%)
Total Removed: 256 reads (0.02%) 1152356 bases (0.32%)
Result: 1191594 reads (99.98%) 357594494 bases (99.68%)
From the last two lines I read that almost no reads were discarded, and around 1% of the reads were trimmed, i.e. had something that matched adapters. This sounds good; we do not expect too many reads to contain adapters, even if it happens from time to time.
When it comes to the merging, it is obvious that if very few read-pairs are merged, we can do just as well without this step. Scrolling down the log file to the merging, I find this section:
Pairs: 595797
Joined: 268159 45.008%
Ambiguous: 324240 54.421%
No Solution: 3398 0.570%
Too Short: 0 0.000%
Here 45% of the read pairs could be merged. In this case merging makes sense; if this percentage had been very low (< 10%), merging would not have had any real impact.
If the merging percentage is very high, e.g. ~90%, you should beware! It means that the fragments that were sequenced are all quite short, since the vast majority of read pairs can be merged. This is in general not a good thing when it comes to assembly! As mentioned in one of the videos by Rob Edwards above, paired-end reads are useful since they span a larger region of the DNA. Even if we do not sequence the middle part of the fragment, we know the reads in a pair come from the ends of the same fragment/region, and this information is used to resolve the problems with repeats. With too short fragments, this benefit is almost lost! The two reads in a read pair are then just two versions of the same short region. It is a waste of sequencing resources to do paired-end sequencing on such short fragments. In such cases this should be reported, to see if it is possible to alter the lab procedures to get longer fragments.
Is merging of the reads really needed before assembly? Given the De Bruijn graph idea, shouldn't the overlapping K-mers pick up all this information anyway? The short answer is yes, they do. Merging is rarely important for the quality of the final contigs, but there is one more issue: Building the De Bruijn graph takes a lot of memory. We use less memory if we use the merged reads. This is hardly important for single genome assembly, but when we come to metagenomes, this may be of some value. Perhaps the best argument for including the merging step is that we may discover that our reads overlap too much, and that this will most likely lead to some problems in the assembly later, regardless of whether we use merged reads or not as input.
The people behind the spades software have some recommendations on data pre-processing in their documentation.
spades on pre-processed data
Now, edit the spades
script from previously (use the
--isolate
option again) to use the pre-processed data as
input. This means you use as input the R1 file after merging, the R2 file after merging, the merged reads, and the unpaired reads from the adapter trimming step. All these four inputs may be listed, with proper options, in
spades
. Have a look at the help text
(spades.py --help
) or the online manual for how to include the merged and unpaired reads.
Remember to copy the resulting contigs to your module4
folder and give it a different name than those from before
(e.g. spades_contigs_preproc.fasta
). We will later compare
the contigs we got from spades
using both raw and
pre-processed data.
Suggested solution shell script (remember to edit the paths):
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=99G # The amount of memory reserved
#SBATCH --time=05:00:00 # Runs for maximum this time
#SBATCH --job-name=spades # Sensible name for the job
#SBATCH --output=spades_%j.log # Logfile output here
##############
### Settings
###
threads=10
memory=99
r1=$SCRATCH/bbmerge/Illumina_MiSeq_R1.fastq.gz
r2=$SCRATCH/bbmerge/Illumina_MiSeq_R2.fastq.gz
merged=$SCRATCH/bbmerge/merged.fastq.gz
unpaired=$SCRATCH/bbduk/unpaired.fastq.gz
out_dir=$SCRATCH/spades_preproc
######################
### Running software
###
apptainer exec $HOME/module3/spades:3.15.5--h95f258a_1.sif spades.py \
-o $out_dir --tmp-dir $TMPDIR --memory $memory --threads $threads --isolate \
--pe-1 1 $r1 --pe-2 1 $r2 --s 1 $unpaired --merged $merged
#############################
### Copying contigs to $HOME
### and cleaning
cp $out_dir/contigs.fasta spades_contigs_preproc.fasta
# add command to delete unneeded stuff here
How good are the assemblies we get? In any real situation we do not
know exactly what the sequenced genome actually looks like. We may, however, have the genome of some close relative to compare against.
But, in some cases we don’t even have this. Then, what can we do?
An assembly has many facets, and the ‘goodness’ depends partly on what we are looking for. There are, however, some basic quantities we usually compute: the number of contigs, the contig lengths summarized by the N50 statistic, the coverage of each contig, and, if we have a reference genome, how well the contigs align to it.
The number of contigs is the easiest. Ideally, if we sequence a single chromosome, a perfect assembler should give us 1 single contig spanning the entire molecule. In reality this never happens (with short read data), but as few contigs as possible seems to be a criterion for goodness. Remember, even a prokaryotic genome is not always 1 DNA molecule. Bacteria frequently have plasmids, which are small ‘extra chromosomes’. These should ideally end up as separate contigs.
On the other hand, if the chromosome is 3 million bases long, we could imagine the following two assembly results: Assembly 1 gives one contig covering most of the chromosome plus a large number of very short contigs, while assembly 2 gives fewer contigs, but all of them are of modest length and none spans a large part of the chromosome.
Most people will say assembly 1 is the best, even though it has the most contigs, since we manage to re-construct one very large contig. The many small contigs are not important, simply because they are small. This is the reason for the N50 statistic.
This is computed as follows: Sort the contigs from the longest to the shortest, and add up the contig lengths from the top of this list. The N50 is the length of the contig where this cumulative length first reaches half the total length of all contigs.
Similarly we may define N75, N90 etc, by changing the threshold to \(0.75\) or \(0.90\).
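A small worked example with made-up contig lengths:
lengths <- c(1000, 600, 400, 300, 200)     # made-up contig lengths, sorted from longest to shortest
cumsum(lengths)                            # 1000 1600 2000 2300 2500
sum(lengths) * 0.5                         # 1250, half the total length
# the cumulative length first reaches 1250 at contig number 2, hence N50 = 600
lengths[min(which(cumsum(lengths) >= sum(lengths) * 0.5))]   # 600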
With respect to coverage, we would ideally expect to see similar values for all contigs, with only some minor variance. If some contigs have much higher coverage, it is an indication of a repeated region, or possibly a plasmid, which is often found in multiple copies in each cell. A very low coverage is perhaps an indication of contigs made from erroneous reads.
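As a rough back-of-the-envelope check of what coverage level to expect, we may divide the total number of sequenced bases by the genome size. The numbers below are the total bases from the bbduk log above and an assumed genome size of roughly 5 million bases; note that the K-mer coverage reported by spades will be somewhat lower than this read coverage, since a read of length \(L\) contains only \(L-K+1\) K-mers.
total.bases <- 358746850        # from the bbduk log above
genome.size <- 5.0e6            # assuming a genome of roughly 5 million bases
total.bases / genome.size       # around 72, i.e. roughly 70x average read coverage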
If we have a reference genome, i.e. a genome sequence of something very similar to what we have assembled, we may align the contigs to this. From this we can compute how large a fraction of that genome is covered by contigs. We may also detect badly assembled contigs, if one part of a contig aligns to a certain region of the reference, but another part aligns to a completely different region. We should, however, be a little careful here, since bacteria are known to shuffle genomic regions, and even close relatives may have such differences.
Make an R script in your module4
folder where you read
in the file spades_contigs.fasta
that we got from the first
assembly above. Use the readFasta()
function again, to
produce a table with Header
and Sequence
for
each contig. Let us call the table contigs.tbl.
Add columns named Length
and Kmer_coverage
to the table. Both are listed in the Header
of each contig.
These Header
s all follow the pattern
NODE_<X>_length_<length>_cov_<kmer-coverage>
where the three texts inside an < >
are different for
each line. Thus, if we split this text by each underscore
"_"
it consists of 6 words. Word number 4 gives you the
length, and word number 6 the K-mer coverage.
Extract words by using the word()
function in R,
e.g. word number 4 is extracted by
word(Header, 4, 4, sep = "_")
. Remember to also convert
Length
to a number. When extracting it from a text by
word()
you get a text, not a number. Use
as.numeric()
on that text to convert it to a number. The
same applies to Kmer_coverage
.
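As a concrete example, using a hypothetical header string:
library(stringr)                               # word() is from the stringr package
header <- "NODE_1_length_532814_cov_14.5371"   # a hypothetical contig Header
as.numeric(word(header, 4, 4, sep = "_"))      # 532814, the length
as.numeric(word(header, 6, 6, sep = "_"))      # 14.5371, the K-mer coverage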
Compute the N50 value from the data in the Length column. You need to sort the table by descending Length, compute the total length of all contigs, compute the cumulative sum of the lengths, and find the first contig where the cumulative length reaches half the total. The N50 is the Length of that contig.
Here is a way to start, assuming the table contigs.tbl has the columns mentioned above:
library(tidyverse)
library(microseq)
contigs.tbl <- readFasta("spades_contigs.fasta") %>%
mutate(Length = as.numeric(word(___, ___, ___, sep = "_"))) %>%
mutate(Kmer_coverage = as.numeric(______)) %>%
arrange(___)
Tot.length <- sum(___)
Cum.length <- cumsum(___)
idx <- min(which(___ >= ___ * 0.5))
N50 <- ___[idx]
Make a plot of the K-mer coverage versus the contig length for each contig. The title of the plot should state the total number of contigs, and the N50 value.
library(tidyverse)
library(microseq)
contigs.tbl <- readFasta("spades_contigs.fasta") %>%
mutate(Length = as.numeric(word(Header, 4, 4, sep = "_"))) %>%
mutate(Kmer_coverage = as.numeric(word(Header, 6, 6, sep = "_"))) %>%
arrange(desc(Length))
Tot.length <- sum(contigs.tbl$Length)
Cum.length <- cumsum(contigs.tbl$Length)
contig.number <- min(which(Cum.length >= Tot.length * 0.5))
N50 <- contigs.tbl$Length[contig.number]
fig <- ggplot(contigs.tbl) +
geom_point(aes(x = Kmer_coverage, y = Length)) +
scale_y_log10() + scale_x_log10() +
labs(x = "Kmer coverage", y = "Contig length", title = str_c(nrow(contigs.tbl), " contigs, N50=", N50))
print(fig)
quast software
The software quast
may be used to compute the above
statistics, and a range of other assessments. Especially if we have a
reference genome this is a nice tool. A reference is an already
sequenced genome that is very close to the one we assembled, and by
comparing the assembled contigs to this we get a picture of how good our
assembly is. In the case of the data we have been working with, we have
indeed a very good reference genome. You find it in
$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
This is actually the exact same strain that was sequenced! Thus, we
should ideally get 100% identity if the assembly is perfect. For all
details about all the possibilities in using quast
,
consider the quast manual.
Let us first consider the contigs we got from spades
on
the raw data, and run quast
on these. First, download the
newest quast
container from galaxy, and save it in your
module4
folder.
Here is a shell script code you can copy and edit:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=quast # Sensible name for the job
#SBATCH --output=quast_%j.log # Logfile output here
##############
### Settings
###
contigs=$HOME/module4/spades_contigs.fasta # we copied this to our module4 folder when running spades above
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=$HOME/module4/quast_spades
threads=10
##################
### Running quast
###
apptainer exec $HOME/module4/quast:5.2.0--py39pl5321h4e691d4_3.sif quast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs
Notice we did not output to some folder under $SCRATCH
this time, and the only reason is we want to click open an .html
file in the output folder. Having this folder inside the
module4
folder is then handy.
In RStudio, click into the quast_spades
folder, and
click open the report.html
(View in Web Browser).
Here is a short video discussing the output we see.
Above we also did some pre-processing of the data before we assembled
with spades. Now, edit the quast
script, using these
contigs instead. Also, output to some other folder. Compare the results.
Did the pre-processing of the data make any difference to the assembly
evaluation result?
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=quast # Sensible name for the job
#SBATCH --output=quast_%j.log # Logfile output here
##############
### Settings
###
contigs=spades_contigs_preproc.fasta
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=quast_preproc
threads=10
##################
### Running quast
###
apptainer exec $HOME/module4/quast:5.2.0--py39pl5321h4e691d4_3.sif quast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs
We have so far focused on Illumina paired-end short read data. This is still the most common data type in microbial genomics, but things are changing, and we expect to see a lot more long read data in the near future, especially data from Oxford Nanopore Technologies (ONT), or just Nanopore.
As we already know, Nanopore reads are typically much longer than Illumina reads, but each base is read with considerably lower precision.
As the sequencing technology improves, we will most likely see data
with gradually longer and more precise reads. Eventually we may hope to get the entire, and correct, genome sequence directly from the sequencer, and the whole assembly step is left in the junkyard as yesterday's science! Well, until we are there we must be capable of
turning Nanopore long reads into assembled genomes. We will do this now,
and compare the results we get to those we got with short reads above.
The raw output from Nanopore sequencing is a set of files in the
fast5 format. These are binary files we cannot inspect
manually, and their format has not always been very well described by
the producers. Depending on the sequencing machine used, there is one
such file for each read, or multi-read fast5 files. From these fast5
files, a base-caller software is used to produce the fastq files with the reads. Several base-caller software tools are around,
e.g. guppy
, and this is still under development. From this
process a number of reads are already discarded, having too low
precision. A threshold of quality score 7(!) is commonly used, which is
itself very low.
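Recall how quality scores relate to error probabilities: a score of \(Q\) corresponds to an error probability of \(10^{-Q/10}\), so a threshold of 7 corresponds to roughly a 20% error probability per base:
10^(-7 / 10)     # = 0.1995..., i.e. about a 20% chance a base is wrong at quality score 7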
We saw above that for short read data, pre-processing may or may not be a benefit for the assembly step. For long read data this is perhaps even more the case. As an example, adapters may be present,
potentially once in each read. In a set of many short reads, this
may then be a rather large fraction of the data. For a few long reads,
one short adapter in each of them is an ignorable fraction of the total.
Also, trimming based on sequencing quality, which is sometimes done for
short reads, is less straightforward. The Illumina reads are usually of
very high precision, and we may throw away the odd ‘rubbish’ without
losing too much of our data. The nanopore reads are low precision from
the beginning, and sorting out ‘rubbish’ is very difficult without
losing useful data. We also saw that spades
has a built-in
error-correction step. This is also the case for some of the long-read
assemblers, making any pre-processing even less important. Thus, we will
not devote any attention to pre-processing Nanopore data here.
canu
The canu
software has been built from the age-old Celera
assembler that was made during the Human Genome Project in the
previous century. Note that this is not a De Bruijn graph
assembler! Instead, it uses the idea of Overlap-Layout-Consensus (OLC).
Remember that the De Bruijn graph idea was adopted when NGS technology started to produce a huge number of short reads. The OLC strategy worked fine with the Sanger technology, where reads were long and few. Finding overlaps for a huge number of NGS reads turned out to be too computationally expensive. However, with today's long read technologies, we are back
where we started in this sense, and the OLC idea is again a good
approach to assembly.
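To make the overlap part of OLC a little more concrete, here is a toy R sketch finding the longest exact suffix-prefix overlap between two made-up reads (real assemblers do this far more cleverly, and must allow for mismatches):
library(stringr)
A <- "ATGGCGTACCTTGACCGTAG"                  # made-up read
B <- "TTGACCGTAGTAATGCCTGA"                  # made-up read starting where A ends
longest.overlap <- 0
for (n in (min(str_length(A), str_length(B)) - 1):1) {
  if (str_sub(A, -n) == str_sub(B, 1, n)) {  # does the length n suffix of A equal the length n prefix of B?
    longest.overlap <- n
    break
  }
}
longest.overlap                              # 10
str_c(A, str_sub(B, longest.overlap + 1))    # stitching the two reads together using the overlap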
canu is only one of a range of software tools for the assembly of Nanopore data; see the article https://f1000research.com/articles/8-2138 for a fairly
recent benchmarking of such tools. Also, for an overview of the entire
processing of Nanopore data, see the article https://academic.oup.com/bib/article/20/4/1542/4958758.
We will not dig into these methods in BIN310, but the course BIO326
devotes more time to the full treatment of Nanopore data.
canu also starts by error correction of the reads, and we may feed the fastq file directly to the assembler. Let us run canu on some Nanopore data for the exact same genome that
we have been working on so far. You find the single fastq file with
Nanopore data in
$COURSES/BIN310/module4/Nanopore.fastq.gz
Again, download the newest canu
container from galaxy,
and put it in your module4
folder. Remember to edit the
container file name in the code below accordingly.
Here is some code you copy and edit:
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=99G # The amount of memory reserved
#SBATCH --time=10:00:00 # Runs for maximum this time
#SBATCH --job-name=canu # Sensible name for the job
#SBATCH --output=canu_%j.log # Logfile output here
##############
### Settings
###
threads=10
memory=99g
genome_size=5m
reads=$COURSES/BIN310/module4/Nanopore.fastq.gz
out_dir=$SCRATCH/canu
prefix=Nanopore
######################
### Running software
###
apptainer exec $HOME/module4/canu:2.2--ha47f30e_0.sif canu \
-d $out_dir -p $prefix genomeSize=$genome_size maxThreads=$threads maxMemory=$memory -nanopore $reads
######################################
### Copying contigs to $HOME
### NB! Do not delete $out_dir yet!
###
cp $out_dir/$prefix.contigs.fasta canu_contigs.fasta
Make a shell script with the above code, and sbatch
it
as before. This will take quite some time to complete, hours rather than
minutes. Note that we set the wall time to 10 hours here.
Notice we specify the maximum number of threads and memory that
canu
should use. It may use less, at least in certain
steps. We also supply a ‘guess’ of the genome size. This is not critical
for the end result; for most bacteria 3 million basepairs
(3m
) would probably be close enough. Here we happen to know
the genome is around 5m
. The $prefix
simply
specifies the ‘first’ name of the output files.
Much like spades
, canu
also produces a
lot of output. The contigs we are most interested in are in the file
$out_dir/$prefix.contigs.fasta
(where $prefix
is replaced by the text we give to this shell variable in our script).
As a by-product we also get corrected and trimmed reads, in the fasta
files $out_dir/$prefix.correctedReads.fasta
and
$out_dir/$prefix.trimmedReads.fasta
.
Once we have the assembly, it is not uncommon to try to polish it, making use of the raw fast5 files to try to correct mistakes in the assembly; see the article https://academic.oup.com/bib/article/20/4/1542/4958758 mentioned above. We will only consider unpolished assemblies this time; they are still quite good.
Read the Nanopore fastq file into R (use readFastq()
).
In R you cannot use the environment variable COURSES
(not
without some extra code line). Use instead the full path, replace
$COURSES
with /mnt/courses/
. Compute the
length of all reads, and store as the column Length
in the
table. As usual, when creating a new column we use
mutate()
, and compute the lengths from the
Sequence
using str_length(Sequence)
inside
mutate()
.
Then do the same for the fasta files (use readFasta()
)
with both corrected and trimmed reads. You should then have 3
tables.
How many reads are in the corrected/trimmed files compared to the raw file?
What is the length distribution of these reads? Make histograms and
compare. Which reads have been discarded by canu
?
### Here I read from MY SCRATCH disk folder (/mnt/SCRATCH/bin310-23-48/), you should change this to yours...
library(tidyverse)
library(microseq)
fq.raw <- readFastq("/mnt/courses/BIN310/module4/Nanopore.fastq.gz") %>%
mutate(Length = str_length(Sequence))
fq.cor <- readFasta("/mnt/SCRATCH/bin310-23-48/canu/Nanopore.correctedReads.fasta.gz") %>% # edit path!
mutate(Length = str_length(Sequence))
fq.trm <- readFasta("/mnt/SCRATCH/bin310-23-48/canu/Nanopore.trimmedReads.fasta.gz") %>% # edit path!
mutate(Length = str_length(Sequence))
### Putting all lengths into a single table, and with the extra column
### indicating from which file they came.
### Could also make 3 separate plots and compare.
reads.tbl <- fq.raw %>%
select(Length) %>%
mutate(File = "raw")
reads.tbl <- fq.cor %>%
select(Length) %>%
mutate(File = "corrected") %>%
bind_rows(reads.tbl)
reads.tbl <- fq.trm %>%
select(Length) %>%
mutate(File = "trimmed") %>%
bind_rows(reads.tbl) %>%
mutate(File = factor(File, levels = c("raw", "corrected", "trimmed")))
### The last use of factor here is just a trick to make the panels appear
### in the order raw-corrected-trimmed in the plot below
fig <- ggplot(reads.tbl) +
geom_histogram(aes(x = Length), bins = 30) +
facet_wrap(~File, ncol = 1)
print(fig)
canu procedure
The canu
pipeline is described at https://canu.readthedocs.io/en/latest/tutorial.html
as:
“The canu pipeline, that is, what it actually computes, comprises of computing overlaps and processing the overlaps to some result. Each of the three tasks (read correction, read trimming and unitig construction) follow the same pattern:
A unitig we may think of as simply a unique, corrected and trimmed contig (long read) ready for assembly. Notice that the correction and trimming do not rely on the quality scores, but rather on overlaps. This is quite typical for long reads: the quality scores are not that useful anymore, but since the reads are long they should overlap a lot, and reads with too little overlap to the others are probably rubbish. Thus, short reads are probably rubbish!
We may of course use the contigs from canu
, or any other
assembler, as input to quast
as we did above, and get the
same type of assembly evaluation.
canu contigs
Edit the quast
script from above, and use the contigs
from canu
instead of those from spades
.
Compare to what you got with spades
. It is the same
reference genome.
#!/bin/bash
#SBATCH --nodes=1 # We always use 1 node
#SBATCH --reservation=BIN310 # For BIN310 course only
#SBATCH --account=bin310 # For BIN310 course only
#SBATCH --ntasks=10 # The number of threads reserved
#SBATCH --mem=10G # The amount of memory reserved
#SBATCH --time=01:00:00 # Runs for maximum this time
#SBATCH --job-name=quast # Sensible name for the job
#SBATCH --output=quast_%j.log # Logfile output here
##############
### Settings
###
contigs=canu_contigs.fasta
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=quast_canu
threads=10
##################
### Running quast
###
apptainer exec quast:5.2.0--py39pl5321h4e691d4_3.sif quast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs
In some cases we have both short-read and long-read data for the same samples. This is truly a luxury. We then have both very precise (short) reads and long, overlapping (long) reads. Combining both data types means we do a hybrid assembly.
spades has an option for making such hybrid assemblies; you may add the long reads with the --nanopore
option. Then, spades
will build contigs from the short
reads as before, and then use the long reads to ‘bind together’ the
resulting contigs into longer contigs. A similar overall approach is
used by the unicycler software; see https://github.com/rrwick/Unicycler#method-hybrid-assembly.
We will not dig into hybrid assembly here, as this type of data is not that common, simply because we need to sequence the same sample twice, making it costly in terms of time and money. My personal experience is also that this rarely improves the assembly very much, at least for single genome sequencing. Often the results are very good already using only one of the data types, and it is difficult to improve something which is very good! For metagenomes, which we will look into later, assemblies are poorer, and then this may make more sense. As the long-read technologies improve their precision, it is also likely that this will be the only data source for genome re-construction in the future.