1 Learning goals

In this module we aim to

  • Understand the basic ideas behind algorithms for sequence assembly, especially the De Bruijn graph idea.
  • Be able to pre-process and then assemble both short and long read data, using some standard software tools.
  • Learn how we may evaluate assemblies.



1.1 Software used in this module





2 Sequence assembly

An important step in many analyses of sequencing data is to try to reconstruct the genome(s) we sequenced by assembling the many (short) reads into longer sequences. We will first have a look at some ideas behind this, before we do some assembly on orion.

2.1 Online courses

Let us again use YouTube, and have a look at what Rob Edwards has to say about genome assembly:

In the above videos, Rob Edwards mentions four different approaches to genome assembly: the naive, the greedy, the OLC and the De Bruijn graph ideas. What is actually the difference between them? Do you get an impression of why the De Bruijn graph is such a good idea, compared to the others?

Below we will have a look at another online course on YouTube, by Ben Langmead. It is always a good idea to get input from various sources! In the course ADS1, Ben Langmead talks about many different topics, but let us only focus on the videos about assembly here. The course is a few years old, but most of it is still valid. He uses the terms prefix and suffix, which we should know: these are parts of a sequence. The prefix of length \(n\) is simply the first \(n\) bases of a sequence, and the suffix of length \(n\) is the last \(n\) bases.
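As a tiny illustration (using a made-up sequence), we can extract a prefix and a suffix with bash substring expansion:

seq=ATGGCTAAGTT
echo ${seq:0:4}     # the prefix of length 4: ATGG
echo ${seq: -4}     # the suffix of length 4: AGTT (the space before -4 matters)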

The next two videos show a method that is not actually used in practice, but it may still be a good entry point to the later ideas:

Here is the real problem we are faced with when trying to assemble short read data:

Here are the methods we use in real life, and why they work/do not work:



2.2 Lecture

In addition to what the experts above had to tell us, I have also summarized some aspects of the use of the De Bruijn graph idea in a short lecture.





3 Short read (Illumina) data

Let us start out by focusing on paired-end reads from some Illumina sequencing platform, and try to assemble such reads into longer sequences, called contigs.

3.1 Assembly with spades

We have already used the software spades in module 3, but now let us use it for assembly, which is what it is actually intended for. This is in many ways the standard software for assembling short read data in microbiology.

Let us make a shell script with the required code to do the assembly of the same data (fastq files) we looked at in the previous modules. We now make use of the same container we downloaded in module 3. In the code below I assume this is found in the folder $HOME/module3/, but you must edit this to fit how things look under your $HOME! Also, the container file itself may have a slightly different name:

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=99G                        # The amount of memory reserved
#SBATCH --time=05:00:00                  # Runs for maximum this time
#SBATCH --job-name=spades                # Sensible name for the job
#SBATCH --output=spades_%j.log           # Logfile output here

##############
### Settings
###
threads=10
memory=99
r1=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
out_dir=$SCRATCH/spades

######################
### Running software
###
apptainer exec $HOME/module3/spades:3.15.5--h95f258a_1.sif spades.py \
-o $out_dir --tmp-dir $TMPDIR --memory $memory --threads $threads --isolate \
--pe-1 1 $r1 --pe-2 1 $r2

#############################
### Copying contigs to $HOME
### and cleaning
cp $out_dir/contigs.fasta spades_contigs.fasta
# add command to delete unneeded stuff here

Notice that here I used the same apptainer container that I downloaded in module 3. This I saved as a .sif file under my $HOME/module3/ folder. You may have saved it in another folder and/or with a different file name; please edit the code accordingly. You could also copy the container file into your module4 folder. It is not super big, and having multiple copies is not a huge waste of disk space.

Make a new shell script (in a module4 folder) and copy the code above, edit, save and sbatch to SLURM. Assembly is a large task compared to what we have seen before, and will take more time. Notice we set the wall time (#SBATCH --time) to 5 hours here. The log file produced by spades is very rich in information; you may inspect it as the job proceeds, to get some idea of the progress.

Notice that in the command line we enter the read-files as --pe-1 1 $r1 --pe-2 1 $r2. The integer (1) between --pe-1 and $r1 indicates this is data set 1. You may enter several data sets at once. This is more relevant for metagenomes, where we have sequenced (almost) the same metagenome several times, and want to use all reads for assembly (co-assembly). If you look at the help text for spades you will also find a range of other input options. Notice also we set the memory limit to 99GB. Assembly is memory-intensive.
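As a sketch only (the extra file names below are made up), supplying two paired-end data sets to the same assembly could look something like this:

# Hypothetical example: two paired-end libraries co-assembled, numbered 1 and 2
apptainer exec $HOME/module3/spades:3.15.5--h95f258a_1.sif spades.py \
-o $out_dir --tmp-dir $TMPDIR --memory $memory --threads $threads \
--pe-1 1 sampleA_R1.fastq.gz --pe-2 1 sampleA_R2.fastq.gz \
--pe-1 2 sampleB_R1.fastq.gz --pe-2 2 sampleB_R2.fastq.gz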

The spades assembler is based on the De Bruijn graph idea that was mentioned in the videos above. You may notice from the log file, and the output, that spades uses K-mers of multiple lengths. First, it starts by chopping all reads into 21-mers, i.e. \(K=21\), and builds a De Bruijn graph from this. A smallish value for \(K\), like this, is robust against sequencing errors, but will also be sensitive to repeated regions, and this typically ends with a large number of rather short contigs. Then, spades re-starts the process, using \(K=33\) and the contigs from the previous step as ‘extra reads’ helping to build a new De Bruijn graph. This typically ends in fewer, but longer, contigs. In the $out_dir you specified above you will see folders with names K21, K33 etc. with output from these steps.

This process is then iterated for larger values of \(K\). Thus, spades uses a whole range of K-mer lengths, and re-builds several De Bruijn graphs, in order to get the most out of the reads. This has turned out to be a smart way of proceeding, and spades is considered the best choice of assembler for short read microbial data. It is, however, slow and requires a lot of memory. For these reasons, other tools are sometimes used, producing almost as good results but much faster and with less memory.
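To make the ‘chopping into K-mers’ concrete, here is a toy illustration in bash, with a made-up read and a small \(K\). Note how consecutive K-mers overlap by \(K-1\) bases; this overlap information is exactly what the De Bruijn graph is built on:

read_seq=ATGGCTAAGT     # a (very short) made-up read
K=5
for ((i=0; i<=${#read_seq}-K; i++)); do
  echo ${read_seq:i:K}
done
# prints: ATGGC TGGCT GGCTA GCTAA CTAAG TAAGT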

The final result we are interested in is the file contigs.fasta in the $out_dir. We may also be interested in inspecting some of the other output, but we will not dig into this here. It is mostly for solving problems that may occur. Usually, the contigs file and the log file are what we look into after a spades assembly. In the script above we copy the file contigs.fasta to another folder. The reason is that we may then delete the entire $out_dir folder! Remember to do this, since spades produces a lot of output, and the disk will soon be filled if you keep all this.
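For the cleanup line at the end of the script, something as simple as this should do (be careful: it deletes the entire output folder):

rm -rf $out_dir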

3.1.1 Exercise - re-run with --careful

In the above command line we used the option --isolate. This indicates to spades that we have data from a single isolate (genome), with high read coverage. If we have data with too low read coverage, this may fail, and spades crashes after some time. In that case, replace --isolate with --careful and you should at least get some results.

Edit the script above by replacing --isolate with --careful, change the name of the file with the contigs, and re-run.

Inspect the log files for both runs. What is the difference with respect to the use of BayesHammer?

3.1.2 Exercise solution

It looks like with the --careful option BayesHammer read error correction is run before the assembly, but with the --isolate option it is not.

3.2 Data pre-processing

In some cases it could be wise to do some pre-processing of our raw data prior to using spades or other assembly software. We should, however, always be a little careful, since each time we do something to our data there is also a chance we are destroying something. We have seen how spades has its own built-in error-correction (BayesHammer), and it is not obvious that pre-processing is a benefit.

We saw already in module 2 how we could use the software fastp for filtering data based on the fastq quality scores. Here we will do something similar, using a different software tool. There is no reason we could not have used fastp here as well, but there are multiple tools out there for doing this, and here is another example. In this section we will use the BBmap software, with the apptainer container for it from galaxy.

3.3 Adapter trimming

The Illumina sequencing technology relies on attaching some adapter sequences to the DNA fragments prior to sequencing. These are technical sequences, and have nothing to do with the organism we sequenced. They should largely be removed by the Illumina machine, but remnants may still be present in our reads. It makes sense to remove these, since they are a contamination of our data.

We will use the BBduk part of the BBmap software to do this. This software needs a file with known adapter sequences, and will then match K-mers from the reads against these in order to locate their potential presence. In the command line below we also do some trimming and filtering of the reads, much like we did with fastp in module 2.

First, download the newest bbmap container from galaxy, and store it in your module4 folder. The procedure is the same as you used for the other containers above. In the code below I have just pasted in some container file name; you need to edit this if it is not the same as the one you downloaded.

Here is some code to make a script running BBduk on the data we have used before, copy and edit:

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=10G                        # The amount of memory reserved
#SBATCH --time=01:00:00                  # Runs for maximum this time
#SBATCH --job-name=preproc               # Sensible name for the job
#SBATCH --output=preproc_%j.log          # Logfile output here

#############
### Settings
###
adapters=$COURSES/BIN310/module4/adapters.fa
threads=10
out_dir=$SCRATCH/bbduk
r1_in=$COURSES/BIN310/module2/Illumina_MiSeq_R1.fastq.gz
r2_in=$COURSES/BIN310/module2/Illumina_MiSeq_R2.fastq.gz
r1_out=$out_dir/$(basename $r1_in)
r2_out=$out_dir/$(basename $r2_in)
unpaired_out=$out_dir/unpaired.fastq.gz

###############################
### The trimming and filtering
###
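### A brief note on the options below (see the BBduk guide for full details):
###   ktrim=r : when an adapter k-mer is matched, trim it and everything to its right
###   k=23 mink=11 : k-mer length for the matching, allowing down to 11-mers at read ends
###   hdist=1 : allow one mismatch (hamming distance) in the k-mer matching
###   tpe tbo : trim both reads in a pair to the same length, and also trim adapters
###             detected from the overlap between the two reads in a pair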
apptainer exec $HOME/module4/bbmap:39.01--h92535d8_1.sif bbduk.sh \
t=$threads in1=$r1_in in2=$r2_in ref=$adapters out1=$r1_out out2=$r2_out outs=$unpaired_out \
ktrim=r k=23 mink=11 hdist=1 tpe tbo

Notice that we use a bbmap container from galaxy, but the command for running the trimming and filtering is bbduk.sh, which is part of this container. We will use another command from the same container below.

The adapters are in a fasta file specified in $adapters. This is something you may also copy from the BBmap website (see below). This file contains a number of known Illumina adapters, and most likely some of these have been used when sequencing our data.

We output to a folder under $SCRATCH; it will be created if it does not already exist. We do not copy the output files from this folder to our $HOME, but this is something we could have done.

The output file names must also be specified. Again we use the basename UNIX command, and re-use the input file names, but we also need a third output file, for the unpaired reads. Just as we saw for BayesHammer (module 3) or fastp (module 2), this step will also result in some reads losing their mate, and we get a bunch of unpaired reads in addition to those still in pairs.

The command line itself contains a lot, and I will not dig into it in detail here. The complete guide for BBmap is found here. You may study this to sort out what the options on the last line of the command mean. It is basically about trimming and filtering reads based on adapter matches and quality scores.

3.4 Merging reads

Before we start the spades assembly, it could be an advantage to merge reads. Remember, with paired-end sequencing we sequence both ends of a fragment of DNA. If the sequenced DNA fragment is short, the paired-end reads will overlap each other. Assume the fragment is 500 basepairs long, and the reads are both 300 bases. Then the reads will overlap each other by 100 bases (\(300+300-500=100\)), i.e. the last 100 bases of the R1 read are the reverse-complement of the last 100 bases of the R2 read. In such cases it could be a benefit to actually merge these into one read, i.e. instead of having two reads both 300 bases long, we get one read which is 500 bases. Not all reads overlap like this, but let us run the paired reads through BBmerge to see what happens. Here we now simply extend the BBduk script from above, having both steps in the same script:

#...see script above...
#...then add the following below that...

#############
### Settings
###
out_dir=$SCRATCH/bbmerge         # new out_dir
r1=$r1_out                       # from above
r2=$r2_out                       # from above
r1_out=$out_dir/$(basename $r1_in)
r2_out=$out_dir/$(basename $r2_in)
merged_out=$out_dir/merged.fastq.gz

#####################
### The merging
###
apptainer exec $HOME/module4/bbmap:39.01--h92535d8_1.sif bbmerge.sh \
t=$threads in1=$r1 in2=$r2 out=$merged_out outu1=$r1_out outu2=$r2_out

Extend the BBduk shell script in your module4 folder with the code above, and sbatch the script. You should get three compressed fastq files in both of the $out_dir folders we specified above, and a log file in your module4 folder.

3.5 Is pre-processing really a good idea?

Let us have a look at the results we got above, and discuss if pre-processing is a good idea or not.

Trimming of adapters seems like a good idea, since adapters are artificial sequences we do not want in our reads. The only potential problem may be if there are many adapters to match against, and these are rather short. Then some reads may randomly contain a segment that matches an adapter, but is still ‘real’ DNA. Thus, we may remove something which is actually a valid genomic sequence.

Inspecting the log file from the above run, I find this section:

Input:                      1191850 reads       358746850 bases.
KTrimmed:                   16038 reads (1.35%)     1131824 bases (0.32%)
Trimmed by overlap:         2476 reads (0.21%)  20532 bases (0.01%)
Total Removed:              256 reads (0.02%)   1152356 bases (0.32%)
Result:                     1191594 reads (99.98%)  357594494 bases (99.68%)

From the last two lines I read that almost no reads were discarded, and only around 1.5% of the reads were trimmed, i.e. had something that matched adapters. This sounds good; we do not expect too many reads to contain adapters, even if it happens from time to time.

When it comes to the merging, it is obvious that if very few read-pairs are merged, we can do just as well without this step. Scrolling down the log file to the merging, I find this section:

Pairs:                  595797
Joined:                 268159      45.008%
Ambiguous:              324240      54.421%
No Solution:            3398        0.570%
Too Short:              0           0.000%

Here 45% of the read pairs could be merged. In this case merging makes sense; if this percentage had been very low (< 10%), merging would not have had any impact.

If the merging percentage is very high, e.g. ~90%, you should be alert! It means that the fragments that were sequenced are all quite short, since the vast majority of read pairs can be merged. This is in general not a good thing when it comes to assembly! As mentioned in one of the videos by Rob Edwards above, paired-end reads are useful since they span a larger region of the DNA. Even if we do not sequence the middle part of the fragment, we know the reads in a pair come from the ends of the same fragment/region, and this information is used to resolve the problems with repeats. With too short fragments, this benefit is almost lost! The two reads in a read pair are now just two versions of the same short region. It is a waste of sequencing resources to do paired-end on such short fragments. In such cases this should be reported, to see if it is possible to alter the lab procedures to get longer fragments.

Is merging of the reads really needed before assembly? Given the De Bruijn graph idea, shouldn’t the overlapping K-mers pick up all this information anyway? The short answer is yes, they should. Merging is rarely important for the quality of the final contigs, but there is one more issue: building the De Bruijn graph takes a lot of memory, and we use less memory if we use the merged reads. This is hardly important for single genome assembly, but when we come to metagenomes, this may be of some value. Perhaps the best argument for including the merging step is that we may discover that our reads overlap too much, and that this will most likely lead to some problems in the assembly later, regardless of whether we use merged reads or not as input.

The people behind the spades software have some recommendations on data pre-processing here.

3.5.1 Exercise - run spades on pre-processed data

Now, edit the spades script from before (use the --isolate option again) to use the pre-processed data as input. This means you use as input:

  • The R1 file after merging
  • The R2 file after merging
  • The file with merged reads after merging
  • The file with unpaired reads after trimming

All four inputs may be listed, using the proper options in spades. Have a look at the help text (spades.py --help) or the manual here for how to include the merged and unpaired reads.

Remember to copy the resulting contigs to your module4 folder and give it a different name than those from before (e.g. spades_contigs_preproc.fasta). We will later compare the contigs we got from spades using both raw and pre-processed data.

3.5.2 Exercise solution

Suggested solution shell script (remember to edit the paths):

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=99G                        # The amount of memory reserved
#SBATCH --time=05:00:00                  # Runs for maximum this time
#SBATCH --job-name=spades                # Sensible name for the job
#SBATCH --output=spades_%j.log           # Logfile output here

##############
### Settings
###
threads=10
memory=99
r1=$SCRATCH/bbmerge/Illumina_MiSeq_R1.fastq.gz
r2=$SCRATCH/bbmerge/Illumina_MiSeq_R2.fastq.gz
merged=$SCRATCH/bbmerge/merged.fastq.gz
unpaired=$SCRATCH/bbduk/unpaired.fastq.gz
out_dir=$SCRATCH/spades_preproc

######################
### Running software
###
apptainer exec $HOME/module3/spades:3.15.5--h95f258a_1.sif spades.py \
-o $out_dir --tmp-dir $TMPDIR --memory $memory --threads $threads --isolate \
--pe-1 1 $r1 --pe-2 1 $r2 --s 1 $unpaired --merged $merged

#############################
### Copying contigs to $HOME
### and cleaning
cp $out_dir/contigs.fasta spades_contigs_preproc.fasta
# add command to delete unneeded stuff here





4 Evaluating assemblies

How good are the assemblies we get? In any real situation we do not know exactly what the sequenced genome actually looks like. We may, however, have the genome of some close relative to compare against. But in some cases we do not even have this. Then, what can we do?

4.1 Basic assembly statistics

An assembly has many facets, and the ‘goodness’ depends partly on what we are looking for. There are, however, some basic quantities we usually compute:

  • Number of contigs
  • The N50 or similar measures
  • Coverages, either K-mer or read coverage

The number of contigs is the easiest. Ideally, if we sequence a single chromosome, a perfect assembler should give us 1 single contig spanning the entire molecule. In reality this never happens (with short read data), but as few contigs as possible seems to be a criterion for goodness. Remember, even a prokaryotic genome is not always 1 DNA molecule. Bacteria frequently have plasmids, which are small ‘extra chromosomes’. These should ideally end up as separate contigs.

On the other hand, if the chromosome is 3 million bases long, we could imagine the following two assembly results:

  • Assembly 1 gives 100 contigs, where the longest is 2.9 million bases, and the remaining 99 are all around 1000 bases.
  • Assembly 2 gives 10 contigs, all between 200 and 400 thousand bases.

Most people will say assembly 1 is the best, even if it has the most contigs, since we manage to reconstruct one very large contig. The many small contigs are not important, simply because they are small. This is the reason for the N50 statistic.

This is computed as follows:

  • Arrange all contigs in sorted order by their length, longest first.
  • Sum all contig lengths to get the total \(T\).
  • Start at the first (longest) contig, and accumulate contig lengths until you reach \(0.50\cdot T\).
  • The length of the contig where you reached \(0.50\cdot T\) is the N50 value.

Similarly we may define N75, N90 etc., by changing the threshold to \(0.75\cdot T\) or \(0.90\cdot T\). As a small worked example: if the sorted contig lengths are 50, 30, 20, 10 and 5 kb, the total is \(T=115\) kb, and the cumulative sum passes \(0.50\cdot T=57.5\) kb at the second contig, so N50 = 30 kb.

With respect to coverage, we would ideally expect to see similar values for all contigs, with only some minor variance. If some contigs have much higher coverage, it is an indication of a repeated region, or possibly a plasmid, which is often found in multiple copies in each cell. A very low coverage may indicate contigs assembled from reads containing errors.
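For a quick look at these numbers directly on orion, without going into R, we can pull the length and K-mer coverage out of the spades fasta headers (this assumes the header format described in the exercise below):

grep "^>" spades_contigs.fasta | awk -F"_" '{print $4, $6}' | head

Each output line is then the length and K-mer coverage of one contig.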

If we have a reference genome, i.e. a genome sequence of something very similar to what we have now assembled, we may align the contigs to this. From this we can compute how large a fraction of that genome is covered by contigs. We may also detect badly assembled contigs, if one part of a contig aligns to a certain region of the reference, but another part aligns to a completely different region. We should, however, be a little careful here, since bacteria are known to shuffle genomic regions, and even close relatives may have such differences.

4.1.1 Exercise - contigs statistics in R

Make an R script in your module4 folder where you read in the file spades_contigs.fasta that we got from the first assembly above. Use the readFasta() function again to produce a table with Header and Sequence for each contig. Let us call the table contigs.tbl.

Add columns named Length and Kmer_coverage to the table. Both are listed in the Header of each contig. These Headers all follow the pattern NODE_<X>_length_<length>_cov_<kmer-coverage>, where the three texts inside < > differ between lines. Thus, if we split this text by each underscore "_" it consists of 6 words. Word number 4 gives you the length, and word number 6 the K-mer coverage.

Extract words by using the word() function in R, e.g. word number 4 is extracted by word(Header, 4, 4, sep = "_"). Remember to also convert Length to a number. When extracting it from a text by word() you get a text, not a number. Use as.numeric() on that text to convert it to a number. The same applies to Kmer_coverage.

Compute the N50 value from the data in the Length column. You need to

  • Sort this vector in decreasing order
  • Sum up all lengths to get the total length
  • Compute the cumulative sum of the sorted contig lengths
  • Find the first contig where the cumulative length is more than half the total
  • The length of this contig is the N50 value

Here is a way to start, where the table contigs.tbl should end up with the columns mentioned above:

library(tidyverse)
library(microseq)

contigs.tbl <- readFasta("spades_contigs.fasta") %>% 
  mutate(Length = as.numeric(word(___, ___, ___, sep = "_"))) %>% 
  mutate(Kmer_coverage = as.numeric(______)) %>% 
  arrange(___)

Tot.length <- sum(___)
Cum.length <- cumsum(___)
idx <- min(which(___ >= ___ * 0.5))
N50 <- ___[idx]

Make a plot of the K-mer coverage versus the contig length for each contig. The title of the plot should state the total number of contigs, and the N50 value.

4.1.2 Exercise solution

library(tidyverse)
library(microseq)

contigs.tbl <- readFasta("spades_contigs.fasta") %>%
  mutate(Length = as.numeric(word(Header, 4, 4, sep = "_"))) %>%
  mutate(Kmer_coverage = as.numeric(word(Header, 6, 6, sep = "_"))) %>%
  arrange(desc(Length))

Tot.length <- sum(contigs.tbl$Length)
Cum.length <- cumsum(contigs.tbl$Length)
contig.number <- min(which(Cum.length >= Tot.length * 0.5))
N50 <- contigs.tbl$Length[contig.number]

fig <- ggplot(contigs.tbl) +
  geom_point(aes(x = Kmer_coverage, y = Length)) +
  scale_y_log10() + scale_x_log10() +
  labs(x = "Kmer coverage", y = "Contig length", title = str_c(nrow(contigs.tbl), " contigs, N50=", N50))
print(fig)



4.2 The quast software

The software quast may be used to compute the above statistics, and a range of other assessments. Especially if we have a reference genome this is a nice tool. A reference is an already sequenced genome that is very close to the one we assembled, and by comparing the assembled contigs to this we get a picture of how good our assembly is. In the case of the data we have been working with, we have indeed a very good reference genome. You find it in

  • $COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta

This is actually the exact same strain that was sequenced! Thus, we should ideally get 100% identity if the assembly is perfect. For all details about the possibilities in quast, consult the quast manual.

Let us first consider the contigs we got from spades on the raw data, and run quast on these. First, download the newest quast container from galaxy, and save it in your module4 folder.

Here is a shell script code you can copy and edit:

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=10G                        # The amount of memory reserved
#SBATCH --time=01:00:00                  # Runs for maximum this time
#SBATCH --job-name=quast                 # Sensible name for the job
#SBATCH --output=quast_%j.log            # Logfile output here

##############
### Settings
###
contigs=$HOME/module4/spades_contigs.fasta  # we copied this to our module4 folder when running spades above
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=$HOME/module4/quast_spades
threads=10

##################
### Running quast
###
apptainer exec $HOME/module4/quast:5.2.0--py39pl5321h4e691d4_3.sif quast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs

Notice we did not output to some folder under $SCRATCH this time, and the only reason is that we want to click on an .html file in the output folder. Having this folder inside the module4 folder is then handy.

In RStudio, click into the quast_spades folder, and open the report.html file (View in Web Browser).

Here is a short video discussing the output we see.

4.2.1 Exercise - compare to assembly after pre-processing

Above we also did some pre-processing of the data before we assembled with spades. Now, edit the quast script, using these contigs instead. Also, output to some other folder. Compare the results. Did the pre-processing of the data make any difference to the assembly evaluation result?

4.2.2 Exercise solution

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=10G                        # The amount of memory reserved
#SBATCH --time=01:00:00                  # Runs for maximum this time
#SBATCH --job-name=quast                 # Sensible name for the job
#SBATCH --output=quast_%j.log            # Logfile output here

##############
### Settings
###
contigs=spades_contigs_preproc.fasta
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=quast_preproc
threads=10

##################
### Running quast
###
apptainer exec $HOME/module4/quast:5.2.0--py39pl5321h4e691d4_3.sif quast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs





5 Long read (Nanopore) assembly

We have so far focused on Illumina paired-end short read data. This is still the most common data type in microbial genomics, but things are changing, and we expect to see a lot more long read data in the near future, especially data from Oxford Nanopore Technologies (ONT), often just called Nanopore.

As we already know:

  • Short read data are high precision, but are sensitive to repeats.
  • Long read data are (currently) low precision, but have the length we require to resolve repeats.

As the sequencing technology improves, we will most likely see data with gradually longer and more precise reads. Eventually we may hope to get the entire, and correct, genome sequence directly from the sequencer, and the whole assembly step is left in the junkyard as yesterday's science! Well, until we are there, we must be capable of turning Nanopore long reads into assembled genomes. We will do this now, and compare the results we get to those we got with short reads above.

5.1 Nanopore data quality

The raw output from Nanopore sequencing is a set of files in the fast5 format. These are binary files we cannot inspect manually, and their format has not always been very well described by the producers. Depending on the sequencing machine used, there is either one such file for each read, or multi-read fast5 files. From these fast5 files, a base-caller software is used to produce the fastq files with the reads. Several base-callers exist, e.g. guppy, and these are still under development. In this process a number of reads are already discarded for having too low precision. A threshold of quality score 7(!) is commonly used, which is in itself very low.

We saw above that for short read data, pre-processing may or may not be a benefit for the assembly step. For long read data it is perhaps even less of a benefit. As an example, adapters may be present, potentially once in each read. In a set of many short reads, this may be a rather large fraction of the data. For a few long reads, one short adapter in each of them is a negligible fraction of the total. Also, trimming based on sequencing quality, which is sometimes done for short reads, is less straightforward. The Illumina reads are usually of very high precision, and we may throw away the odd ‘rubbish’ without losing too much of our data. The Nanopore reads are low precision from the beginning, and sorting out ‘rubbish’ is very difficult without losing useful data. We also saw that spades has a built-in error-correction step. This is also the case for some of the long-read assemblers, making any pre-processing even less important. Thus, we will not devote any attention to pre-processing Nanopore data here.

5.2 Assembly with canu

The canu software has been built from the age-old Celera assembler that was made during the Human Genome Project in the previous century. Note that this is not a De Bruijn graph assembler! Instead, it uses the idea of Overlap-Layout-Consensus (OLC). Remember that the De Bruijn graph idea was adopted when NGS technology started to produce a huge number of short reads. The OLC strategy worked fine with the Sanger technology, where reads were long and few. Finding overlaps for a huge number of NGS reads turned out too computationally expensive. However, with today's long read technologies, we are back where we started in this sense, and the OLC idea is again a good approach to assembly.

canu is only one of a range of software tools for the assembly of Nanopore data; see the article https://f1000research.com/articles/8-2138 for a fairly recent benchmarking of such tools. Also, for an overview of the entire processing of Nanopore data, see the article https://academic.oup.com/bib/article/20/4/1542/4958758. We will not dig into these methods in BIN310, but the course BIO326 devotes more time to the full treatment of Nanopore data.

canu also starts with error correction of the reads, and we may feed the fastq file directly to the assembler. Let us run canu on some Nanopore data for the exact same genome that we have been working on so far. You find the single fastq file with Nanopore data in $COURSES/BIN310/module4/Nanopore.fastq.gz

Again, download the newest canu container from galaxy, and put it in your module4 folder. Remember to edit the container file name in the code below accordingly.

Here is some code you copy and edit:

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=99G                        # The amount of memory reserved
#SBATCH --time=10:00:00                  # Runs for maximum this time
#SBATCH --job-name=canu                  # Sensible name for the job
#SBATCH --output=canu_%j.log             # Logfile output here

##############
### Settings
###
threads=10
memory=99g
genome_size=5m
reads=$COURSES/BIN310/module4/Nanopore.fastq.gz
out_dir=$SCRATCH/canu
prefix=Nanopore

######################
### Running software
###
apptainer exec $HOME/module4/canu:2.2--ha47f30e_0.sif canu \
-d $out_dir -p $prefix genomeSize=$genome_size maxThreads=$threads maxMemory=$memory -nanopore $reads

######################################
### Copying contigs to $HOME
### NB! Do not delete $out_dir yet!
###
cp $out_dir/$prefix.contigs.fasta canu_contigs.fasta

Make a shell script with the above code, and sbatch it as before. This will take quite some time to complete, hours rather than minutes. Note that we set the wall time to 10 hours here.

Notice we specify the maximum number of threads and memory that canu should use. It may use less, at least in certain steps. We also supply a ‘guess’ of the genome size. This is not critical for the end result; for most bacteria 3 million basepairs (3m) would probably be close enough. Here we happen to know the genome is around 5m. The $prefix simply specifies the first part of the output file names.

Much like spades, canu also produces a lot of output. The contigs we are most interested in are in the file $out_dir/$prefix.contigs.fasta (where $prefix is replaced by the text we give to this shell variable in our script). As a by-product we also get corrected and trimmed reads, in the compressed fasta files $out_dir/$prefix.correctedReads.fasta.gz and $out_dir/$prefix.trimmedReads.fasta.gz.

Once we have the assembly, it is not uncommon to try to polish it, making use of the raw fast5 files to try to correct mistakes in the assembly; see the article https://academic.oup.com/bib/article/20/4/1542/4958758 mentioned above. We will only consider unpolished assemblies this time; they are still quite good.

5.2.1 Exercise - length of reads

Read the Nanopore fastq file into R (use readFastq()). In R you cannot use the environment variable COURSES (not without some extra code lines). Use instead the full path, i.e. replace $COURSES with /mnt/courses/. Compute the length of all reads, and store these as the column Length in the table. As usual, when creating a new column we use mutate(), and we compute the lengths from the Sequence using str_length(Sequence) inside mutate().

Then do the same for the fasta files (use readFasta()) with both corrected and trimmed reads. You should then have 3 tables.

How many reads are in the corrected/trimmed files compared to the raw file?

What is the length distribution of these reads? Make histograms and compare. Which reads have been discarded by canu?

5.2.2 Exercise solution

### Here I read from MY SCRATCH disk folder (/mnt/SCRATCH/bin310-23-48/), you should change this to yours...
library(tidyverse)
library(microseq)

fq.raw <- readFastq("/mnt/courses/BIN310/module4/Nanopore.fastq.gz") %>%
  mutate(Length = str_length(Sequence))
fq.cor <- readFasta("/mnt/SCRATCH/bin310-23-48/canu/Nanopore.correctedReads.fasta.gz") %>%   # edit path!
  mutate(Length = str_length(Sequence))
fq.trm <- readFasta("/mnt/SCRATCH/bin310-23-48/canu/Nanopore.trimmedReads.fasta.gz") %>%      # edit path!
  mutate(Length = str_length(Sequence))

### Putting all lengths into a single table, and with the extra column
### indicating from which file they came.
### Could also make 3 separate plots and compare.
reads.tbl <- fq.raw %>%
  select(Length) %>%
  mutate(File = "raw")
reads.tbl <- fq.cor %>%
  select(Length) %>%
  mutate(File = "corrected") %>%
  bind_rows(reads.tbl)
reads.tbl <- fq.trm %>%
  select(Length) %>%
  mutate(File = "trimmed") %>%
  bind_rows(reads.tbl) %>%
  mutate(File = factor(File, levels = c("raw", "corrected", "trimmed")))
### The last use of factor here is just a trick to make the panels appear
### in the order raw-corrected-trimmed in the plot below

fig <- ggplot(reads.tbl) +
  geom_histogram(aes(x = Length), bins = 30) +
  facet_wrap(~File, ncol = 1)
print(fig)



5.3 Briefly on the canu procedure

The canu pipeline is described at https://canu.readthedocs.io/en/latest/tutorial.html as:

“The canu pipeline, that is, what it actually computes, comprises of computing overlaps and processing the overlaps to some result. Each of the three tasks (read correction, read trimming and unitig construction) follow the same pattern:

  • Load reads into the read database, gkpStore.
  • Compute k-mer counts in preparation for the overlap computation.
  • Compute overlaps.
  • Load overlaps into the overlap database, ovlStore.
  • Do something interesting with the reads and overlaps.
    • The read correction task will replace the original noisy read sequences with consensus sequences computed from overlapping reads.
    • The read trimming task will use overlapping reads to decide what regions of each read are high-quality sequence, and what regions should be trimmed. After trimming, the single largest high-quality chunk of sequence is retained.
    • The unitig construction task finds sets of overlaps that are consistent, and uses those to place reads into a multialignment layout. The layout is then used to generate a consensus sequence for the unitig.”

We may think of a unitig as simply a unique contig built from corrected/trimmed reads, ready for further assembly. Notice that the correction/trimming does not rely on the quality scores, but rather on overlaps. This is quite typical for long reads: the quality scores are not that useful, but due to their length the reads should overlap a lot, and reads with too little overlap are probably rubbish. Thus, in a long read data set, very short reads are probably rubbish!



5.4 Evaluate assemblies

We may of course use the contigs from canu, or any other assembler, as input to quast as we did above, and get the same type of assembly evaluation.

5.4.1 Exercise - evaluate the canu contigs

Edit the quast script from above, and use the contigs from canu instead of those from spades. Compare to what you got with spades. It is the same reference genome.

5.4.2 Exercise solution

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=10G                        # The amount of memory reserved
#SBATCH --time=01:00:00                  # Runs for maximum this time
#SBATCH --job-name=quast                 # Sensible name for the job
#SBATCH --output=quast_%j.log            # Logfile output here

##############
### Settings
###
contigs=canu_contigs.fasta
ref=$COURSES/BIN310/module4/Bacillus_cereus_ATCC14579.fasta
out_dir=quast_canu
threads=10

##################
### Running quast
###
apptainer exec quast:5.2.0--py39pl5321h4e691d4_3.sif quast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs



5.5 Hybrid assemblies

In some cases we have both short-read and long-read data for the same samples. This is truly a luxury. We then have both very precise (short) reads and (long) reads with long overlaps. Combining both data types means we do a hybrid assembly.

spades has an option for making such hybrid assemblies; you may add the long reads with the --nanopore option. Then, spades will build contigs from the short reads as before, and then use the long reads to ‘bind together’ the resulting contigs into longer contigs. A similar overall approach is used by the unicycler software, see https://github.com/rrwick/Unicycler#method-hybrid-assembly.
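As a sketch only (not something we run in this module), a hybrid spades command could look something like the following, re-using the variables from the spades script earlier and the Nanopore fastq file from above:

# Hypothetical hybrid assembly: short reads as before, plus a file of Nanopore reads
nanopore=$COURSES/BIN310/module4/Nanopore.fastq.gz
apptainer exec $HOME/module3/spades:3.15.5--h95f258a_1.sif spades.py \
-o $SCRATCH/spades_hybrid --memory $memory --threads $threads \
--pe-1 1 $r1 --pe-2 1 $r2 --nanopore $nanopore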

We will not dig into hybrid assembly here, as these types of data are not that common, simply because we need to sequence the same sample twice, making it costly in terms of time and money. My personal experience is also that this rarely improves the assembly very much, at least for single genome sequencing. Often the results are already very good using only one of the data types, and it is difficult to improve something which is very good! For metagenomes, which we will look into later, assemblies are poorer, and then this may make more sense. As the long read technologies improve in precision, it is also likely that long reads will be the only data source for genome reconstruction in the future.