1 Learning goals

Over the first modules we have focused on single genomes. We may have data from many different genomes, but each of them has been sequenced separately, as isolates, i.e. we isolate a single strain of bacteria in the lab, grow it, extract its DNA and sequence this.

We now turn our attention to metagenomes. This means we no longer attempt to isolate single bacteria, but instead sequence all the DNA we collect from some environment. This is often what we have to do, since isolating the bacteria is simply impossible: we have no methods for growing them in the lab.

A metagenome is not the genome of a single organism, but rather the collection of all different genomes found in some microbial community. The classical example is the metagenome of the human gut. Instead of trying to isolate the various species, we simply extract all DNA from all organisms in the gut, and sequence this. Metagenomics has become an important part of microbiology, and metagenome studies are conducted in many branches of biology these days, trying to understand the role of the microbes in the interplay with animals, plants, soil, water, air…

The course BIO326 is more specifically devoted to metagenomics and the re-construction of genomes. Hence, in this module of BIN310 we only look briefly into this topic; for more on it, consider BIO326. We will also recognize computational solutions we used in previous modules, and in many ways the bioinformatics work is rather similar whether you have data from isolates or metagenomes. There are, however, some aspects of metagenome data that make some additional steps necessary, and these are what we focus on here.

  • We first do some simple read coverage computations, in order to appreciate the limitations of re-constructing genomes from metagenome data.
  • Learn how we can de-contaminate data, i.e. discard reads that are not from the microbial community at all, before we start to assemble the reads.
  • See that assembly can be done in much the same way as we did for isolates, both for Illumina and Nanopore data.
  • Learn what is meant by binning, and understand and use the software maxbin for doing this step.
  • Learn how we, based on the binning, can extract the final re-constructed genomes, often termed Metagenome Assembled Genomes (MAGs).



1.1 Software used in this module

Note that below you find code where these software tools are used, but we will not actually run all of them this time. We have used most of these tools before, and the code below illustrates their use on metagenome data.





2 Metagenomes

A metagenome is the collective genome of an entire microbial community. Metagenomics is a branch of genomics where we study such communities. We will devote the last modules of this course to metagenomics in some way.

Microbes and microbial communities are everywhere on this planet, and have ‘always’ been here. Still, it is only fairly recently that we have been able to study this branch of biology at full scale. The reason is that methods in microbiology have required the isolation of single strains, and only a tiny fraction of all microbes can be isolated this way. The rest, typically \(\sim 99\%\), has been termed biological dark matter. The vast majority of biology on planet Earth has been invisible to us! With today's sequencing technology we can sequence all DNA in a sample, without the need for isolation. Such data form the basis for metagenomics, and for the first time in history we can really start to study the full microbial world.

Metagenome data are collected for many reasons. Most studies typically involve

  • Estimating the taxonomic composition. This means finding what types of organisms are in there, and how much of each type. We will come back to this in the next module.
  • Re-constructing the genomes of the organisms in there, in order to understand how they “work”. This is what we focus on in this module.

Both these purposes may be just a first step for subsequent analyses, e.g. the composition may be used as features in a machine learning model predicting if something is wrong in an environment (human gut diseases, food or environmental contaminations etc.), and finding the genomes may be a first step in further functional studies of what biochemical/biological processes the microbes in the community may carry out.

2.1 Metagenome sequencing

So far we have only considered the sequencing of a single organism. This means we have a pool of DNA fragments from many cells, but these cells are all copies of each other. Thus, all fragments are from ‘the same genome’ in the sense that they all work equally well as building blocks when we try to re-construct the genome. In this setting we also talked about read coverage.

With metagenome data, we have DNA fragments from several different cells, i.e. from different genomes. Assume we have a microbial community with \(M\) different organisms, i.e. there are \(M\) distinct genomes. Let \(G_1\), \(G_2\),…,\(G_M\) be the genome sizes of these community members, and let \(C_1\), \(C_2\),…,\(C_M\) be the genome abundances, i.e. fraction of genomes from each organism in the total number of genomes in the community. Thus, these fractions sum to 1: \[\sum_{i=1}^{M} C_i = 1.0\] We often refer to \(C_i\) as the relative abundance of genome \(i\), and the vector of abundances \({\bf C}=(C_1, C_2,...,C_M)\) is the genomic composition of the community.

When we do shotgun sequencing, we first fragment all genomes randomly, and then sequence these fragments. This means that a genome that has

  • A large abundance
  • or a large genome size

will contribute more fragments to the total pool of fragments. Thus, the probability that a read comes from genome \(i\) should be \[P_i = G_i\cdot C_i / A\] where \(A = \sum_{i=1}^M G_i \cdot C_i\). We can think of \(A\) as the size of the fragment pool, and the probability of sequencing genome \(i\) is its fraction of this pool.

What is the read coverage for genome \(i\)? Well, if we have sequenced \(N\) reads in total, the expected number of reads coming from genome \(i\) is \(N_i = P_i N\). We then plug this \(N_i\) into the usual formula for read coverage, i.e. \(D_i = N_i L / G_i\), where \(L\) is the read length.
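To make this concrete, here is a minimal R sketch with made-up numbers (a toy community of three genomes, not the exercise data) computing the expected coverage for each genome:

# Toy community: M = 3 genomes with made-up sizes and abundances
G <- c(3e6, 4e6, 5e6)         # genome sizes (basepairs)
C <- c(0.6, 0.3, 0.1)         # relative abundances, summing to 1
N <- 1e6                      # total number of reads sequenced
L <- 150                      # read length

A <- sum(G * C)               # size of the total fragment pool
P <- (G * C) / A              # probability that a read is from genome i
N.i <- P * N                  # expected number of reads from genome i
coverage <- (N.i * L) / G     # read coverage for genome i
round(coverage, 1)            # 25.7 12.9  4.3

Note how the least abundant genome ends up with a coverage well below 10, the rule-of-thumb minimum for re-construction used in the exercise below, even though it is the largest genome in the community.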

2.1.1 Exercise - metagenome read coverage

Let us simulate in R to get an impression of metagenome read coverages, and how many genomes in a metagenome we can hope to re-construct.

Let us make an R script where we simulate genome sizes and relative abundances, and then compute read coverages for a given total number of reads \(N\). This is how we start out:

M <- 20                                             # number of genomes
G <- round(runif(M, min = 2000000, max = 5000000))  # typical bacterial genome sizes

Thus, the \(M\) genomes get some uniform random sizes between 2 and 5 mega-basepairs. Edit these limits if you like.

An important property of a microbial community is its diversity. Diversity is high if there are many different organisms in the community. Their distribution also matters: if there are many organisms, but only a very few of them dominate in abundance, the diversity is lower than in the situation where all organisms are more or less equally abundant.

When simulating, we assume the distribution of relative abundances is exponential. This is of course a simplification, but rarely very far from the truth, except for communities completely dominated by one or a very few organisms. If we order the genomes from most to least abundant, the abundances are \[C_i = a e^{-b(i-1)}\] for \(i=1,...,M\), where \(a\) is some scaling factor and \(b\) determines the shape of the distribution. Instead of choosing values for \(a\) and \(b\), we determine both from the dynamic range of the community. This is simply the largest (first) abundance divided by the smallest (last). If we set this to some fixed value, then, since \(\sum_i C_i = 1.0\), both \(a\) and \(b\) follow. Add this to the script:

dynamic.range <- 10                 # largest abundance is 10 times the smallest
b <- log(dynamic.range) / (M - 1)
a <- 1 / sum(exp(-b * (0:(M-1))))   # 0:(M-1) since the exponent is -b*(i-1)
C <- a * exp(-b * (0:(M-1)))
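Why does this give the intended \(a\) and \(b\)? With the genomes ordered from most to least abundant, the ratio of the largest to the smallest abundance is \[\frac{C_1}{C_M} = \frac{a e^{-b\cdot 0}}{a e^{-b(M-1)}} = e^{b(M-1)}\] Setting this equal to the dynamic range and solving gives the expression for \(b\) in the code. With \(b\) in place, the constraint \[\sum_{i=1}^{M} C_i = a\sum_{i=1}^{M} e^{-b(i-1)} = 1.0\] gives \(a = 1/\sum_{i=1}^{M} e^{-b(i-1)}\), which is exactly what the code computes.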

Verify that (a minimal check is sketched below the list):

  • The abundances in C sum to 1.0
  • The largest abundance divided by the smallest is the dynamic.range
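A minimal check, assuming the objects from the code above:

sum(C)          # should be 1 (up to rounding)
C[1] / C[M]     # should equal dynamic.range, here 10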

You may also plot the abundances as a bar chart, e.g. using the simple barplot() function; see the sketch below. The bars should decrease exponentially from the most to the least abundant genome.
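A one-liner producing such a bar chart, assuming the vector C from above:

barplot(C, names.arg = 1:M, xlab = "Genome", ylab = "Relative abundance")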

Finally, add the code to compute the read coverage for each genome. Use \(N=1000000\) (one million) reads and a typical Illumina read length \(L\). A rule of thumb says we need a read coverage of at least 10 to have a chance of re-constructing the genome. How many genomes may be re-constructed? Due to the random genome sizes, you need to re-run a few times, and consider the average, to conclude.

Play around with different values and their combinations:

  • Increase the number of genomes to 100. This is fairly close to a human gut, but a lot less than in soil or water samples.
  • Increase the dynamic range to 100. This means the community is more dominated by the most abundant organisms, i.e. lower diversity.
  • Increase \(N\) to 10 million. A metagenome data set may have 100 million reads or more.

2.1.2 Exercise solution

M <- 20                                             # number of genomes
G <- round(runif(M, min = 2000000, max = 5000000))  # genome sizes
dynamic.range <- 10                 # largest abundance is 10 times the smallest
b <- log(dynamic.range) / (M - 1)
a <- 1 / sum(exp(-b * (0:(M-1))))
C <- a * exp(-b * (0:(M-1)))

L <- 300          # read length, counting a read pair (2 * 150) together
N.tot <- 10000000 # 10 million reads, one of the values to try
A <- sum(G * C)
N <- round(N.tot * (G * C) / A)   # expected number of reads per genome
D <- (N * L) / G                  # read coverage per genome
print(D)
##  [1] 101.50686  89.92152  79.65852  70.56687  62.51284  55.37801  49.05761
##  [8]  43.45842  38.49839  34.10446  30.21205  26.76385  23.70918  21.00320
## [15]  18.60601  16.48243  14.60125  12.93480  11.45847  10.15071





3 Metagenome assembly

The same tools we used on data from single bacteria (isolates) may also be used for metagenome data. The software tools often have a ‘metagenome option’ that changes their behavior slightly, but apart from that the procedure is quite similar.

Before we start the assembly of the data, we need to consider the extra possibility of host contamination, which is something we never mentioned for isolates.

3.1 De-contamination

When collecting DNA from a host, e.g. from a person, a fish or some other animal, we often find that cells from the host are very much present in our sample, and that a large fraction of the total DNA is from the host, not from the microbial community we are interested in. Thus, it is quite common to try to get rid of this contaminating DNA prior to assembly. When we talked about assembly in a previous module we mentioned data pre-processing, and host de-contamination is another such step. Note that this is only relevant for metagenomes collected from some other living organism that may contribute contaminating DNA.

In order to do this, you need the genome of the host. Then, we typically use the read-mapping we saw in module 5. We map all the reads to the host genome, and then collect all the reads that did not map for our downstream analysis. It is very important that you have removed adapters prior to this step (another pre-processing step), since reads with adapters in them may fail to map to the host even when they actually should.

Let us look at an example with data from a salmon gut. Here is the core part of a shell script for de-contaminating a pair of Illumina fastq files:

#############
### Settings
###
threads=10
r1=$COURSES/BIN310/module8/salmongut_R1.fastq.gz
r2=$COURSES/BIN310/module8/salmongut_R2.fastq.gz
host_index=$COURSES/BIN310/module8/salmon_genome
out_dir=$SCRATCH/decontaminated
r1_out=$out_dir/$(basename $r1)
r2_out=$out_dir/$(basename $r2)
if [ ! -d $out_dir ]
then
  mkdir $out_dir
fi


###############################
### Readmapping to host and 
### collecting unmapped reads
###
apptainer exec $HOME/module5/bowtie2:2.5.1--py39h6fed5c7_2.sif bowtie2 \
--threads $threads -x $host_index -1 $r1 -2 $r2 \
| apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools view \
--threads $threads -b - \
| apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools fastq \
-1 $r1_out -2 $r2_out -f 12 -

Note that you may need to edit the path to your containers to make this work.

In addition to the two input files ($r1 and $r2) we need the host genome; in this case the host is salmon (Salmo salar). Here we use bowtie2 for the read mapping, and an index made from the salmon genome is in $host_index. We output the files to some folder under $SCRATCH as usual. This is basically the same as we did in Module 5.

The subsequent processing with samtools deserves a comment. We saw in module 5 how we may use samtools to post-process the .sam files produced by read mapping. Here the output is piped directly into samtools view, and its output is again piped into samtools fastq. The latter is the crucial step, collecting some of the reads and writing them to the new files ($r1_out and $r2_out). The option -f 12 means we collect only the read-pairs where neither read in the pair mapped to the genome. Let us understand how this works.

Take a look at the website https://broadinstitute.github.io/picard/explain-flags.html.

By giving different values to the -f option, we can turn on various flags, i.e. change the criterion for what types of reads we want to collect. From the website above, we can see there are 12 different flags we can turn on, i.e. ‘read paired’, ‘read mapped in proper pair’, ‘read unmapped’ etc. To indicate what we have turned on, we can write this as a vector with 12 integers, where 0 means ‘turned off’ and 1 means ‘turned on’. Turning everything off means we have the vector 000000000000 (12 zeros).

Now, think of 000000000000 as a binary number with 12 bits (binary digits). This is a number in base 2 instead of the base 10 we are used to. Unfamiliar with the binary number system? This is something you must learn! See https://www.mathsisfun.com/binary-number-system.html.

Bit number 12 corresponds to the first flag, bit number 11 corresponds to the second flag, etc. We want to turn on flags number 3 and 4 (‘read unmapped’ and ‘mate unmapped’, meaning both reads in a pair are unmapped). This means we want the binary number 000000001100, i.e. the bits corresponding to flags 3 and 4 are set to 1. Now, if you translate this binary number into our usual decimal system, you get the value 12! It is simply \[0\cdot 2^{11} + 0\cdot 2^{10} +...+ 0\cdot 2^4 + 1\cdot 2^3 + 1\cdot 2^2 + 0\cdot 2^1 + 0\cdot 2^0 = 12\] Try out the app on the website above, and see how choosing flags changes the value, and vice versa.
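We can do this arithmetic directly in R: the flags ‘read unmapped’ (bit value \(2^2 = 4\)) and ‘mate unmapped’ (bit value \(2^3 = 8\)) combine to 12:

read.unmapped <- 4                     # flag 3, bit value 2^2
mate.unmapped <- 8                     # flag 4, bit value 2^3
read.unmapped + mate.unmapped          # 12, the value we give to -f
bitwOr(read.unmapped, mate.unmapped)   # the same, as a bitwise OR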

Using binary digits like this is a clever way of finding all possible combinations of on/off…

3.1.1 Exercise - effect on the salmon data

If you try to build the shell script based on the above code, it will take hours to complete since there are quite many reads to map. But, you do not need to!

I have already run the de-contamination on these data, and the resulting fastq files are in the same folder as the input files, with decontaminated in their names. Count how many read-pairs we had originally, and how many we are left with after de-contamination. The files are big, and it may be a good idea to do this in the Terminal rather than in R. Use zcat <filename> and pipe this into wc -l. This gives you the number of lines in the file, and since each read occupies four lines in a fastq file, dividing by 4 gives the number of reads. Do this for the raw R1 and the de-contaminated R1 file.

The host contamination varies a lot between organisms. Salmon (fish) gut samples seem to contain a lot of host cells, while in human gut samples the percentage is usually much lower (but not zero!). Having good wet-lab procedures to get rid of host cells prior to sequencing is of great value. It is a waste of sequencing effort if we end up throwing away 90% of the reads before the real analysis starts!

3.1.2 Exercise solution

zcat $COURSES/BIN310/module8/salmongut_R1.fastq.gz | wc -l
147739372

zcat $COURSES/BIN310/module8/salmongut_decontaminated_R1.fastq.gz | wc -l
39588516

# 39588516/147739372 is around 0.27, so we are left with around 26% of the reads
# after de-contamination, i.e. 74% were from the host!



3.2 Assembly

In the files

  • $COURSES/BIN310/module8/metagenome30_R1.fastq.gz
  • $COURSES/BIN310/module8/metagenome30_R2.fastq.gz

you have Illumina HiSeq reads from a metagenome, free of adapters and host DNA. Let us see how we can assemble this with spades. Here is the entire shell script, using the same container we used in Module 4:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=250G                       # We need memory for metagenomes
#SBATCH --time=10:00:00                  # We need more time for metagenomes
#SBATCH --job-name=metaspades
#SBATCH --output=metaspades_%j.log

##############
### Settings
###
threads=10
memory=250
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
out_dir=$SCRATCH/metaspades

######################
### Running software
###
apptainer exec $HOME/module3/spades:3.15.5--h95f258a_1.sif metaspades.py \
-o $out_dir --tmp-dir $TMPDIR --memory $memory --threads $threads \
--pe-1 1 $r1 --pe-2 1 $r2

#############################
### Copying contigs to $HOME
###
cp $out_dir/contigs.fasta metaspades_contigs.fasta

It is actually quite similar to what we did in Module 4.

We use metaspades.py instead of just spades.py this time. In principle we could have used spades.py with the option --meta. Note we also reserve more memory, here 250 gigabytes, and we need the hugemem partition for this. The rest is quite similar to the assembly of a single genome.

For the exact same metagenome, we also have Nanopore reads at the same coverage(s). These data are in

  • $COURSES/BIN310/module8/metagenome30.fastq.gz

Here is how we may use canu as we did in Module 4 to assemble these reads:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=250G
#SBATCH --time=10:00:00
#SBATCH --job-name=metacanu
#SBATCH --output=metacanu_%j.log

##############
### Settings
###
threads=10
memory=250g
genome_size=5m
reads=$COURSES/BIN310/module8/metagenome30.fastq.gz
out_dir=$SCRATCH/metacanu
prefix=Nanopore

######################
### Running software
###
apptainer exec $HOME/module4/canu:2.2--ha47f30e_0.sif canu \
-d $out_dir -p $prefix genomeSize=$genome_size maxThreads=$threads maxMemory=$memory \
maxInputCoverage=10000 corOutCoverage=10000 corMhapSensitivity=high \
corMinCoverage=0 redMemory=32 oeaMemory=32 batMemory=200 \
-nanopore $reads

#############################
### Copying contigs to $HOME
###
cp $out_dir/$prefix.contigs.fasta metacanu_contigs.fasta

We have added some more options this time, tweaks suggested by the people behind canu for use on metagenome data. Suggestions like these change over time and depend on whom you ask, and we will not spend time digging into them here. You may look this up in the (huge) user manual for canu if you like.

Metagenome assemblies take a lot more time (hours) and resources than single-genome assemblies, and you are not required to actually run the scripts above. Instead, I have copied the resulting contigs to

  • $COURSES/BIN310/module8/metaspades_contigs.fasta
  • $COURSES/BIN310/module8/metacanu_contigs.fasta

Assembling a metagenome is of course more difficult than when all your reads are from a single genome. Reads from different organisms may overlap each other, and contigs may then become ‘mosaics’ of pieces actually belonging to different genomes. We also have wildly varying coverage, and as we saw above, only the more abundant genomes may actually be re-constructed to some degree.

3.2.1 Exercise - evaluate with metaquast

Evaluate the contigs we got from both the Illumina and Nanopore data, listed above. Again we use the quast software as we did in Module 4. Copy the script from module 4, and make the required changes.

The first change is to simply replace quast.py with metaquast.py in the command line.

We would also like to give metaquast some reference genomes to compare against. In this case we actually have the exact genomes that were in the metagenome. This is of course never the case in real life, but we may have some other genomes quite similar to those we expect to find in the metagenome, and use these as references. The reference genomes are found in

  • $COURSES/BIN310/module8/ref_genomes

Try to find out how you specify this to metaquast.py.

Run this evaluation for both sets of contigs (Illumina and Nanopore). Output the results for metaspades_contigs.fasta and metacanu_contigs.fasta to two different folders in your module8 folder.

Open the report.html file for both contig sets, and compare. There are many statistics to consider here, but in the plot that is shown, look up the Genome fraction plot (click on the ‘tabs’ above the plot). This lists how much of each genome was re-constructed, and shows perhaps the most important difference between Illumina and Nanopore data for metagenome assembly.

3.2.2 Exercise solution

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=metaquast
#SBATCH --output=metaquast_%j.log

##############
### Settings
###
threads=10
contigs=$COURSES/BIN310/module8/metaspades_contigs.fasta  # comment in/out
out_dir=metaquast_spades                                  # comment in/out
# contigs=$COURSES/BIN310/module8/metacanu_contigs.fasta    # comment in/out
# out_dir=metaquast_canu                                    # comment in/out
ref=$COURSES/BIN310/module8/ref_genomes

######################
### Running metaquast
###
apptainer exec $HOME/module4/quast:5.2.0--py39pl5321h4e691d4_3.sif metaquast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs

Run first using the spades contigs, then comment out/in the use of canu contigs.

Here is a short video discussing the results.



3.3 Co-assembly

We have seen that in a metagenome, the genomes with the lowest abundance will be difficult or even impossible to assemble. However, sometimes we sequence several, and quite similar, metagenomes. Instead of assembling them one by one, could we not just merge them into one huge meta-metagenome? If the low-abundance genomes are the same in all or most of these read sets, then merging them like this may give a high enough coverage to enable us to assemble some of the genomes we otherwise miss.

This is what we call co-assembly, i.e. we simply merge fastq-files from several data sets into a huge meta-metagenome, and work on this.

One problem with this approach is the use of resources. Each read set is usually big, and merging them means we get a really huge read set. Do we have enough memory for this, even on an HPC? How long will the co-assembly take? This is something you need to consider. One way to answer this is to run some smaller jobs first, and then collect the resources used, just like we did in assignment 1.

Once a job has been completed, we may get listed how much memory was consumed and how much CPU-time it took. The command for this is sacct, and here is an example of its use:

sacct -j <jobid> --format=maxrss,cputime,elapsed

where you replace <jobid> by the proper JOBID for the job. The option --format=... is not required, but here we specify that we want to list

  • maxrss. This is the maximum memory used during the job. An average memory is of no interest, since the bottleneck is always the maximum needed. A job like assembly will allocate and free memory several times during its computations, and we need to reserve enough in #SBATCH --mem= to meet this maximum.
  • cputime. This is the computing time, i.e. the number of hours the processors are actually in use (roughly the elapsed time multiplied by the number of cores used). This is what you pay for when using sigma2!
  • elapsed. This is the actual time it took to complete the job. When specifying the wall time in our scripts (the #SBATCH --time=), it must be larger than this.

We will not do any co-assemblies here, since the procedure is the same as for a single sample; you just merge (concatenate) the multiple fastq files into a single pair of files before you start.





4 Binning contigs

Genomes re-constructed from metagenomes are often referred to as Metagenome Assembled Genomes or MAGs. The assembly step above is of course the first and most vital part of this. Let us now briefly have a look at the remaining steps before we have some MAGs. The first step is binning.

The contigs we get from the assembly will in general be from different genomes. Binning the contigs simply means we cluster or group them, such that contigs in the same group presumably come from the same genome. This is especially important when the assembly results in many (and smallish) contigs, which is often the case for short-read (Illumina) data.

There are two pieces of information often used when binning contigs:

  • The mean coverage, either read coverage or K-mer coverage, for each contig
  • The sequence content, e.g. the GC-fraction or something more advanced, for each contig.

All contigs from the same genome should have similar coverage. As long as genomes have (very) different abundances, it should be possible to distinguish the contigs from this information alone. However, organisms with similar abundances may of course occur in our data. Luckily, prokaryotes also vary in their GC-content. If we have a fairly long contig, the GC-content can be computed with good precision. Thus, contigs from the same organism should also have similar GC-content. In reality the GC-content idea may be extended: computing GC-content corresponds to the idea that a DNA sequence consists of 2 symbols only (G or C could be replaced by X, A or T replaced by Y), and we then count 1-mers! Of course, DNA has 4 letters, and we may instead count canonical K-mers with \(K>1\). Different genomes may have different patterns of such K-mers, and these may differ even when the GC-contents are quite similar.
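As a small illustration of this idea, here is a sketch in R counting K-mer frequencies in a sequence. It uses plain K-mers for simplicity; real binning tools use canonical K-mers, where each K-mer is merged with its reverse-complement:

library(stringr)

# Relative frequencies of all overlapping K-mers in a sequence
kmer_frequencies <- function(sequence, K = 4) {
  n <- str_length(sequence)
  kmers <- str_sub(sequence, start = 1:(n - K + 1), end = K:n)
  table(kmers) / length(kmers)
}

# Two toy 'contigs' with identical GC-content (0.5) but different K-mer patterns
s1 <- "ATGCATGCATGCATGCATGC"
s2 <- "AGCTAGCTAGCTAGCTAGCT"
kmer_frequencies(s1, K = 2)
kmer_frequencies(s2, K = 2)

Even though both toy sequences have GC-content 0.5, their 2-mer profiles are completely different, and this is the kind of signal binning tools exploit (maxbin below uses \(K=4\), i.e. tetranucleotides).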

4.1 The maxbin idea

We will have a look at one of the many software tools available for binning: maxbin.

First, read the original paper on maxbin:

The 2.0 version of this software (the one we typically use) has a newer publication, but the basic ideas are the same, and they are better described in the original paper. Again we focus only on the Methods section; here is a guide:

  • What type of information is used from each contig?
  • How is the number of bins determined?
  • What is meant by Euclidean distance between tetranucleotide frequencies? (the \(dist(tetra(S_1), tetra(S_2))\))
  • What is meant by \(P_{dist}\)?
  • What is meant by \(P_{cov}\)?
  • An ambition is to understand the EM-algorithm idea described.

I have recorded a short video on this:



4.2 Using maxbin

Download the newest container for maxbin in the usual way from galaxy. Store it in your module8 folder.

Here is the essential code for binning the contigs from metaspades that were mentioned above, using the maxbin software:

##############
### Settings
###
threads=10
contigs=$COURSES/BIN310/module8/metaspades_contigs.fasta
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
out_dir=$SCRATCH/maxbin
prefix=$out_dir/metaspades
if [ ! -d $out_dir ]
then
  mkdir $out_dir
fi

###################
### Running maxbin
###
apptainer exec $HOME/module8/maxbin2:2.2.7--he1b5a44_2.sif run_MaxBin.pl \
-thread $threads -min_contig_length 500 -markerset 107 -contig $contigs \
-reads $r1 -reads2 $r2 -out $prefix

It is impossible to cluster very short contigs; their coverage, K-mer counts etc. become too uncertain. Hence the minimum contig length.

The -markerset option must be mentioned. This is a set of very prevalent prokaryotic genes, and maxbin will search for these in the contigs of each bin. Based on this, maxbin reports an estimate of how complete each bin is, i.e. how close we are to having recovered an entire genome. The default is to use the set of \(107\) such marker genes. If, however, your data are from an ‘alien’ environment lacking most of the prokaryotes we have in our public databases, you could switch to the smaller set of the \(40\) most ultra-common genes. Thus, you use either -markerset 107 or -markerset 40.

We also need to supply the read files, since maxbin involves a read-mapping (using bowtie2!) to compute read coverages for each contig.

The output is several files, all having the prefix we set in $prefix. For each bin (=cluster) there will be a fasta file named <prefix>.001.fasta and so on with increasing numbers, listing all contigs assigned to the respective bin. The <prefix>.summary is a small text file summarizing each bin.
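A quick way to inspect this summary from R; a minimal sketch assuming the file is tab-separated (check the raw file if your maxbin version formats it differently):

library(tidyverse)

# Edit the folder to the $out_dir you used in the script above
summary.tbl <- read_delim("/mnt/SCRATCH/larssn/maxbin/metaspades.summary", delim = "\t")
print(summary.tbl)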

4.2.1 Exercise - plot maxbin results

First, make the shell script from the code above, and sbatch to get some maxbin results. You only need 10GB of memory and 1 hour wall time; the job should not take more than a few minutes to complete.

Let us plot the contigs binned by maxbin. We plot each contig as a point in the coordinate system of GC-content and K-mer coverage. The dots should be colored by their bin and the point size should reflect the length of the contig.

Make an R-script where you read the fasta-files containing the binned contigs (the enumerated fasta-files in the $out_dir from the shell script above). Here is how we typically get listed the names of these files from R:

bin.files <- list.files("/mnt/SCRATCH/larssn/maxbin", pattern = "metaspades.+fasta", full.names = T)

We use the list.files() function. The first input is the folder whose content we want to list. The pattern specifies some pattern that is matched to the file names, and only files matching this are listed. If not specified, all files in the folder are listed.

First, create an empty table maxbin.tbl <- NULL. Then use a for-loop to read one file at a time. For each file you get a table (Fasta object). Then, mutate() in new columns:

  • Column length containing the length of each contig. This is the str_length() of the Sequence for each contig.
  • Column GC containing the GC-content of each contig. Must be computed from Sequence as well. Hint: remove all A's and T's from the sequences, and then compute their lengths. Divide this by the full length.
  • Column Kmer_coverage containing the K-mer coverage of each contig. You find this in the Header of each contig, use word() and as.numeric().
  • Bin for each contig, put into column bin. This is the middle part of the fasta filename.
  • Use bind_rows() to bind this table to the existing maxbin.tbl and store the result into maxbin.tbl. Thus maxbin.tbl should grow with some rows for each iteration of the loop.

After this step you should have a table with many rows, and four columns in addition to Header and Sequence.

Plot the contigs as points, using GC on the x-axis, Kmer_coverage on the y-axis, map the point size to length and map the color to bin. It is a good idea to set the alpha (transparency) to 0.5 or something like that.

It could be an idea to only plot the contigs with length above 1000 or something like that, in order to make the plot a little less ‘noisy’.

Do the contigs from the various bins form groups in this figure?

4.2.2 Exercise solution

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=maxbin
#SBATCH --output=maxbin_%j.log

##############
### Settings
###
threads=10
contigs=$COURSES/BIN310/module8/metaspades_contigs.fasta
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
out_dir=$SCRATCH/maxbin
prefix=$out_dir/metaspades
if [ ! -d $out_dir ]
then
  mkdir $out_dir
fi

###################
### Running maxbin
###
apptainer exec $HOME/module8/maxbin2:2.2.7--he1b5a44_2.sif run_MaxBin.pl \
-thread $threads -min_contig_length 500 -markerset 107 -contig $contigs \
-reads $r1 -reads2 $r2 -out $prefix
And then the R script for reading and plotting the bins:

library(tidyverse)
library(microseq)

bin.files <- list.files("/mnt/SCRATCH/larssn/maxbin", pattern = "metaspades.+fasta", full.names = T)
maxbin.tbl <- NULL
for(i in 1:length(bin.files)){
  maxbin.tbl <- readFasta(bin.files[i]) %>%
    mutate(length = str_length(Sequence)) %>%
    mutate(GC = str_length(str_remove_all(Sequence, "[AT]"))) %>%
    mutate(GC = GC / length) %>%
    mutate(Kmer_coverage = as.numeric(word(Header, 6, 6, sep = "_"))) %>%
    mutate(bin = str_remove_all(basename(bin.files[i]), "metaspades.|.fasta")) %>%
    bind_rows(maxbin.tbl)
}

p <- ggplot(maxbin.tbl) +
  geom_point(aes(x = GC, y = Kmer_coverage, color = bin, size = length), alpha = 0.5) +
  scale_y_log10()
print(p)

Each dot is a contig, colored by the bin it has been assigned to. The size of the dot reflects the length of the contig. We see the longer contigs also have larger K-mer coverages. In general the contigs of the same bin appear in clusters, as they should. I would not put much trust in the bins with many very-low-coverage contigs.

4.2.3 Exercise - How well did maxbin actually bin the contigs?

Note that maxbin tries to group (bin) the contigs without any prior knowledge of how many genomes we have sequenced, or what these genomes look like.

In an earlier exercise, we looked at the results of running metaquast on the same contigs that maxbin clustered above. But then metaquast aligned all contigs to the actual ‘true’ reference genomes, the genomes from which the reads originally came. Hence, we presume most contigs were assigned to their correct genome by metaquast (unless the contigs were really bad). This means we now have a golden opportunity to see if maxbin binned together those contigs that metaquast aligned to the same genome.

Look into the directory where you output the metaquast results. In the subfolder runs_per_reference/ you find one folder for each reference genome. In each reference genome folder there is a folder named contigs_reports/, and in this a file named all_alignments_<something>.tsv. This table gives us information about which contigs were assigned to the corresponding reference genome (the name of the folder above). NOTE: The data are formatted slightly strangely in this file, but it should be possible to read it with read_delim() (even if it gives some warnings). Read these tables using a loop, much like you did in the previous exercise. Create a table named metaquast.tbl <- NULL, and use bind_rows() to grow the table in each iteration. This table should have the columns Header (the contig Headers) and reference_genome (taken from the name of the reference genome folder). Now, this metaquast.tbl contains the ‘true’ binning of the contigs, i.e. contigs assigned to the same reference genome should also be binned together by maxbin.

How should we compare this binning to the one we got from maxbin?

From the previous exercise you have the maxbin.tbl, and you only need the columns bin, length and Header. Start by full_join()-ing this with the metaquast.tbl, using Header as the joining column. Next, remove all rows with missing data (NA) in them (use drop_na()). Then group the table by both bin and reference_genome and use summarise() to compute the sum of the lengths for each group.

If a bin is perfect, it should occur only once in the summarised table, meaning all its contigs were found in one single reference genome. The quality of a bin can be seen from the distribution of the summed lengths: in a good bin, the vast majority of the contig lengths come from a single reference genome.

How good are the bins? Can you make a plot that illustrates how ‘contaminated’ each bin is?

4.2.4 Exercise solution

### First, run the script in the previous exercise

ref.dirs <- list.dirs("metaquast_spades/runs_per_reference", full.names = T, recursive = F)
metaquast.tbl <- NULL
for(i in 1:length(ref.dirs)){
  metaquast.tbl <- read_delim(file.path(ref.dirs[i], "contigs_reports", "all_alignments_metaspades_contigs.tsv"), delim = "\t") %>%
    drop_na(Contig) %>%
    select(Header = Contig) %>%
    mutate(reference_genome = basename(ref.dirs[i])) %>%
    bind_rows(metaquast.tbl)
}

final.tbl <- maxbin.tbl %>%
  select(bin, Header, length) %>%
  full_join(metaquast.tbl, by = "Header") %>%
  drop_na() %>%
  group_by(bin, reference_genome) %>%
  summarise(length = sum(length))

p2 <- ggplot(final.tbl) +
  geom_col(aes(x = bin, y = length, fill = reference_genome)) +
  coord_flip() +
  labs(y = "Size of bin (bases)", x = "maxbin bin") +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  guides(fill = guide_legend(nrow = 10))
print(p2)

Each bar is a bin, and if the binning was in line with the quast alignments, each bar should have one single color only, i.e. all contigs in the bin should have been aligned to the same reference genome by quast. It looks like bins 002, 003, 004, 005, 006 and perhaps 010 are pretty single-colored. The rest are rubbish! Mixtures of contigs from all kinds of reference genomes…

Note that in this case we know that all 30 reference genomes were in the metagenome. Of these, maxbin claims to re-construct 13, since there are 13 bins. Of these, perhaps 6 can be said to be close to re-constructed, based on the figure above. And even these are slightly contaminated with contigs from other genomes.

And this is based on simulated data (better than real data) from a metagenome with “only” 30 species. This indicates how difficult it is to re-construct genomes from metagenomes. It should be noted that with Nanopore data we would most likely have gotten better results.



4.3 The final MAGs

Other binning software tools may produce better results than what we got with maxbin above, but we can safely say that binning is a difficult task, and tools produce far from perfect results. We should always keep in mind that the bins we get may to some degree be ‘mosaics’, in the sense that contigs will have some errors in them, and that the binning may actually group contigs that come from different genomes. For this reason it is not uncommon to use several different binning tools on the exact same contigs. They will come up with (slightly) different bins, and the bins they agree upon we then trust more than the others.

Also, as mentioned previously, we may sequence several metagenomes from very similar environments, and expect that more or less the same genomes are found in many of them. Instead of doing a co-assembly, as mentioned above, we may still assemble them separately, but then try to de-replicate the bins in the end.

Whether we have many similar bins due to binning the same contigs with several tools, or due to assembling several data sets from the same environment, we would now like to

  • Get some quality assessment of the bins. We will use the checkm2 software for this purpose.
  • Group together those bins who are actually (parts of) the same genome. We will use the dRep software for this purpose.

We use checkm2 to get values for the completeness and contamination of each bin. The completeness is estimated by scanning the bin for housekeeping genes we expect all complete genomes to have: if 10% of these are lacking, the completeness is 90%. The contamination is estimated by looking for genes we expect to find in a single copy only: if some of these occur in two or more copies, it indicates a mosaic bin including contigs that should perhaps not be there, resulting in contamination values larger than 0. Ideally we should have completeness 100% and contamination 0%, but the default thresholds are 75% and 25%, i.e. we accept some mosaic both ways.
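To make the two measures concrete, here is a toy R sketch of the kind of bookkeeping involved (the actual checkm2 estimates are model-based and more sophisticated than this):

n.markers <- 107       # single-copy marker genes searched for in a bin
n.found <- 97          # markers found at least once
n.duplicated <- 3      # markers found in more than one copy

completeness <- 100 * n.found / n.markers         # ~90.7 percent
contamination <- 100 * n.duplicated / n.markers   # ~2.8 percent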

We then use these results from checkm2 as input to dRep, together with the bins themselves. dRep will simply group together bins that are so similar they most likely represent the same genome. Such bins may exist for at least two reasons:

  • We have assembled/binned several samples, and the same organism (genome) was found in two or more of these samples.
  • We have binned the contigs using two or more binning software tools, and the same bin was discovered by at least two them.

Download the containers for checkm2 and dRep. Make certain the latter is version 3.0 or later, since previous versions behave slightly differently.

Here is a shell script for running these two on the bins we got from maxbin earlier:

#!/bin/sh

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=50G
#SBATCH --job-name=MAGmaker
#SBATCH --output=MAGmaker_%j.log
#SBATCH --time=01:00:00

##############
### Settings
###
bins_folder=$SCRATCH/maxbin
checkm_folder=checkm
dRep_folder=dRep
min_completeness=75           # minimum percent completeness required for MAG (default 75)
max_contamination=25          # maximum percent contamination required for MAG (default 25)
threads=10
checkm_database=/mnt/databases/checkm2/uniref100.KO.1.dmnd

#######################################################
### Create folders
###
if [ ! -d $checkm_folder ]
then
  mkdir -p $checkm_folder
fi
if [ ! -d $dRep_folder ]
then
  mkdir -p $dRep_folder
fi

#####################
### Running checkm2
###
apptainer exec checkm2:1.0.1--pyh7cba7a3_0.sif checkm2 predict \
--extension .fasta \
--threads $threads \
--input $bins_folder \
--output-directory $checkm_folder \
--allmodels \
--database_path $checkm_database

############################################################
### Converting checkm results to .csv file for dRep input
###
module load R/4.3.1
Rscript -e "library(tidyverse)
checkm_folder <- commandArgs(trailingOnly = T)
checkm.tbl <- read_delim(file.path(checkm_folder, 'quality_report.tsv'), delim = '\\t') %>%
  mutate(genome = str_c(Name, '.fasta')) %>%
  select(genome, completeness = Completeness_General, contamination = Contamination)
write_delim(checkm.tbl, delim = ',', file = file.path(checkm_folder, 'checkm2dRep.csv'))" $checkm_folder

##################
### Running dRep
###
apptainer exec $HOME/module8/drep:3.4.5--pyhdfd78af_0.sif dRep dereplicate \
$dRep_folder \
-p $threads \
-g $bins_folder/*.fasta \
--genomeInfo $checkm_folder/checkm2dRep.csv \
--completeness $min_completeness \
--contamination $max_contamination

Make the script in your module8 folder and sbatch. Make sure the $bins_folder has the name you used when running maxbin above.

Notice we first run checkm2 on the bins from maxbin. This produces some output in the $checkm_folder. Then we use R to read the file quality_report.tsv, transform it slightly, and write the resulting table to checkm2dRep.csv. This is required because dRep expects these results in a slightly different format than what checkm2 outputs. This is an excellent example of how we need some coding to adapt the output from one tool to work as input to the next!

We then run dRep, and notice we here set the quality thresholds for our MAGs ($min_completeness and $max_contamination).

From the output we got from checkm2 and dRep we now make a table in R with the essential results:

library(tidyverse)

checkm_folder <- "checkm"
dRep_folder <- "dRep"

checkm.tbl <- read_delim(file.path(checkm_folder, "quality_report.tsv")) %>% 
  select(bin_id = Name, completeness = Completeness_General, contamination = Contamination)

MAG.tbl <- tibble(MAG_file = list.files(file.path(dRep_folder, "dereplicated_genomes"))) %>% 
  mutate(MAG_id = str_remove(MAG_file, ".fasta$")) %>% 
  left_join(checkm.tbl, by = c("MAG_id" = "bin_id"))
MAG.tbl <- read_delim(file.path(dRep_folder, "data_tables", "Cdb.csv"), delim = ",") %>% 
  select(MAG_file = genome, cluster = secondary_cluster) %>% 
  right_join(MAG.tbl, by = "MAG_file")

Inspect the table MAG.tbl. It looks like we ended up with 6 MAGs, given the quality thresholds and de-replication. If you inspect the checkm.tbl it becomes apparent that it was the completeness threshold that eliminated some bins here.

The cluster column tells us which bins are actually the same genome. Here none of the bins are grouped together, which is as expected since we have only one sample, and have used only one software tool to produce these bins. Note the clustering is in two ‘levels’, e.g. 1_0. This means ‘primary cluster 1’ and then ‘secondary cluster 0’ inside the primary cluster. In some cases bins may just be different strains of the same species, and then typically have the same primary cluster value, but differ in the secondary. The clustering of the bins/genomes is based on the software fastANI, which we mentioned under K-mer based comparisons in Module 7. It is quite similar to the mash software.
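If you want the two cluster levels as separate columns in MAG.tbl, a small addition like this should work (separate() is part of tidyr, loaded with tidyverse):

MAG.tbl <- MAG.tbl %>% 
  separate(cluster, into = c("primary_cluster", "secondary_cluster"),
           sep = "_", remove = FALSE)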

4.3.1 Exercise - compare MAGs to references

Since we now have the luxury of having the actual genomes that were in this metagenome, it would be interesting to see if the MAGs we got are actually very close to some of the 30 genomes we started out with.

Here are some approaches we could use:

  • Use quast as we did for single genomes, and use one MAG at the time as contigs, using all reference genomes as we did for metaquast above. Ideally, the MAG should match 100% to one and only one of the reference genomes.
  • Compute mash distances between all genomes, like we did in Module 7, make a tree, and collect the distance from each MAG to its nearest neighbor among the reference genomes. A distance (very) close to 0 means we have a (very) good re-construction of some reference-genome.
  • Compute the pan-genome for all genomes using prodigal and roary, and make a pan-genome tree. Again find the nearest neighbor for each MAG, and see how many predicted genes they do not share. This requires more effort, but indicates whether whole genes are re-constructed in the MAG.

Use the metaquast approach mentioned above. I would evaluate one MAG at a time, and use all the reference genomes we used when we evaluated with metaquast above.

4.3.2 Exercise solution

The code for evaluating one of the MAGs:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=metaquast
#SBATCH --output=metaquast_%j.log

##############
### Settings
###
threads=10
contigs=dRep/dereplicated_genomes/metaspades.004.fasta   # edit this
out_dir=metaquast_MAG
ref=$COURSES/BIN310/module8/ref_genomes

######################
### Running metaquast
###
apptainer exec $HOME/module4/quast:5.2.0--py39pl5321h4e691d4_3.sif metaquast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs

Here is one figure from metaquast (the ugly scroll-bar solution for PDF-files seems difficult to avoid…):

knitr::include_graphics("fig/Genome_fraction.pdf")

We notice that the Corynebacterium genome seems to be the one this MAG has re-constructed, but the MAG also contains some content from some of the other genomes.