Over the first modules we have focused on single genomes. We may have data from many different genomes, but each of them has been sequenced separately, as isolates, i.e. we isolate a single strain of bacteria in the lab, grow it, extract its DNA and sequence this.
We now turn our attention to metagenomes. This means we no longer attempt to isolate single bacteria, but sequence all DNA we collect from some environment. This is often what we have to do, since isolating the bacteria is simply impossible; we have no methods for growing them in the lab.
A metagenome is not the genome of a single organism, but rather the collection of all different genomes found in some microbial community. The classical example is the metagenome of the human gut. Instead of trying to isolate the various species, we simply extract all DNA from all organisms in the gut, and sequence this. Metagenomics has become an important part of microbiology, and metagenome studies are conducted in many branches of biology these days, trying to understand the role of the microbes in the interplay with animals, plants, soil, water, air…
The course BIO326 is more specifically devoted to metagenomics and the re-construction of genomes. Hence, in this module of BIN310 we only look briefly into this topic; for more on this, consider BIO326. We will also recognize computational solutions we used in previous modules, and in many ways the bioinformatics work is rather similar whether you have data from isolates or metagenomes. There are, however, some aspects of metagenome data that make some additional steps necessary, and these are what we focus mostly on here.
Below we will meet several software tools: metaspades and canu for assembling the reads, metaquast for evaluating the assemblies, maxbin for binning the contigs, and checkm2 and dRep for quality assessment and de-replication of the bins.
Note that below you find code where these software tools are used, but we will not actually run them all this time. We have used most of them before, and the code below illustrates their use on metagenome data.
A metagenome is the collective genome of an entire microbial community. Metagenomics is a branch of genomics where we study such communities. We will devote the last modules of this course to metagenomics in one way or another.
Microbes and microbial communities are everywhere on this planet, and have ‘always’ been here. It is not until fairly recently that we have been able to study this branch of biology at full range. The reason is that methods in microbiology have required the isolation of single strains. Only a tiny fraction of all microbes can be isolated this way. The rest, typically \(\sim 99\%\), has been termed biological dark matter. The vast majority of biology on planet earth has been invisible to us! With today’s sequencing technology we can sequence all DNA in a sample, without the need for isolation. Such data form the basis for metagenomics, and for the first time in history we can really start to study the full microbial world.
Metagenome data are collected for many reasons. Most studies typically involve
- finding the composition of the community, i.e. which organisms are present and at which abundances, and/or
- re-constructing the genomes of the community members.
Both these purposes may just be a first step for subsequent analyses. For example, the composition may be used as features in a machine learning model predicting if something is wrong in an environment (human gut diseases, food or environment contamination etc.), and finding the genomes may be a first step in further functional studies of what biochemical/biological processes the microbes in the community may have.
So far we have only considered the sequencing of a single organism. This means we have a pool of DNA fragments from many cells, but these cells are all copies of each other. Thus all fragments are from ‘the same genome’, in the sense that they all work equally well as building blocks when we try to re-construct the genome. In this setting we also talked about read coverage.
With metagenome data, we have DNA fragments from several different cells, i.e. from different genomes. Assume we have a microbial community with \(M\) different organisms, i.e. there are \(M\) distinct genomes. Let \(G_1\), \(G_2\),…,\(G_M\) be the genome sizes of these community members, and let \(C_1\), \(C_2\),…,\(C_M\) be the genome abundances, i.e. the fraction each organism makes up of the total number of genomes in the community. Thus, these fractions sum to 1: \[\sum_{i=1}^{M} C_i = 1.0\] We often refer to \(C_i\) as the relative abundance of genome \(i\), and the vector of abundances \({\bf C}=(C_1, C_2,...,C_M)\) is the genomic composition of the community.
When we do shotgun sequencing, we first fragment all genomes randomly, and then sequence these fragments. This means that a genome that has a high abundance, or a large size, will contribute more such fragments to the total pool of fragments. Thus, the probability that a read comes from genome \(i\) should be \[P_i = G_i\cdot C_i / A\] where \(A = \sum_{i=1}^M G_i \cdot C_i\). We can think of \(A\) as the size of the fragment pool, and the probability of sequencing genome \(i\) is its fraction of this pool.
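As a tiny numeric example: with two equally abundant genomes (\(C_1 = C_2 = 0.5\)) of sizes \(G_1 = 4\) Mbp and \(G_2 = 2\) Mbp, we get \(A = 3\) Mbp and thus \(P_1 = 2/3\) and \(P_2 = 1/3\). The larger genome contributes twice as many reads, despite the equal abundances.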
What is the read coverage for genome \(i\)? Well, if we have sequenced \(N\) reads in total, the expected number of reads coming from genome \(i\) is \(N_i = P_i N\). We then plug this \(N_i\) value into the formula for read coverage for each genome.
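Spelled out, this is the same coverage formula we used for isolates, now applied per genome: \[D_i = \frac{N_i \cdot L}{G_i} = \frac{P_i \cdot N \cdot L}{G_i}\] where \(L\) is the read length. This is exactly what we will compute in the simulation below.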
Let us simulate in R to get an impression of metagenome read coverages, and how many genomes in a metagenome we can hope to re-construct.
Let us make an R script where we simulate genome sizes and relative abundances, and then compute read coverages for a given total number of reads \(N\). This is how we start out:
M <- 20 # number of genomes
G <- round(runif(M, min = 2000000, max = 5000000)) # typical bacterial genome sizes
Thus, the \(M\) genomes get some uniform random sizes between 2 and 5 mega-basepairs. Edit these limits if you like.
An important property of a microbial community is its diversity. We get a high diversity if there are many different organisms in the community. Their distribution also matters: if there are many organisms, but only a very few of them dominate in abundance, the diversity is lower than when all organisms are more or less equally abundant.
When simulating, we assume the distribution of relative abundances is exponential. This is of course a simplification, but rarely very far from the truth, except for communities completely dominated by one or a very few organisms. If we order the genomes from most to least abundant, the abundances are \[C_i = a e^{-b(i-1)}\] for \(i=1,...,M\), where \(a\) is some scaling factor and \(b\) determines the shape of the distribution. Instead of choosing values for \(a\) and \(b\), we determine both from the dynamic range of the community. This is simply the largest (first) abundance divided by the smallest (last). If we set this to some fixed value, then, since \(\sum_i C_i = 1.0\), both \(a\) and \(b\) follow. Add this to the script:
dynamic.range <- 10 # largest abundance is 10 times the smallest
b <- log(dynamic.range)/(M-1)
a <- 1 / (sum(exp(-b*(1:M-1))))
C <- a * exp(-b*(1:M-1))
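To see where these formulas come from: the ratio of the largest to the smallest abundance is \[\frac{C_1}{C_M} = \frac{a}{a\,e^{-b(M-1)}} = e^{b(M-1)}\] and setting this equal to the dynamic range gives \(b = \ln(\text{dynamic range})/(M-1)\). The constraint \(\sum_{i=1}^{M} C_i = 1\) then gives \(a = 1/\sum_{i=1}^{M} e^{-b(i-1)}\). Note that in R, 1:M-1 means (1:M)-1, i.e. the vector 0, 1, ..., M-1, so exp(-b*(1:M-1)) produces exactly these \(M\) exponential terms.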
Verify that the values in C sum to 1.0, and that the largest abundance divided by the smallest equals dynamic.range. You may also plot the abundances as a bar chart, using the simple barplot() function.
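A quick check could look like this (a minimal sketch, assuming the objects M, C and dynamic.range from the script above):
sum(C)       # should be 1.0
C[1] / C[M]  # should equal dynamic.range
barplot(C, names.arg = 1:M, xlab = "Genome", ylab = "Relative abundance")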
Finally, add the code to compute the read coverage for each genome. Use \(N=1000000\) (one million) and a typical Illumina read length \(L\). A rule of thumb says we need at least read coverage 10 to have a chance of re-constructing a genome. How many genomes may be re-constructed? Due to the random genome sizes, you need to re-run a few times, and compute the average, to conclude.
Play around with different values and their combinations:
M <- 20 # number of genomes
G <- round(runif(M, min = 2000000, max = 5000000)) # genome sizes
dynamic.range <- 10 # largest abundance is 10 times the smallest
b <- log(dynamic.range)/(M-1)
a <- 1 / (sum(exp(-b*(1:M-1))))
C <- a * exp(-b*(1:M-1))
L <- 300 # 2 * 150
N.tot <- 10000000
A <- sum(G * C)
N <- round(N.tot * (G * C) / A)
D <- (N * L) / G
print(D)
## [1] 101.50686 89.92152 79.65852 70.56687 62.51284 55.37801 49.05761
## [8] 43.45842 38.49839 34.10446 30.21205 26.76385 23.70918 21.00320
## [15] 18.60601 16.48243 14.60125 12.93480 11.45847 10.15071
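To count how many genomes pass the rule-of-thumb threshold, we can simply add:
sum(D >= 10)   # number of genomes with read coverage at least 10
With the values printed above (ten million reads), all 20 genomes pass. With one million reads the coverages are one tenth of these, and in this particular simulation only the most abundant genome reaches 10.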
The same tools we used on data from single bacteria (isolates) may also be used for metagenome data. The software tools often have a ‘metagenome option’ that changes their behavior slightly, but apart from that the procedure is quite similar.
Before we start the assembly of the data, we need to consider the extra possibility of host contamination, something we never mentioned for isolates.
When collecting DNA from a host, e.g. from a person, a fish or some animal, we often find that cells from the host are very much present in our sample, and that a large fraction of the total DNA is from the host, not the microbial community we are interested in. Thus, it is quite common to try to get rid of this contaminating DNA prior to assembly. When we talked about assembly in a previous module we mentioned data pre-processing, and host de-contamination is simply another such step. Note that this is only relevant for metagenomes collected from some other living organism that may contribute contaminating DNA.
In order to do this, you need the genome of the host. Then we typically use read-mapping, which we saw in module 5: we map all the reads to the host genome, and then collect all the reads that did not map for our downstream analysis. It is very important that you have removed adapters prior to this step (another pre-processing step), since reads with adapters in them may not map to the host even if they actually should.
Let us look at an example with data from a salmon gut. Here is the core part of a shell script for de-contaminating a pair of Illumina fastq files:
#############
### Settings
###
threads=10
r1=$COURSES/BIN310/module8/salmongut_R1.fastq.gz
r2=$COURSES/BIN310/module8/salmongut_R2.fastq.gz
host_index=$COURSES/BIN310/module8/salmon_genome
out_dir=$SCRATCH/decontaminated
r1_out=$out_dir/$(basename $r1)
r2_out=$out_dir/$(basename $r2)
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
###############################
### Readmapping to host and
### collecting unmapped reads
###
apptainer exec $HOME/module5/bowtie2:2.5.1--py39h6fed5c7_2.sif bowtie2 \
--threads $threads -x $host_index -1 $r1 -2 $r2 \
| apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools view \
--threads $threads -b - \
| apptainer exec $HOME/module5/samtools:1.17--hd87286a_1.sif samtools fastq \
-1 $r1_out -2 $r2_out -f 12 -
Note that you may need to edit the path to your containers to make this work.
In addition to the two input files ($r1 and $r2) we need the host genome; in this case the host is salmon (Salmo salar). Here we use bowtie2 for the read mapping, and an index made from the salmon genome is in $host_index. We output the files to some folder under $SCRATCH as usual. This is basically the same as we did in Module 5.
The subsequent processing with samtools deserves a comment. We saw in module 5 how we may use samtools to post-process the .sam files produced by read mapping. Here the output is piped directly into samtools view, and its output is again piped into samtools fastq. The latter is the crucial step, collecting some of the reads and writing them to the new files ($r1_out and $r2_out). The option -f 12 means we collect only the read-pairs where both reads in the pair do not map to the genome. Let us understand how this works.
Take a look at the website https://broadinstitute.github.io/picard/explain-flags.html.
By giving different values to the -f option, we can turn on various flags, i.e. change the criterion for what types of reads we want to collect. From the website above, we can see there are 12 different flags we can turn on, e.g. ‘read paired’, ‘read mapped in proper pair’, ‘read unmapped’ etc. To indicate what we have turned on, we can write this as a vector of 12 integers, where 0 means ‘turned off’ and 1 means ‘turned on’. Turning everything off means we have the vector 000000000000 (12 zeros).
Now, think of 000000000000 as a binary number with 12 bits (binary digits). This is a number in the number system with 2 as the basis instead of the 10 we are used to. Unfamiliar with the binary number system? This is something you must learn! See https://www.mathsisfun.com/binary-number-system.html.
Bit number 12 corresponds to the first flag, bit number 11 corresponds to the second flag etc. We want to turn on flags number 3 and 4 (‘read unmapped’ and ‘mate unmapped’, meaning both reads in a pair are unmapped). This means we want the binary number 000000001100, i.e. the bits corresponding to flags 3 and 4 are set to 1. Now, if you translate this binary number into our usual decimal system, we get the value 12! It is simply \[0\cdot 2^{11} + 0\cdot 2^{10} +...+ 0\cdot 2^4 + 1\cdot 2^3 + 1\cdot 2^2 + 0\cdot 2^1 + 0\cdot 2^0 = 12\] Try out the app on the website above, and see how choosing flags changes the value, or vice versa.
Using binary digits like this is a clever way of finding all possible combinations of on/off…
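As a small illustration (not part of the course scripts; the bit values follow the SAM flag specification), we can compute and decode flag values in R using bitwAnd():
read_unmapped <- 4                 # flag 3 = 2^2, 'read unmapped'
mate_unmapped <- 8                 # flag 4 = 2^3, 'mate unmapped'
read_unmapped + mate_unmapped      # 12, the value we gave to -f
flag <- 77                         # a made-up example FLAG value from a .sam file
bitwAnd(flag, read_unmapped) > 0   # TRUE if the read itself is unmapped
bitwAnd(flag, mate_unmapped) > 0   # TRUE if its mate is unmapped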
If you build a shell script from the code above and run it, it will take hours to complete, since there are quite many reads to map. But you do not need to!
I have already run the de-contamination on these data, and the resulting fastq files are in the same folder as the input files, with decontaminated in their names. Count how many read-pairs we had originally, and how many we are left with after de-contamination. The files are big, and it may be a good idea to do this in the Terminal rather than in R. Use zcat <filename> and pipe this into wc -l. This gives you the number of lines in the file, and since a fastq file uses four lines per read, we then know how many reads it contains. Do this for the raw R1 and the decontaminated R1 file.
The host contamination varies a lot between organisms. Salmon (fish) gut seems to contain a lot of host cells, while in human guts the percentage is usually much lower (but not zero!). Having good wet-lab procedures to get rid of this prior to sequencing is of great value. It is a waste of sequencing efforts if we end up throwing away 90% of the reads before the real analysis starts!
zcat $COURSES/BIN310/module8/salmongut_R1.fastq.gz | wc -l
147739372
zcat $COURSES/BIN310/module8/salmongut_decontaminated_R1.fastq.gz | wc -l
39588516
# 39588516 / 147739372 = 0.268: we are left with around 27% of the reads after de-contamination, i.e. 73% were from the host!
In the files
$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
you have Illumina HiSeq reads from a metagenome, without adapters and with no host DNA. Let us see how we can assemble this with spades. Here is the entire shell script, using the same container we used in Module 4:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=250G # We need much memory for metagenomes
#SBATCH --partition=hugemem # partition with enough memory, see text below
#SBATCH --time=10:00:00 # We need more time for metagenomes
#SBATCH --job-name=metaspades
#SBATCH --output=metaspades_%j.log
##############
### Settings
###
threads=10
memory=250
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
out_dir=$SCRATCH/metaspades
######################
### Running software
###
apptainer exec $HOME/module3/spades:3.15.5--h95f258a_1.sif metaspades.py \
-o $out_dir --tmp-dir $TMPDIR --memory $memory --threads $threads \
--pe-1 1 $r1 --pe-2 1 $r2
#############################
### Copying contigs to $HOME
###
cp $out_dir/contigs.fasta metaspades_contigs.fasta
It is actually quite similar to what we did in Module 4. We use metaspades.py instead of only spades.py this time. In principle we could have used spades.py and the option --meta. Note we also reserve more memory, here 250 gigabytes, and we need the hugemem partition. The rest is quite similar to the assembly of a single genome.
For the exact same metagenome, we also have Nanopore reads at the same coverage(s). These data are in
$COURSES/BIN310/module8/metagenome30.fastq.gz
Here is how we may use canu
as we did in Module 4 to
assemble these reads:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=250G
#SBATCH --time=10:00:00
#SBATCH --job-name=metacanu
#SBATCH --output=metacanu_%j.log
##############
### Settings
###
threads=10
memory=250g
genome_size=5m
reads=$COURSES/BIN310/module8/metagenome30.fastq.gz
out_dir=$SCRATCH/metacanu
prefix=Nanopore
######################
### Running software
###
apptainer exec $HOME/module4/canu:2.2--ha47f30e_0.sif canu \
-d $out_dir -p $prefix genomeSize=$genome_size maxThreads=$threads maxMemory=$memory \
maxInputCoverage=10000 corOutCoverage=10000 corMhapSensitivity=high \
corMinCoverage=0 redMemory=32 oeaMemory=32 batMemory=200 \
-nanopore $reads
#############################
### Copying contigs to $HOME
###
cp $out_dir/$prefix.contigs.fasta metacanu_contigs.fasta
We have added some more options this time, tweaks suggested by the people behind canu for use on metagenome data. Suggestions like these will change over time and depend on whom you ask, and we do not spend time digging into them here. You may look them up in the (huge) user manual for canu if you like.
Metagenome assemblies take a lot more time (hours) and resources than single-genome assemblies, and you are not required to actually run the scripts above. Instead, I have copied the resulting contigs to
$COURSES/BIN310/module8/metaspades_contigs.fasta
$COURSES/BIN310/module8/metacanu_contigs.fasta
Assembling a metagenome is of course more difficult than when all your reads are from a single genome. Reads from different organisms may overlap each other, and contigs may then become ‘mosaics’ of pieces actually belonging to different genomes. We also have wildly varying coverage, and as we saw above, only the more abundant genomes may actually be re-constructed to some degree.
metaquast
Evaluate the contigs we got from both the Illumina and Nanopore data,
listed above. Again we use the quast
software as we did in
Module 4. Copy the script from module 4, and make the required
changes.
The first change is to simply replace quast.py with metaquast.py in the command line.
We would also like to give metaquast some reference genomes to compare against. In this case we actually have the exact genomes that were in the metagenome. This is of course never the case in real life, but we may have some other genomes quite similar to those we expect to find in the metagenome, and use these as references. The reference genomes are found in
reference genomes are found in
$COURSES/BIN310/module8/ref_genomes
Try to find out how you specify this to metaquast.py.
Run this evaluation for both sets of contigs (illumina and nanopore).
Output results for both metaspades_contigs.fasta
and
metacanu_contigs.fasta
to two different folders in your
module8
folder.
Open the report.html file for both contig sets, and compare. There are many statistics to consider here, but in the plot that is shown, look up the Genome fraction plot (click on the ‘tabs’ above the plot). This lists how much of each genome was re-constructed, and shows perhaps the most important difference between Illumina and Nanopore data for metagenome assembly.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=metaquast
#SBATCH --output=metaquast_%j.log
##############
### Settings
###
threads=10
contigs=$COURSES/BIN310/module8/metaspades_contigs.fasta # comment in/out
out_dir=metaquast_spades # comment in/out
# contigs=$COURSES/BIN310/module8/metacanu_contigs.fasta # comment in/out
# out_dir=metaquast_canu # comment in/out
ref=$COURSES/BIN310/module8/ref_genomes
######################
### Running metaquast
###
apptainer exec $HOME/module4/quast:5.2.0--py39pl5321h4e691d4_3.sif metaquast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs
Run first using the spades contigs, then comment out/in the use of canu contigs.
Here is a short video discussing the results.
We have seen that in a metagenome, the genomes with the lowest abundance will be difficult or even impossible to assemble. However, sometimes we sequence several, and quite similar, metagenomes. Instead of assembling them one by one, could we not just merge this into one huge meta-metagenome? If the low-abundance genomes are the same in all/most of these read sets, then merging them like this may give a high enough coverage to enable us to assemble some of those that we otherwise miss.
This is what we call co-assembly, i.e. we simply merge fastq-files from several data sets into a huge meta-metagenome, and work on this.
One problem with this approach is the use of resources. Each read set is usually big, and merging them means we get a really huge read set. Do we have enough memory for this, even on an HPC? How long will the co-assembly take? This is something you need to consider. One way to answer this is to run some smaller jobs first, and then collect the resources used, just like we did in assignment 1.
Once a job has been completed, we may get listed how much memory was
consumed and how much CPU-time it took. The command for this is
sacct
, and here is an example of its use:
sacct -j <jobid> --format=maxrss,cputime,elapsed
where you replace <jobid> by the proper JOBID for the job. The option --format=... is not required, but here we specify that we want listed:
- maxrss: the maximum memory used during the job. An average memory is of no interest, since the bottleneck is always the maximum needed. A job like assembly will allocate and free memory several times during its computations, and we need to reserve enough in #SBATCH --mem= to meet this maximum.
- cputime: the computing time, i.e. the number of hours the actual processors are in use. This is what you pay for when using sigma2!
- elapsed: the actual time it took to complete the job. The wall time we specify in our scripts (#SBATCH --time=) must be larger than this.
We will not do any co-assemblies here now, since the procedure is the same as for a single sample; you just merge multiple fastq files into a single pair of files before you start.
Genomes re-constructed from metagenomes are often referred to as Metagenome Assembled Genomes or MAGs. The assembly step above is of course the first and most vital part of this. Let us now briefly have a look at the remaining steps before we have some MAGs. The first step is binning.
The contigs we get from the assembly will in general be from different genomes. The binning of the contigs simply means we cluster or group the contigs, such that those in the same group we presume are from the same genome. This is especially important when the assembly results in many (and smallish) contigs, which is often the case for short read (Illumina) data.
There are two pieces of information often used when binning contigs:
All contigs from the same genome should have similar coverage. As
long as genomes have (very) different abundance, it should be possible
to distinguish the contigs from this information alone. However,
organisms with similar abundances may of course occur in our data. But,
prokaryotes may also vary in their GC-content. If we have a fairly long
contig, the GC-content can be computed with good precision. Thus,
contigs from the same organism should also have similar GC-content. In
reality the GC-content idea may be extended. Note that computing
GC-content corresponds to the idea that a DNA sequence consists of 2
symbols only (G or C could be replaced by X, A or T replaced by Y), and
we then count 1-mers! Of course, DNA has 4 letters, and we may instead
count canonical K-mers with \(K>1\).
Different genomes may have different patterns of such K-mers, and this
may differ even if their GC-content is quite similar.
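To make the K-mer idea concrete, here is a toy sketch in R of counting canonical K-mers; the sequence is a made-up example, and real binning tools do this far more efficiently:
seq <- "ATGGCGCGTACGTTAGCATGGCG"   # hypothetical contig fragment
K <- 4
n <- nchar(seq)
kmers <- substring(seq, 1:(n - K + 1), K:n)   # all overlapping K-mers
revcomp <- sapply(strsplit(chartr("ACGT", "TGCA", kmers), ""),
                  function(x) paste(rev(x), collapse = ""))   # reverse complements
canonical <- pmin(kmers, revcomp)   # lexicographically smaller of each pair
table(canonical)                    # the canonical K-mer profile of this sequence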
The maxbin idea
We will have a look at one of the many software tools available for binning: maxbin.
First, read the original paper on maxbin:
The 2.0 version of this software (that we typically use) has a newer publication, but the basic ideas are the same, and therefore not described as well as in the original paper. Again we focus only on the Methods section, here is a guide:
I have recorded a short video on this:
maxbin
Download the newest container for maxbin in the usual way from galaxy. Store it in your module8 folder.
Here is the essential code for binning the contigs from metaspades mentioned above, using the maxbin software:
##############
### Settings
###
threads=10
contigs=$COURSES/BIN310/module8/metaspades_contigs.fasta
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
out_dir=$SCRATCH/maxbin
prefix=$out_dir/metaspades
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
###################
### Running maxbin
###
apptainer exec $HOME/module8/maxbin2:2.2.7--he1b5a44_2.sif run_MaxBin.pl \
-thread $threads -min_contig_length 500 -markerset 107 -contig $contigs \
-reads $r1 -reads2 $r2 -out $prefix
Very short contigs are impossible to cluster, since their coverage, K-mer counts etc. become too uncertain; hence the minimum contig length.
The -markerset
option must be mentioned. This is a set
of very prevalent prokaryotic genes, and maxbin
will search
for these in the contigs of each bin. Based on this, maxbin
reports an estimate of how complete each bin is, i.e. how close
we are to have recovered an entire genome. The default is to use the set
of \(107\) such marker genes. If,
however, your data are from an ‘alien’ environment lacking most of the
prokaryotes we have in our public databases, you could switch to a
smaller set of the \(40\) most
ultra-common genes. Thus, you use either -markerset 107
or
-markerset 40
.
We also need to supply the read files, since maxbin
involves a read-mapping (using bowtie2
!) to compute read
coverages for each contig.
The output is several files, all having the prefix we set in $prefix. For each bin (= cluster) there will be a fasta file named <prefix>.001.fasta and so on with increasing numbers, listing all contigs assigned to the respective bin. The <prefix>.summary file is a small text file summarizing each bin.
maxbin results
First, make the shell script from the code above, and sbatch it to get some maxbin results. You only need 10GB of memory and 1 hour wall time; this should not take more than a few minutes to complete.
Let us plot the contigs binned by maxbin
. We plot each
contig as a point in the coordinate system of GC-content and K-mer
coverage. The dots should be colored by their bin and the point size
should reflect the length of the contig.
Make an R-script where you read the fasta-files containing the binned
contigs (the enumerated fasta-files in the $out_dir
from
the shell script above). Here is how we typically get listed the names
of these files from R:
bin.files <- list.files("/mnt/SCRATCH/larssn/maxbin", pattern = "metaspades.+fasta", full.names = T)
We use the list.files()
function. The first input is the
folder whose content we want to list. The pattern
specifies
some pattern that is matched to the file names, and only files matching
this are listed. If not specified, all files in the folder are
listed.
First, create an empty table maxbin.tbl <- NULL. Then use a for-loop to read one file at a time. For each file you get a table (Fasta object). Then, mutate() in new columns:
- length, containing the length of each contig. This is the str_length() of the Sequence for each contig.
- GC, containing the GC-content of each contig. Must be computed from Sequence as well. Hint: remove all A or T from the sequences, then compute their lengths, and divide by the full length.
- Kmer_coverage, containing the K-mer coverage of each contig. You find this in the Header of each contig; use word() and as.numeric().
- bin, which is the middle part of the fasta filename.
Finally, use bind_rows() to bind this table to the existing maxbin.tbl and store the result into maxbin.tbl. Thus maxbin.tbl should grow by some rows for each iteration of the loop.
After this step you should have a table with many rows, and four columns in addition to Header and Sequence.
Plot the contigs as points, using GC
on the x-axis,
Kmer_coverage
on the y-axis, map the point size to
length
and map the color to bin
. It is a good
idea to set the alpha
(transparency) to 0.5 or something
like that.
It could be an idea to only plot the contigs with length above 1000 or something like that, in order to make the plot a little less ‘noisy’.
Do the contigs from the various bins form groups in this figure?
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=maxbin
#SBATCH --output=maxbin_%j.log
##############
### Settings
###
threads=10
contigs=$COURSES/BIN310/module8/metaspades_contigs.fasta
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
out_dir=$SCRATCH/maxbin
prefix=$out_dir/metaspades
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
###################
### Running maxbin
###
apptainer exec $HOME/module8/maxbin2:2.2.7--he1b5a44_2.sif run_MaxBin.pl \
-thread $threads -min_contig_length 500 -markerset 107 -contig $contigs \
-reads $r1 -reads2 $r2 -out $prefix
library(tidyverse)
library(microseq)
bin.files <- list.files("/mnt/SCRATCH/larssn/maxbin", pattern = "metaspades.+fasta", full.names = T)
maxbin.tbl <- NULL
for(i in 1:length(bin.files)){
maxbin.tbl <- readFasta(bin.files[i]) %>%
mutate(length = str_length(Sequence)) %>%
mutate(GC = str_length(str_remove_all(Sequence, "[AT]"))) %>%
mutate(GC = GC / length) %>%
mutate(Kmer_coverage = as.numeric(word(Header, 6, 6, sep = "_"))) %>%
mutate(bin = str_remove_all(basename(bin.files[i]), "metaspades.|.fasta")) %>%
bind_rows(maxbin.tbl)
}
p <- ggplot(maxbin.tbl) +
geom_point(aes(x = GC, y = Kmer_coverage, color = bin, size = length), alpha = 0.5) +
scale_y_log10()
print(p)
Each dot is a contig, colored by the bin it has been assigned to. The size of the dot reflects the length of the contig. We see the longer contigs also have larger K-mer coverages. In general the contigs of the same bin appear in clusters, as they should. I would not put much trust in the bins with many very low coverage contigs.
Did maxbin actually bin the contigs?
Note that maxbin tries to group (bin) the contigs without any prior knowledge of how many genomes we have been sequencing, or what these genomes look like.
In an earlier exercise, we looked at the results of running metaquast on the same contigs that maxbin was clustering above. But then metaquast aligned all contigs to the actual ‘true’ reference genomes, the genomes from which the reads originally came. Hence, we presume most contigs were assigned to their correct genome by metaquast (unless the contigs were really bad). This means we now have a golden opportunity to see whether maxbin binned together those contigs that metaquast aligned to the same genome.
Look into the directory where you output the metaquast results. In the subfolder runs_per_reference/ you find one folder for each reference genome. In each reference genome folder there is a folder named contigs_reports/, and in this a file named all_alignments_<something>.tsv. This table gives us information about which contigs were assigned to the corresponding reference genome (the name of the folder above). NOTE: The data are formatted slightly strangely in this file, but it should be possible to read with read_delim() (even if it gives some warnings). Read these tables using a loop, much like you did in the previous exercise. Create a table named metaquast.tbl <- NULL, and use bind_rows() to grow the table in each iteration, as we did in the previous exercise. This table should have the columns Header (the contig Headers) and reference_genome (taken from the reference genome folder name). Now, this metaquast.tbl contains the ‘true’ binning of the contigs, i.e. contigs assigned to the same reference genome should also be binned together by maxbin.
How should we compare this binning to the one we got from
maxbin
?
From the previous exercise you have the maxbin.tbl
, and
you only need the columns bin
, length
and
Header
. Start by full_join()
this with the
metaquast.tbl
, using the Header
as the joining
column. Next, remove all rows with missing data (NA
) in
them (use drop_na()
). Then group the table by both
bin
and reference_genome
and use
summarise()
to compute the sum of the lengths for each
group.
If a bin
is perfect then it should only occur once in
the summarised table, which means all its contigs were also found in
only one single reference genome. The quality of a bin
can
be seen from the distribution of the sum of lengths. A good
bin
means the vast majority of the contig lengths were from
a single reference genome.
How good are the bins? Can you make a plot that illustrates how ‘contaminated’ each bin is?
### First, run the script in the previous exercise
ref.dirs <- list.dirs("metaquast_spades/runs_per_reference", full.names = T, recursive = F)
metaquast.tbl <- NULL
for(i in 1:length(ref.dirs)){
metaquast.tbl <- read_delim(file.path(ref.dirs[i], "contigs_reports", "all_alignments_metaspades_contigs.tsv"), delim = "\t") %>%
drop_na(Contig) %>%
select(Header = Contig) %>%
mutate(reference_genome = basename(ref.dirs[i])) %>%
bind_rows(metaquast.tbl)
}
final.tbl <- maxbin.tbl %>%
select(bin, Header, length) %>%
full_join(metaquast.tbl, by = "Header") %>%
drop_na() %>%
group_by(bin, reference_genome) %>%
summarise(length = sum(length))
p2 <- ggplot(final.tbl) +
geom_col(aes(x = bin, y = length, fill = reference_genome)) +
coord_flip() +
labs(y = "Size of bin (bases)", x = "maxbin bin") +
theme(legend.position = "bottom", legend.title = element_blank()) +
guides(fill = guide_legend(nrow = 10))
print(p2)
Each bar is a bin, and if the binning was in line with the quast alignments, then all bars should have one single color only, i.e. all contigs in the bin should have been aligned to the same reference genome by quast. It looks like bins 002, 003, 004, 005, 006 and perhaps 010 are pretty single-colored. The rest are rubbish! Mixtures of contigs from all kinds of reference genomes…
Note that in this case we know that all the 30 reference
genomes were in the metagenome. Of these maxbin
claims we
re-construct 13, since there are 13 bins. Of these, perhaps 6 of them
can be said to be close to re-constructed, based on the figure above.
And even these are slightly contaminated with contigs from other
genomes.
And this is based on simulated data (better than real data) from a metagenome with “only” 30 species. This indicates how difficult it is to re-construct genomes from metagenomes. It should be noted that with Nanopore data we most likely would have gotten better results.
Other binning software tools may produce better results than what we got with maxbin above, but we can safely say that binning is a difficult task and tools produce far from perfect results. We should always keep in mind that the bins we get may to some degree be ‘mosaics’: contigs will have some errors in them, and the binning may group contigs that actually come from different genomes. For this reason it is not uncommon to use several different binning tools on the exact same contigs. They will come up with (slightly) different bins, and the bins they agree upon we then trust more than the others.
Also, as mentioned previously, we may sequence several metagenomes from very similar environments, and expect that more or less the same genomes should be found in many of them. Instead of doing a co-assembly, as mentioned above, we may still assemble them separately, but then try to de-replicate the bins in the end.
Whether we have many similar bins due to binning the same contigs with several tools, or due to assembling several data sets from the same environment, we would now like to
- assess the quality (completeness and contamination) of each bin; we use the checkm2 software for this purpose.
- de-replicate the bins, i.e. recognize bins representing the same genome; we use the dRep software for this purpose.
we use to get values for the
completeness and contamination for each bin. The
completeness is estimated by scanning the bin for housekeeping genes we
expect all complete organisms to have. If 10% of these are lacking the
completeness is 90%. The contamination is estimated by looking for genes
that we expect to find in a single copy only. If there are two
or more copies of some of these, it indicates we have a mosaic bin where
contigs have been included that should perhaps not be there. This
results in contamination values larger than 0. Ideally we should have
completeness 100% and contamination 0%, but default thresholds are 75%
and 25%, i.e. we accept some mosaic both ways.
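As a toy numeric illustration of this logic (made-up numbers; checkm2's actual estimates are more refined than these simple ratios):
n_markers   <- 107   # single-copy marker genes searched for
n_found     <- 96    # markers found at least once in the bin
n_multicopy <- 5     # markers found in two or more copies
completeness  <- 100 * n_found / n_markers       # roughly 90 percent complete
contamination <- 100 * n_multicopy / n_markers   # roughly 5 percent contamination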
We then use these results from checkm2 as input to dRep, together with the bins themselves. dRep will simply group together bins that are so similar they most likely represent the same genome. Such bins may exist for at least two reasons:
- the same contigs have been binned by several different binning tools, or
- several data sets from the same environment have been assembled and binned separately.
Download the container for checkm2 and for dRep. Make certain the latter is of version 3.0 or later, since previous versions behave slightly differently.
Here is a shell script for running these two on the bins we got from maxbin earlier:
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=50G
#SBATCH --job-name=MAGmaker
#SBATCH --output=MAGmaker_%j.log
#SBATCH --time=01:00:00
##############
### Settings
###
bins_folder=$SCRATCH/maxbin
checkm_folder=checkm
dRep_folder=dRep
min_completeness=75 # minimum percent completeness required for MAG (default 75)
max_contamination=25 # maximum percent contamination required for MAG (default 25)
threads=10
checkm_database=/mnt/databases/checkm2/uniref100.KO.1.dmnd
#######################################################
### Create folders
###
if [ ! -d $checkm_folder ]
then
mkdir -p $checkm_folder
fi
if [ ! -d $dRep_folder ]
then
mkdir -p $dRep_folder
fi
#####################
### Running checkm2
###
apptainer exec checkm2:1.0.1--pyh7cba7a3_0.sif checkm2 predict \
--extension .fasta \
--threads $threads \
--input $bins_folder \
--output-directory $checkm_folder \
--allmodels \
--database_path $checkm_database
############################################################
### Converting checkm results to .csv file for dRep input
###
module load R/4.3.1
Rscript -e "library(tidyverse)
checkm_folder <- commandArgs(trailingOnly = T)
checkm.tbl <- read_delim(file.path(checkm_folder, 'quality_report.tsv'), delim = '\\t') %>%
mutate(genome = str_c(Name, '.fasta')) %>%
select(genome, completeness = Completeness_General, contamination = Contamination)
write_delim(checkm.tbl, delim = ',', file = file.path(checkm_folder, 'checkm2dRep.csv'))" $checkm_folder
##################
### Running dRep
###
apptainer exec $HOME/module8/drep:3.4.5--pyhdfd78af_0.sif dRep dereplicate \
$dRep_folder \
-p $threads \
-g $bins_folder/*.fasta \
--genomeInfo $checkm_folder/checkm2dRep.csv \
--completeness $min_completeness \
--contamination $max_contamination
Make the script in your module8 folder and sbatch it. Make sure the $bins_folder matches the name you used when running maxbin above.
Notice we first run checkm2 on the bins from maxbin. This produces some output in the $checkm_folder. Then we use R to read the file quality_report.tsv, transform it slightly, and write the resulting table to checkm2dRep.csv. This is required because dRep expects these results in a slightly different format than what checkm2 outputs. This is an excellent example of how we need some coding to adapt the output from one tool to work as input to the next!
We then run dRep, and notice we here set the quality thresholds for our MAGs ($min_completeness and $max_contamination).
From the output we got from checkm2
and
dRep
we now make a table in R with the essential
results:
library(tidyverse)
checkm_folder <- "checkm"
dRep_folder <- "dRep"
checkm.tbl <- read_delim(file.path(checkm_folder, "quality_report.tsv")) %>%
select(bin_id = Name, completeness = Completeness_General, contamination = Contamination)
MAG.tbl <- tibble(MAG_file = list.files(file.path(dRep_folder, "dereplicated_genomes"))) %>%
mutate(MAG_id = str_remove(MAG_file, ".fasta$")) %>%
left_join(checkm.tbl, by = c("MAG_id" = "bin_id"))
MAG.tbl <- read_delim(file.path(dRep_folder, "data_tables", "Cdb.csv"), delim = ",") %>%
select(MAG_file = genome, cluster = secondary_cluster) %>%
right_join(MAG.tbl, by = "MAG_file")
Inspect the table MAG.tbl. It looks like we ended up with 6 MAGs, given the quality thresholds and the de-replication. If you inspect the checkm.tbl it becomes apparent that it was the completeness threshold that eliminated some bins here.
The cluster column tells us which bins are actually the same genome. Here there is no grouping of any of them, which is as expected since we have only one sample and used only one software tool to produce these bins. Note the clustering is in two ‘levels’, e.g. 1_0. This means ‘primary cluster 1’ and then ‘secondary cluster 0’ inside the primary cluster. In some cases bins may just be different strains of the same species; these typically get the same primary cluster value, but differ in the secondary. The clustering of the bins/genomes is based on the software fastANI, which we mentioned under K-mer based comparisons in Module 7. It is quite similar to the mash software.
Since we now have the luxury of having the actual genomes that were in this metagenome, it would be interesting to see if the MAGs we got actually are very close to some of the 30 genomes we started out with.
Here are some approaches we could use:
- Use quast as we did for single genomes, with one MAG at a time as contigs, and all reference genomes as we did for metaquast above. Ideally, the MAG should match 100% to one and only one of the reference genomes.
- Compute mash distances between all genomes, like we did in Module 7, make a tree, and collect the distance from each MAG to its nearest neighbor among the reference genomes. A distance (very) close to 0 means we have a (very) good re-construction of some reference genome.
- Predict genes with prodigal, run roary, and make a pan-genome tree. Again find the nearest neighbor for each MAG, and see how many predicted genes they do not share. This requires more effort, but indicates whether whole genes are re-constructed in the MAG.
Use the metaquast approach mentioned above. I would evaluate one MAG at a time, using all the reference genomes we used when we evaluated with metaquast above.
The code for evaluating one of the MAGs:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=metaquast
#SBATCH --output=metaquast_%j.log
##############
### Settings
###
threads=10
contigs=dRep/dereplicated_genomes/metaspades.004.fasta # edit this
out_dir=metaquast_MAG
ref=$COURSES/BIN310/module8/ref_genomes
######################
### Running metaquast
###
apptainer exec $HOME/module4/quast:5.2.0--py39pl5321h4e691d4_3.sif metaquast.py \
--threads $threads -r $ref --output-dir $out_dir $contigs
Here is one figure from metaquast (the ugly scroll-bar solution for PDF files seems difficult to avoid…):
[Figure: fig/Genome_fraction.pdf, the Genome fraction plot]
We notice that the Corynebacterium genome seems to be the one this MAG has re-constructed, but the MAG also contains some content from some of the other genomes.