1 Learning goals

This module we devote entirely to the binning problem. This arises when we do whole genome metagenome sequencing, we have assembled the reads into contigs and we now ask: Which contigs come from the same genome? This builds on what we have done in the previous modules and assignment 2.

  • Learn what is meant by binning, and understand what type of information in the contigs we use to group them.
  • Learn how to use the software maxbin for doing this step.
  • Learn how to do the exact same type of binning we did with maxbin2 using tools like metabat2 and concoct to obtain alternative solutions.
  • Learn how we can combine the results of several binners into a final set of ensemble bins.



1.1 Software used in this module





2 Metagenome Assembled Genomes (MAGs)

When we do a Whole Genome Sequencing of metagenomes we are often interested in trying to re-construct at least some of the genomes in the microbial community we sequenced. Such re-constructed genomes we often refer to as MAGs.

We saw already in the very last part of Module 6 that in general we cannot expect to be able to re-construct all genomes in such communities, only those we have sequenced deep enough (high enough read coverage). Remember that in a metagenome there may be many different genomes, at very different abundances, and only the more abundant ones we can hope to re-construct.

Being able to re-construct genomes from metagenomes may be of great importance. First, we may of course reveal the genome of entirely new taxa this way, organisms never described before. Second, since these are typically the most abundant taxa, we may also presume they are among the most important ones for how the microbial community works. Imagine we want to understand the roles of the microbes in the gut of farmed salmon, in order to design fish feed which is effective and healthy to the fish. Then finding the genomes of these microbes, and see what kind of genes they contain and how these are expressed, is crucial for being able to understand how different feeds affect the fish digestion system (see Genomic and functional characterization of the Atlantic salmon gut microbiome in relation to nutrition and health).

The assembly step we have seen in Module 7 (and assignment 2) is of course the first part of this. We first have to piece together the reads into longer fragments of the genomes. However, we also saw in assignment 2 that even after assembly we still have a rather large number of contigs. We have managed to piece together many small puzzle-pieces into larger ones, but we still have many of them, and typically none of them are complete genomes.

The next step we take is what we refer to as binning.



2.1 Binning

The contigs we get from the assembly will in general be from several different genomes, and on the other hand, several of them are from the same genome. The binning of the contigs simply means we cluster or group the contigs, such that those in the same group are from the same genome. This is especially important when the assembly results in many (and smallish) contigs, which is often the case for short read (Illumina) data. For Nanopore data we can hope for some very long contigs spanning entire genomes, but in reality this is often not the case when we work with metagenome data.

Think back to the metabarcoding data we processed already in Module 2. There we also did a similar clustering! We have a large number of reads from some amplicons, and we wanted to group those coming from the same organism into the same groups. We used tools like Rsearch or dada2 for this. Could we use the same tools here?

No we can not. With metabarcoding data all reads are from the exact same gene, and those from the same organism must here for be (very) similar to each other. Now we have a set of thousands of contigs, and even if two contigs are from the same genome, they are not from the same part of that genome. Hence, we cannot simply align them and see if they are similar. It is more like having different body parts from different animals, and trying to decide which come from the same animal. Then a leg and a head from the same animal are not very similar to each other, and we cannot simply form clusters or groups based directly on how the body parts look like.

Instead of sequence similarity we have to look for other characteristics of the various contigs in order to cluster or bin them. There are two pieces of information often used when binning contigs:

  • The mean coverage, either read coverage or K-mer coverage, for each contig.
  • Some sequence content statistics, e.g. the GC-fraction or something more advanced, for each contig.

All contigs from the same genome should have similar coverage. As long as genomes have (very) different abundance, it should be possible to distinguish the contigs from this information alone. However, different organisms with similar abundances may of course occur in our data and we can rarely base the binning only on this criterion.

But, prokaryotes also vary a lot in their GC-content. If we have a fairly long contig, the GC-content can be computed with good precision. Thus, contigs from the same organism should also have similar GC-content. In reality the GC-content idea may be extended. Note that computing GC-content corresponds to the idea that a DNA sequence consists of 2 symbols only (G or C could be replaced by X, A or T replaced by Y), and we then count 1-mers! Of course, DNA has 4 letters, and we may instead count canonical K-mers with \(K>1\). Different genomes may have different patterns of such K-mers, and this may differ even if their GC-content is quite similar.

When we did the assembly we used only a single tool, the spades for Illumina and flye for Nanopore data. Even if there are other tools around, the results we get are typically not that different between assembly tools. With binning it is different. Different tools can sometimes produce very different results! For this reason we will not just stick to a single way of doing this. We prefer the following approach:

  • Cluster the same contigs using three different binning tools (maxbin2, metabat2, concoct), producing three alternative sets of bins for the same set of contigs.
  • Use en ensemble tool (DASTool) to work out which of all these alternative bins are the ones we should trust and keep as our final bins.

Below we will first work our way through the three binning tools, and then round off by using the ensemble binner to get the final result for how we should group our contigs.



2.2 The maxbin idea

We will have a look ‘under the hood’ at one of the many software tools available for binning, the maxbin2.

First, read the original paper on maxbin:

The 2.0 version of this software (that we typically use) has a newer publication, but the basic ideas are the same, and therefore not described as well as in the original paper. Again we focus only on the Methods section, here is a guide:

  • What type of information is used from each contig?
  • How is the number of bins determined?
  • What is meant by Euclidean distance between tetranucleotide frequencies? (the \(dist(tetra(S_1), tetra(S_2))\))
  • What is meant by \(P_{dist}\)?
  • What is meant by \(P_{cov}\)?
  • An ambition is to understand the EM-algorithm idea described.

I have recorded a short video on this:



2.3 Abundances for each contig

We saw that maxbin needs to know the read coverage for each contig. In the previous module we learnt how to map reads to some contigs. Whether we use bowtie2 (short reads) or minimap2 (long reads) for this we always use samtools to compute the read coverage for each position in all contigs. This is then output as a huge table with 3 columns:

  • The contig_id
  • The position
  • The coverage value

From such a table we can easily make R code that group rows by the contig_id information and then compute

  • The contig length, which is then simply the number (n()) of rows for each group.
  • The mean coverage, which is then simply the mean (mean()) of the coverage values
  • The coverage variance, which is then simply the variance (var()) of the coverage values.

We will see below that all our binning tools require an abundance table containing at least some of these columns. We therefore take the effort to compute this once and for all, and then use it later for binning.

2.3.1 Exercise - compute abundance table

In Module 8 we did some read mapping against some contigs. In the last exercise of Module 8 you were asked to first filter the contigs by length, only keeping the contigs of length 1000 or more and then do the read mapping against these. You should do this exercise yourself, but you also find the results from this in

  • /mnt/courses/BIN310/module9/illumina_contigs.fasta, which is the fasta file with all the contigs of length 1000 or more, and
  • /mnt/courses/BIN310/module9/coverage_illumina.txt, which is the result of read mapping and computing the coverage for each position along these contigs.

Make R code to compute from the output above a table with four columns:

  • First column list the contig_id’s.
  • Second column the length of each contig.
  • Third column the mean coverage of each contig.
  • Forth column the variance of the coverages for each contig.

Make a folder named bins/, and write this table to a tab-separated text file named abundance.tsv in that folder. We will also put the binning results into this folder below.

Note that the coverage file you read into R is huge, use fread() from data.table to read it (see Module 8 exercises).

2.3.2 Exercise solution

library(tidyverse)
library(data.table)

cov.file <- "/mnt/courses/BIN310/module9/coverage_illumina.txt"
abundance.file <- "bins/abundance.tsv"

ctg.tbl <- fread(cov.file, col.names = c('contig_id', 'position', 'coverage')) |>
  group_by(contig_id) |>
  summarise(length = n(), mean_coverage = mean(coverage), var_coverage = var(coverage)) |>
  ungroup() |>
  select(contig_id, length, mean_coverage, var_coverage)
write_tsv(ctg.tbl, file = abundance.file)

Note that it would perhaps be natural to have this as a chunk of R code inside the shell script where we did the read mapping, as a very last part of that script. Here we have it as a separate R script, and this is also an option.



2.4 Using maxbin2 on Orion

Let us try maxbin2 to bin/group the contigs in $COURSES/BIN310/module8/illumina_contigs.fasta. Download the newest container for maxbin2 in the usual way from galaxy. Store it in your module9 folder.

Here is the essential code for binning the contigs, using the maxbin2 software, and where we need the file from the previous exercise:

##########################
### Settings
###
threads=10
contigs_file=$COURSES/BIN310/module9/illumina_contigs.fasta
bins_dir=bins
abundance_file=$bins_dir/abundance.tsv
out_dir=$SCRATCH/binning
prefix=$out_dir/maxbin
new_abundance_file=$out_dir/abundance_max.tsv

####################
### Initialization
###
if [ ! -d $out_dir ]
then
    mkdir -p $out_dir
fi
if [ ! -d $bins_dir ]
then
    mkdir -p $bins_dir
fi
module load R/4.4.2

############################
### Editing abundance table
###
Rscript -e "args <- commandArgs(trailingOnly = T)
library(tidyverse)
abundance.file <- args[1]
new.abundance.file <- args[2]
abd.tbl <- read_tsv(abundance.file) |> 
  select(contig_id, mean_coverage)
write_tsv(abd.tbl, file = new.abundance.file)
" $abundance_file $new_abundance_file

#####################
### Running maxbin2
###
apptainer exec maxbin2:2.2.7--he1b5a44_2.sif run_MaxBin.pl \
 -thread $threads \
 -min_contig_length 1000 \
 -contig $contigs_file \
 -abund $new_abundance_file \
 -out $prefix

#######################
### Copying bin files
###
Rscript -e "args <- commandArgs(trailingOnly = T)
library(tidyverse)
out.dir <- args[1]
bins.dir <- args[2]
prefix <- args[3]
bin.files <- list.files(out.dir, pattern = str_c(basename(prefix), '.[0-9]+.fasta'))
ok <- file.copy(from = file.path(out.dir, bin.files),
                to = file.path(bins.dir, bin.files),
                overwrite = T)
" $out_dir $bins_dir $prefix

Note how we direct all output to a folder under SCRATCH which we refer to as out_dir. Then, at the end we copy the fasta files, one for each bin, to the folder specified in bins_dir. You may then delete this temporary folder under $SCRATCH.

Both out_dir and bins_dir are created unless they already exist, and we load R.

Some lines of R code are included to just read the coverage table we produced above, and then store a new table containing only two of the column in a new file (new_abundance_file). This is required since maxbin2 will only have two columns in the input abundance file:

  • The contig_id in the first column
  • The mean coverage values in the second

Thus, we have to edit this input before it used. Note that in the maxbin2 command line it is this new abundance file we use, not the original one.

We use the -min_contig_length option to maxbin2 here. It is impossible to cluster too short contigs, their coverage, K-mer counts etc becomes too uncertain, hence a minimum contig length of 1000 has been chosen here. The default is 500.

We supply the read coverage information in new_abundance_file, using the -abund option, since we have already computed this. You may actually run maxbin2 with this input, but then you need to supply the read files instead. The maxbin2 will then do the read mapping itself (using bowtie2!) and compute read coverages for each contig. The reason we do this as a separate step first, is that we will use several binning tools, and they all require this information. It is then sensible to do this once, and use its results each time.

The output from maxbin2 are several files, all having the prefix we set in $prefix. For each bin (=cluster) there will be a fasta-file named <prefix>.001.fasta and so on with increasing numbers. This file contains all contigs assigned to the respective bin. The <prefix>.summary is a small text file summarizing each bin.

We have added another piece of R code at the end, where we simply copy all the bin files from out_dir to a more permanent folder bins_dir. This is the only output we want to keep.

2.4.1 Exercise - plot maxbin2 results

First, make the shell script from the code above, and sbatch to get some maxbin results in the bins_dir. You only need 10GB of memory and 1 hour wall time, this should not take more than a few minutes to complete.

Plot the binned contigs as a scatter-plot (points), where you use the contig length on the x-axis, the contig mean coverage on the y-axis and you color by bin_id. The bin_id is simply the name of the various fasta files you ended with (remove the .fasta extension). Use log-transformed axes.

Here is a code skeleton to help you start:

library(tidyverse)
library(microseq)

bins.dir <- ___
bin.files <- list.files(___, pattern = "maxbin")
bin.tbl <- NULL
for(i in 1:length(___)){
  bin.tbl <- readFasta(file.path(bins.dir, bin.files[i])) |> 
    select(contig_id = Header) |> 
    mutate(bin_id = str_remove(___, ".fasta")) |> 
    bind_rows(bin.tbl)
}
bin.tbl <- read_tsv("contig_coverages.tsv") |> 
  right_join(___, by = "contig_id")
fig <- ggplot(___) +
  geom_point(aes(x = ___, y = ___, color = ___). alpha = 0.3) +
  scale_x_log10() + scale_y_log10()
print(___)

2.4.2 Exercise solution

library(tidyverse)
library(microseq)

############################
### With length on x-axis
###
bins.dir <- "bins"
bin.files <- list.files(bins.dir, pattern = "maxbin")
bin.tbl <- NULL
for(i in 1:length(bin.files)){
  bin.tbl <- readFasta(file.path(bins.dir, bin.files[i])) |>
    select(contig_id = Header) |>
    mutate(bin_id = str_remove(bin.files[i], ".fasta")) |>
    bind_rows(bin.tbl)
}
bin.tbl <- read_tsv("contig_coverages.tsv") |>
  right_join(bin.tbl, by = "contig_id")
fig <- ggplot(bin.tbl) +
  geom_point(aes(x = length, y = mean_coverage, color = bin_id), alpha = 0.3) +
  scale_x_log10() + scale_y_log10()
print(fig)

Each dot is a contig, colored by the bin it has been assigned to. We see the longer contigs also tend to have larger coverages. In general the contigs of the same bin appear in ‘bands’ at various coverages, but there is also a huge overlap between the bins in terms of coverage.

2.4.3 Exercise - GC-content

In the plot above we used the contig lengths on the x-axis. But this is not one of the criteria used for binning. Compute instead the GC-content of each contig, and use this on the x-axis instead of length. Is the grouping more visible now?

Hint: How do you compute GC-content? Read in the bin file, and for each contig compute the sequence length, then count how many G or C it has (str_count()) and divide by the length to getthe GC-content.

2.4.4 Exercise solution

bins.dir <- "bins"
bin.files <- list.files(bins.dir, pattern = "maxbin")
bin.tbl <- NULL
for(i in 1:length(bin.files)){
  bin.tbl <- readFasta(file.path(bins.dir, bin.files[i])) |>
    mutate(length = str_length(Sequence), nGC = str_count(Sequence, "G|C")) |>
    mutate(GC = nGC / length) |>
    select(contig_id = Header, GC) |>
    mutate(bin_id = str_remove(bin.files[i], ".fasta")) |>
    bind_rows(bin.tbl)
}
bin.tbl <- read_tsv("contig_coverages.tsv") |>
  right_join(bin.tbl, by = "contig_id")
fig <- ggplot(bin.tbl) +
  geom_point(aes(x = GC, y = mean_coverage, color = bin_id), alpha = 0.3) +
  scale_y_log10()
print(fig)

It looks like this view of the data better explains how maxbin2 has grouped the contigs. We can more clearly see how contigs of the same color (=same bin) are close to each other in both GC-direction and coverage-direction.



2.5 Binning with metabat2

As mentioned above, we would like to have alternative binning of the same contigs. Let us do the exact same exercise with the software metabat2. As usual, download the newest container for metabat2 from the galaxy web page.

Here is a code skeleton you should copy and edit:

#################
### Settings
###
threads=10
contigs_file=$COURSES/BIN310/module9/illumina_contigs.fasta
bins_dir=bins
abundance_file=$bins_dir/abundance.tsv
out_dir=$SCRATCH/binning
prefix=$out_dir/metabat
new_abundance_file=$out_dir/abundance_met.tsv

#####################
### Initialization
###
if [ ! -d $out_dir ]
then
    mkdir -p $out_dir
fi
if [ ! -d $bins_dir ]
then
    mkdir -p $bins_dir
fi
module load R/4.4.2

#########################################
### Make certain contigs are listed
### in the same order in abundance_file
###
Rscript -e "
args <- commandArgs(trailingOnly = T)
library(tidyverse)
library(microseq)
abundance.file = args[1]
new.abundance.file <- args[2]
contigs.file <- args[3]
abd.tbl <- read_tsv(abundance.file) |> 
  mutate(coverage = mean_coverage) |> 
  relocate(contig_id, length, coverage, mean_coverage, var_coverage)
abd.tbl <- readFasta(contigs.file) |> 
  select(contig_id = Header)|> 
  left_join(abd.tbl, by = 'contig_id')
write_tsv(abd.tbl, file = new.abundance.file)
" $abundance_file $new_abundance_file $contigs_file

######################
### Running metabat2
###
apptainer exec metabat2:2.18--h6f16272_0.sif metabat2 \
 --inFile $contigs_file \
 --abdFile $new_abundance_file \
 --minContig 1500 \
 --numThreads $threads \
 --outFile $prefix

####################################
### Copying and renaming bin files
###
Rscript -e "
<add R code here>
" $out_dir $bins_dir $prefix

Compare to the script for maxbin2 above, and you find it is similar in many ways. Here are the differences:

Again we need to edit the coverage table. The metabat2 want a file with five columns, while our has only four. You fix this by simply mutating in an extra column called coverage and copy the values in mean_coverage to this. The columns must appear in this order: contig_id, length, coverage, mean_coverage, var_coverage. In the script above the first chunk of R code (Rscript) is meant to fix this. In addition, the metabat2 is finicky with respect to contig ordering! The contigs must be listed in the exact same order in this file as in the contigs_file. This we also fix inside this R code chunk.

The command line for running metabat2 is given. Note that the length limit is 1500 here, instead of 1000 as for maxbin2. This is actually the smallest value that metabat2 will accept!

The last R code is for copying the bin fasta files from the out_dir to the bins_dir just as we did for maxbin2. The code will be quite similar.

2.5.1 Exercise - run metabat2

Fill in the missing R code in the shell script above, and sbatch to get some metabat-contigs!

Hint: In the last R code chunk you can copy what we did for maxbin. But, note that metabat2 will give all fasta files extension .fa instead of .fasta. Change this when you copy, i.e. the files you copy to should have extension .fasta (use str_replace()). We will see later it is nice if all our bin files have the same .fasta file extension.

How many bins did metabat2 group the contigs into? Is it more or less than maxbin2?

2.5.2 Exercise solution

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --partition=BIN310               # For BIN310 course only
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=10G                        # The amount of memory reserved
#SBATCH --time=01:00:00                  # Runs for maximum this time
#SBATCH --job-name=metabat2              # Sensible name for the job
#SBATCH --output=metabat2_%j.log         # Logfile output here

#################
### Settings
###
threads=10
contigs_file=$COURSES/BIN310/module9/illumina_contigs.fasta
bins_dir=bins
abundance_file=$bins_dir/abundance.tsv
out_dir=$SCRATCH/binning
prefix=$out_dir/metabat
new_abundance_file=$out_dir/abundance_met.tsv

#####################
### Initialization
###
if [ ! -d $out_dir ]
then
    mkdir -p $out_dir
fi
if [ ! -d $bins_dir ]
then
    mkdir -p $bins_dir
fi
module load R/4.4.2

#########################################
### Make certain contigs are listed
### in the same order in abundance_file
###
Rscript -e "
args <- commandArgs(trailingOnly = T)
library(tidyverse)
library(microseq)
abundance.file = args[1]
new.abundance.file <- args[2]
contigs.file <- args[3]
abd.tbl <- read_tsv(abundance.file) |>
  mutate(coverage = mean_coverage) |>
  relocate(contig_id, length, coverage, mean_coverage, var_coverage)
abd.tbl <- readFasta(contigs.file) |>
  select(contig_id = Header)|>
  left_join(abd.tbl, by = 'contig_id')
write_tsv(abd.tbl, file = new.abundance.file)
" $abundance_file $new_abundance_file $contigs_file

######################
### Running metabat2
###
apptainer exec metabat2:2.18--h6f16272_0.sif metabat2 \
 --inFile $contigs_file \
 --abdFile $new_abundance_file \
 --minContig 1500 \
 --numThreads $threads \
 --outFile $prefix

#######################
### Copying bin files
###
Rscript -e "args <- commandArgs(trailingOnly = T)
library(tidyverse)
out.dir <- args[1]
bins.dir <- args[2]
prefix <- args[3]
bin.files <- list.files(out.dir, pattern = str_c(basename(prefix), '.[0-9]+.fa'))
to.files <- str_replace(bin.files, 'fa$', 'fasta')
ok <- file.copy(from = file.path(out.dir, bin.files),
                to = file.path(bins.dir, to.files),
                overwrite = T)
" $out_dir $bins_dir $prefix

We find the maxbin2 produced 15 bins, while metabat2 resulted in 18 (or 17?) bins from the same data. This is actually quite similar, which is a good thing. The best would be if binning was such an easy problem that all tools produced the same result. We often see huge differences, much bigger than this.



2.6 Binning with concoct

Our third alternative binning tool is concoct. Again we need to supply the coverages for all contigs, and this time in the exact same format as with maxbin2 (two columns, contig_id and mean_coverage).

The concoct does not output the fasta files directly, but a table listing which contigs (their contig_id) are in each bin. We then add some R code at the end where we make this output to fasta files, as for the other tools.

Here is the shell script skeleton:

##########################
### Settings
###
threads=10
contigs_file=$COURSES/BIN310/module9/illumina_contigs.fasta
bins_dir=bins
abundance_file=$bins_dir/abundance.tsv
out_dir=$SCRATCH/binning_con
prefix=$out_dir/concoct
new_abundance_file=$out_dir/abundance_con.tsv

####################
### Initialization
###
if [ ! -d $out_dir ]
then
    mkdir -p $out_dir
fi
if [ ! -d $bins_dir ]
then
    mkdir -p $bins_dir
fi
module load R/4.4.2

############################
### Editing abundance table
###
<copy code from maxbin2 script above>

######################
### Running concoct
###
apptainer exec concoct:1.1.0--py39h8907335_8.sif concoct \
 --composition_file $contigs_file \
 --coverage_file $new_abundance_file \
 --basename $prefix \
 --length_threshold 1000 \
 --threads $threads

#################################
### Writing bins to fasta files
###
Rscript -e "args <- commandArgs(trailingOnly = T)
library(tidyverse)
library(microseq)
contigs.file <- args[1]
bins.dir <- args[2]
prefix <- args[3]
concoct.tbl <- read_delim(str_c(prefix, '_clustering_gt1000.csv'), delim = ',')
contigs.tbl <- readFasta(contigs.file)  |>
  inner_join(concoct.tbl, by = c('Header' = 'contig_id'))
ubins <- sort(unique(contigs.tbl\$cluster_id))
cat('Concoct produced', length(ubins), 'bins\n')
for(i in 1:length(ubins)){
  cat('*** writing bin', ubins[i], 'to FASTA file\n')
  bin.tbl <- contigs.tbl  |>
    filter(cluster_id == ubins[i])
  writeFasta(bin.tbl, out.file = file.path(bins.dir,
                                           str_c(basename(prefix), '.', ubins[i], '.fasta')))
}
" $contigs_file $bins_dir $prefix 

Note how the start is very similar to the other two binning scripts, we just change the $prefix and the name of the new abundance file. The code for editing the coverage table you can copy directly from the maxbin2 script above, concoct want the same input.

The command line is also quite similar to the other two, the options have different names, but the input is basically the same. You need of course to download the corresponding apptainer container to your working folder (module9) for this to work. We use the minimum contig length of 1000 as with maxbin2.

2.6.1 Exercise - learn how output looks like

The last chunk of R code has been left for you to fill in the remaining parts. Here we read in a table named str_c(prefix, '_clustering_gt1000.csv'). Since we set the prefix to $out_dir/concoct, and out_dir=$SCRATCH/binning it means we read in the table $SCRATCH/binning/concoct_clustering_gt1000.csv. In order to make this R code you need to know how this table looks like. How can we find this out?

Make the script and sbatch and let the job finish. The output should be found in $SCRATCH/binning/. Verify the table concoct_clustering_gt1000.csv is there.

Inspect the file, e.g. by running head on it in the Terminal. Notice how many columns, what do they contain (numbers?, Texts?) and if there any descriptive column names. This is how we always have to work in order to handle output and read it into R or similar. From this insight we can make the last chunk of R-code seen above.

Explain to yourself or some fellow student what this part of the R-code does:

concoct.tbl <- read_delim(str_c(prefix, '_clustering_gt1000.csv'), delim = ',')
contigs.tbl <- readFasta(contigs.file)  |>
  inner_join(concoct.tbl, by = c('Header' = 'contig_id'))

2.6.2 Exercise solution

We first read in the table produce by concoct from the out_dir. Then we read in the contigs fasta file. It appears that the table has a column named contig_id having the same identifier texts as in the Header column of the contigs file. We then join the two by this information to get the sequences into the table as well.

Also, the cluster_id column contains the bin identifiers (numbers). We then loop over all bins and output them one by one to the bins_dir. Note how these new fasta file get names containing the original prefix information, the bin identifier number and the file extension .fasta. This means the concoct bin files get names according to the same ‘pattern’ as those from maxbin2 and metabat2.



2.7 Summary

We have seen three different tools for doing the exact same job, namely grouping our contigs into clusters or bins in such a way that those contigs in the same bin are expected to be from the same genome. They all need as input the contigs (of course) and a text file with information about the coverage of each contig. They will treat this information slightly different, and they produce different results. Since binning is such a difficult task to get correct, we cannot assume any of the three are ‘correct’, but hopefully we get closer to the truth if we now ‘average’ out what they have found.

Note how each of the three tools were given a separate out_dir in which they output whatever they produce. At the end fo each script we copy the bin fasta file to the same bins_dir folder. Could we not have used the same out_dir for all three tools? We could, but by making them output to separate folders we can run them all simultaneously! This means we can sbatch all three in turn, without waiting for the one tofinish before we start the next. If they had written to the same out_dir this may go wrong, since they may produce files with identical names, and overwrite each other.

On the other hand, the bins_dir we store the final bin files is the same for all three. This is a good idea, as we will see in the next section below. Since we have given the bin fasta files different prefixes, we can see from the file names which toool produced which bin.

Note that if you have contigs from several different samples, you do the binning separately for each sample, and output the bins into a separate bins_dir folder for each sample.

The reason is of course that genomes in different samples may be different genomes, but even if they are not, the coverage of the genomes is surely sample-specific. Thus, never mix contigs from different samples when you do the binning. We will combine them in the end, but not at this stage!





3 Ensemble binning

We will now use the software DASTool to try to get the best out of the three sets of bins we have created.

You can read about how this tools works here:Recovery of genomes from metagenomes via a dereplication, aggregation and scoring strategy.

In brief, we can say that single copy genes play a central role. These are genes we believe should only occur in a single copy in any genome. Thus, any bin should not have more than one contig having such a gene. Based on this we can ‘learn’ which bins are most likely pure genomes and not a mixture of two or more.

Also, two different bin sets may combine the same contigs different, e.g. contig A is grouped with contig B and C with maxbin2, while contig A is grouped with B and D with concoct. But contig A cannot be in more than one final bin! This is also resolved by DASTool.

Finally, this tool will also compute some other quality characteristics, but this we will look more into in our next Module.



3.1 Preparing input for DASTool

We already have our bins from the three binning tools in the same folder (the bins folder). We have also given them file names indicating which tool they came from (maxbin2, metabat2 or concoct).

In order to use any tool, we need to understand a little bit what it expects as input. Here is a snippet of the Help text we get from running the DAS_Tool -h command in the Terminal:

apptainer exec das_tool:1.1.7--r44hdfd78af_1.sif DAS_Tool -h

Usage:
  DAS_Tool [options] -i <contig2bin> -c <contigs_fasta> -o <outputbasename>

Options:
   -i --bins=<contig2bin>                   Comma separated list of tab separated contigs to bin tables.
   -c --contigs=<contigs>                   Contigs in fasta format.
   -o --outputbasename=<outputbasename>     Basename of output files.
   -l --labels=<labels>                     Comma separated list of binning prediction names.

We notice from Usage that we have to supply some -i <contig2bin>, then the contigs as a fasta file and finally some -o <outputbasename> which is probably just some prefix text to use as start of all output file names. But what is <contig2bin>?

By studying the GitHub page we realize this is a tab-separated text file with two columns:

  • The contig_id’s for each contig in the first column.
  • The bin_id’s for each contig in the second column.

This is in fact exactly the same as the concoct output, and that we used in our concoct script when we produced the bin fasta files as output there.

3.1.1 Exercise - Prepare contig-to-bin tables

Make R code that reads all maxbin2 bin fasta files, and make a table with two columns. The first column must contain the contig_id for each contig, the second contains the bin_id, which is simply the file name minus the file extension. Here is a code skeleton you may start with:

library(tidyverse)
library(microseq)

##############################
### Making ctg2bin for maxbin
###
bin.files <- list.files(___, pattern = "maxbin.[0-9]+.fasta")  # names of bin-files
ctg2bin.tbl <- NULL
for(i in 1:length(bin.files)){
  ctg2bin.tbl <- readFasta(file.path(___, ___)) |> 
    mutate(contig_id = word(Header, 1, sep = "\t")) |>
    mutate(bin_id = str_remove(___, ".fasta")) |> 
    select(contig_id, bin_id) |> 
    bind_rows(ctg2bin.tbl)
}
write_tsv(___, file = "bins/maxbin_ctg2bin.tsv", col_names = FALSE)

NOTE 1: Here we mutate the contig_id as the first word in each Header, where we separate words by tab \t. In the files from maxbin2 and concoct there are only one word in these Headers, but in the metabat files some more has been added, separated by tab. This code should work in all three cases.

NOTE 2: You may get a warning when reading some of the metabat2 files. This is due to the tabs in the Headers, which is not expected by readFasta(). Re-install microseq from github (https://github.com/larssnip/microseq) where we have fixed this (version 2.1.8). The version on CRAN may not be the newest (yet).

The code skeleton above is for reading maxbin2 bins only, but the code for doing the same for metabat2 and concoct should be almost identical.

3.1.2 Exercise solution

library(tidyverse)
library(microseq)

##############################
### Making ctg2bin for maxbin
###
bin.files <- list.files("bins", pattern = "maxbin.[0-9]+.fasta")
ctg2bin.tbl <- NULL
for(i in 1:length(bin.files)){
  ctg2bin.tbl <- readFasta(file.path("bins", bin.files[i])) |>
    mutate(contig_id = word(Header, 1, sep = "\t")) |>
    mutate(bin_id = str_remove(bin.files[i], ".fasta")) |>
    select(contig_id, bin_id) |>
    bind_rows(ctg2bin.tbl)
}
write_tsv(ctg2bin.tbl, file = "bins/maxbin_ctg2bin.tsv", col_names = FALSE)

##############################
### Making ctg2bin for metabat
###
bin.files <- list.files("bins", pattern = "metabat.[0-9]+.fasta")
ctg2bin.tbl <- NULL
for(i in 1:length(bin.files)){
  ctg2bin.tbl <- readFasta(file.path("bins", bin.files[i])) |>
    mutate(contig_id = word(Header, 1, sep = "\t")) |>
    mutate(bin_id = str_remove(bin.files[i], ".fasta")) |>
    select(contig_id, bin_id) |>
    bind_rows(ctg2bin.tbl)
}
write_tsv(ctg2bin.tbl, file = "bins/metabat_ctg2bin.tsv", col_names = FALSE)

##############################
### Making ctg2bin for maxbin
###
bin.files <- list.files("bins", pattern = "concoct.[0-9]+.fasta")
ctg2bin.tbl <- NULL
for(i in 1:length(bin.files)){
  ctg2bin.tbl <- readFasta(file.path("bins", bin.files[i])) |>
    mutate(contig_id = word(Header, 1, sep = "\t")) |>
    mutate(bin_id = str_remove(bin.files[i], ".fasta")) |>
    select(contig_id, bin_id) |>
    bind_rows(ctg2bin.tbl)
}
write_tsv(ctg2bin.tbl, file = "bins/concoct_ctg2bin.tsv", col_names = FALSE)



3.2 Using DASToolon Orion

Download an apptainer container from galaxy. The containers have the name das_toolin addition to version numbers etc.

Here are the essential ingredients of a shell script for running this tool and get the final ensemble bins:

#####################
### Ensembl binning
###
apptainer exec das_tool:1.1.7--r44hdfd78af_1.sif DAS_Tool \
 --write_bins \
 --threads $threads \
 --bins $bins_dir/metabat_ctg2bin.tsv,$bins_dir/maxbin_ctg2bin.tsv,$bins_dir/concoct_ctg2bin.tsv \
 --labels metabat,maxbin,concoct \
 --contigs $contigs_file \
 --outputbasename $prefix

##############################
### Copying final bins home
###
Rscript -e "args <- commandArgs(trailingOnly = T)
library(tidyverse)
bins.dir <- args[1]
final.dir <- args[2]
prefix <- args[3]
from.files <- list.files(str_c(prefix, '_DASTool_bins'), full.names = T)
to.files <- file.path(final.dir, basename(from.files))
to.files <- str_replace(to.files, 'fa$', 'fasta')
ok <- file.copy(from = from.files,
                to = to.files,
                overwrite = T)
" $bins_dir $final_dir $prefix 

3.2.1 Exercise - run DASTool on the contigs

In the shell script above the start of the script is lacking. Add what is needed to make this working. You can copy from the previous binning scripts and edit. Use the same number of threads, amount of memory and wall time as in the previous scripts. Some hints:

  • Again, make an out_dir under SCRATCH. See what we did for the three binners above.
  • The prefix should perhaps be $out_dir/DASTool? See what we did for the three binners above.
  • The final output should be sent to final_dir=final_bins which is then a new folder inside your working folder.

3.2.2 Exercise solution

#!/bin/bash

#SBATCH --nodes=1                        # We always use 1 node
#SBATCH --partition=BIN310               # For BIN310 course only
#SBATCH --reservation=BIN310             # For BIN310 course only
#SBATCH --account=bin310                 # For BIN310 course only
#SBATCH --ntasks=10                      # The number of threads reserved
#SBATCH --mem=10G                        # The amount of memory reserved
#SBATCH --time=04:00:00                  # Runs for maximum this time
#SBATCH --job-name=dastool               # Sensible name for the job
#SBATCH --output=dastool_%j.log          # Logfile output here

#################
### Settings
###
threads=10
bins_dir=bins
contigs_file=$COURSES/BIN310/module9/illumina_contigs.fasta
out_dir=$SCRATCH/binning_das
prefix=$out_dir/DASTool
final_dir=final_bins

###################
### Initializing
###
if [ ! -d $out_dir ]
then
    mkdir -p $out_dir
fi
if [ ! -d $final_dir ]
then
    mkdir -p $final_dir
fi
module load R/4.4.2

#####################
### Ensembl binning
###
apptainer exec das_tool:1.1.7--r44hdfd78af_1.sif DAS_Tool \
 --write_bins \
 --threads $threads \
 --bins $bins_dir/metabat_ctg2bin.tsv,$bins_dir/maxbin_ctg2bin.tsv,$bins_dir/concoct_ctg2bin.tsv \
 --labels metabat,maxbin,concoct \
 --contigs $contigs_file \
 --outputbasename $prefix

##############################
### Copying final bins home
###
Rscript -e "args <- commandArgs(trailingOnly = T)
library(tidyverse)
bins.dir <- args[1]
final.dir <- args[2]
prefix <- args[3]
from.files <- list.files(str_c(prefix, '_DASTool_bins'), full.names = T)
to.files <- file.path(final.dir, basename(from.files))
to.files <- str_replace(to.files, 'fa$', 'fasta')
ok <- file.copy(from = from.files,
                to = to.files,
                overwrite = T)
" $bins_dir $final_dir $prefix

It looks like only 8 (or 10?) bins “survived” the ensemble binning, i.e. all the suggested bins from the three tools are narrowed down to these distinct bins.

Note that some bin files may have gotten _sub in their file names. Such bins have been edited by DASTool and are not longer identical to the corresponding bin file from the tool it originated. Some contigs have been removed due to overlap with some other bin. We cannot accept that the same contig is part of two distinct bins at the end.



3.3 Final words

Above we stated very clearly that you should do the binning sample by sample. This is also the case for this ensemble binning. Never mix bins/contigs from different samples as input to DASTool.

However, the output we get from this tool may now be stored together in one and the same folder. This means that if you have data from several samples, you should treat these separately all the way through the ‘pipeline’ we have seen in this module, but here, at the end of it, you may put them all into the same final folder (the final_bins folder we suggested above).

In the next module we will do some quality assessment of the bins, and then decide if some of them are actually re-constructions of the same genome. The latter may of course happen if you have several samples and some of the genomes are common to two or more of them.