1 Introduction

Here is a ‘pipeline’ and some recommendations concerning the steps needed to reconstruct Metagenome-Assembled Genomes (MAGs) from Illumina shotgun metagenome data. It is made for running on the orion HPC cluster, using mostly apptainer containers downloaded from galaxyproject.org.

All the scripts described below are found on orion under /mnt/users/larssn/MIDIV/Whole_genome_sequencing/Illumina/MAGmachine/.

The code you find under this folder must not be edited in this folder!

You should always copy the code to your own folder first, and then you may edit it. Think of the code here as a template that you give your personal touch once you have your own copy.

1.1 On orion

Make a folder under your HOME folder somewhere. Move into this folder (use cd). Then copy the scripts:

cp /mnt/users/larssn/MIDIV/Whole_genome_sequencing/Illumina/MAGmachine/*.sh .

You may not need all scripts for all projects, but it is probably wise to copy all anyway.

1.2 On sigma2

In this pipeline all software tools used are available as apptainer containers. These containers are currently found under /mnt/users/larssn/MIDIV/Whole_genome_sequencing/apptainer/, as .sif files. This means we may move this entire lot onto saga on the sigma2 facilities, and run the same pipeline there.





2 The metadata file

Before you start processing the data you need to prepare a file with some basic information about each sample. We refer to this as the metadata file.

We have our local way of organizing data, and these scripts are designed for this organization. This means

  • Raw data (fastq file pairs) to be processed are all found inside a single folder.
  • The name of this folder is what we refer to as the SequencingRunID.
  • All such SequencingRunIDs are part of some ProjectID, which is simply a unique text indicating a project. This links it to other sources of information about the project, but we do not pursue this here.

Inside the SequencingRunID folder we expect to find an Excel-file in addition to all the fastq files. This is the metadata file. The content of this metadata file must follow some rules:

  • There must be one row for each sample, i.e. for each file pair. If the raw data for a sample come split over several files, concatenate these first so that you end up with a single pair of fastq files (an R1 and an R2 file) for each sample.
  • There must be two columns named Rawfile_R1 and Rawfile_R2 in the metadata file, and these must list the fastq file names for each sample. NOTE: No paths, just the name of the files, e.g. SRR353639_R1.fq.gz.
  • There must be a column named SampleID with a text that uniquely identifies each sample. This will be used as part of file names, so avoid special characters like /. Use only (English) letters, numbers or underscore (_)!
  • There must be a column named SequencingRunID with the name of the folder, as explained above. This will be the same text in every row.
  • There must be a column named ProjectID with the name of the project, as explained above. This will be the same text in every row. Again, this will also be used as part of file names, so avoid special characters. Use only (English) letters, numbers and underscore!

Here is an example of how parts of the metadata file may look:

[Figure: an excerpt from a metadata Excel file]
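
In case the picture is not available, the required part of such a file could look something like this (all values below are made up for illustration; your ProjectID, SequencingRunID and file names will differ):

ProjectID   SequencingRunID  SampleID  Rawfile_R1          Rawfile_R2
Tutorials   Run_20220601     sample_1  SRR353639_R1.fq.gz  SRR353639_R2.fq.gz
Tutorials   Run_20220601     sample_2  SRR353640_R1.fq.gz  SRR353640_R2.fq.gz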

Note that this file may, and should, contain many more columns than the 5 mentioned above. In the picture we also see a column named RefID, and there are many more, containing data about the data. But for this pipeline only the 5 columns mentioned specifically above (ProjectID, SequencingRunID, SampleID, Rawfile_R1, Rawfile_R2) are required.





3 From raw reads to filtered fastq-files

The entire procedure consists of four basic steps that are done sample-by-sample, followed by some additional steps working on the collected results from all samples.

The first part we describe here is a processing of the raw reads which is always done sample-by-sample. This part requires comparatively little memory, and it runs quickly, mainly because the samples are processed in parallel using array jobs. Thus, you always start by running all samples through these steps.



3.1 The mainscript.sh

This part of the pipeline is started from the shell-script named mainscript.sh. Thus, you always start any jobs by

sbatch mainscript.sh

from your terminal, while standing in the folder where mainscript.sh is stored (see above). This script will then call the numbered scripts (script_0_quality_inspection.sh, script_1_filter_trim.sh, etc.) in the same folder as we go along.

Typically, the mainscript.sh is the only file you edit. You may change the other scripts as well, if you know what you are doing. We typically use RStudio to edit files, but any editor will do. There is no R code you run explicitly in this pipeline, but the shell scripts use R in some cases.

3.1.1 Array-jobs

This pipeline uses array jobs. This means we run the processing of several samples in parallel. The header of mainscript.sh looks something like this:

#!/bin/sh

#SBATCH --array=1-12%6                  # The samples to process
#SBATCH --ntasks=10                     # Threads. Use 20 for assembly, 10 otherwise
#SBATCH --nodes=1                       # Never change this
#SBATCH --job-name=MAGmachine           # Give your job a name
#SBATCH --mem=50G                       # Use 250G or more for assembly, 50 otherwise
#SBATCH --output=MAGmachine_%j_%a.log   # Inspect logfile when something crashes, then delete it
#SBATCH --partition=smallmem,hugemem    # We rarely use smallmem these days...
#SBATCH --time=24:00:00                 # If your job takes more than 24 hours, consider sigma2!

You need to edit the very first line, indicating the number of array jobs used here. In the code above it says #SBATCH --array=1-12%6, meaning we want to process samples 1, 2, …, 12, but never more than 6 at a time. These jobs will then be started in parallel, which is of course much faster than running them one by one. Thus, you need to specify the number of samples you have. Note that you do not need to process all samples in one go like this; you may also, for example, just process samples 1, 3 and 10 by:

#SBATCH --array=1,3,10

This is typically something we do when we, for some reason, need to re-compute something for these samples.

Note that each array-job produces its own log-file, named MAGmachine_xxx_yyy.log, where xxx and yyy are integers. If a job crashes, inspect the corresponding log-file to see what went wrong. Delete such log-files after you are done.

Note also that not all array jobs may be started right away; some will be waiting in the queue (pending), depending on available resources. Especially the assembly (step 3) requires a lot of resources and may wait in the queue for a long time.

IMPORTANT The orion HPC is a smallish system. Never start many array jobs that will run for a long time, since this will block the system for others. There is a guideline:

  • When using 10 threads per job, you may submit up to 12 array jobs.
  • When using 20 threads per job, you may submit up to 6 array jobs.

Thus, since we are nice people we stick to this.

3.1.2 Threads and memory

In the header you also specify how many threads to use and how much memory to reserve:

#SBATCH --ntasks=10                     # Threads. Use 20 for assembly, 10 otherwise
#SBATCH --mem=50G                       # Use 250G or more for assembly, <100 otherwise

Use 10 threads, except for the assembly step, where you may increase to 20. More threads means faster processing, but this levels out and there is little to gain from using more than 20 threads. All steps except the assembly require relatively little memory. Here we allocate 50GB, which is usually plenty. For assembly you need to reserve at least 250GB, which is what metaspades uses by default. We will come back to this below.

3.1.3 Settings

The first chunk of code in mainscript.sh is called Settings. Here are the lines:

work_folder=$SCRATCH/work                                            # All output (a lot!) will end up here
rawdata_folder=/mnt/smb/KBM_MIDIVlab/Rawdata/Whole_genome_sequencing/Illumina/<folder_name>  # Raw fastq files are here
metadata_file=$rawdata_folder/metadata_Tutorials.xlsx                # Excel file in rawdata_folder/
metadata_file_local=$work_folder/metadata_Tutorials.txt              # Local copy in plain text format will be written here
threads=10
# human_index=/mnt/users/larssn/MIDIV/Whole_genome_sequencing/hostgenomes/bowtie2/human_genome
# salmon_index=/mnt/users/larssn/MIDIV/Whole_genome_sequencing/hostgenomes/bowtie2/salmon_genome
# mouse_index=/mnt/users/larssn/MIDIV/Whole_genome_sequencing/hostgenomes/bowtie2/mouse_genome 

You need to specify your work_folder, which is where your output will end up. Since this pipeline produces a lot of output, this requires much storage space. Use either a folder under your $SCRATCH for this, or use some project folder where you have plenty of gigabytes available. Make a folder like this:

mkdir $SCRATCH/work

where you replace work with whatever you want to name this folder. You can visit or list the content of this folder, just like any other folder.

The rawdata_folder is where your raw data are stored. This is typically some folder under /mnt/smb/KBM_MIDIVlab/Rawdata/Whole_genome_sequencing/Illumina/. The metadata_file is the required Excel file in that folder, with the proper columns following the MiDiv standard from summer 2022, see the section above about this. The file named in metadata_file_local is a text-file version that will be created by the pipeline. This typically has the same name as metadata_file, but with .xlsx replaced by .txt.

The threads must always have the same value as in #SBATCH --ntasks=10 in the header. Remember, if you change to 20, edit both places.

Three lines are commented out here. They are locations of bowtie2 indexes, used in step 2 below. But step 2 is only used when data are from some host, here either human, salmon or mouse. If you are going to run step 2, and have data from humans, comment in the line specifying the human_index. With data from environments (e.g. sediments) you may omit step 2 below, and none of these indexes are needed.

Below we now describe the first steps called from mainscript.sh.

3.1.4 Reading the metadata

After the Settings you first find a chunk of R code for reading the metadata file:

module load R/4.3.3
Rscript -e "a <- commandArgs(trailingOnly = T)
if(!file.exists(a[2])){
  cat('***********************************************************************************\n')
  cat('Reading excel-file ', a[1], ' with metadata...\nwriting to text-file ', a[2], '\n')
  cat('***********************************************************************************\n')
  tb <- readxl::read_excel(a[1])
  write.table(tb, file = a[2], sep = '\\t', quote = F, row.names = F)
}" $metadata_file $metadata_file_local

Here the Excel-file specified in metadata_file is read, and a text copy of it is written to your work_folder. This file is never overwritten, and if you for some reason want to replace it you must first delete the old one and then re-run this code.

After this you find a couple of code chunks with awk code for reading the metadata. Never comment out this code!
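
To give an idea of what these chunks do: for each array job, the row given by $SLURM_ARRAY_TASK_ID is looked up in the text version of the metadata file, and the required columns are stored in shell variables. A minimal sketch of the idea (hypothetical code; the actual awk chunks in mainscript.sh may look different):

# Sketch: look up one column for the metadata row of this array job (line 1 is the header)
get_column () {
  awk -F'\t' -v col="$1" -v row="$SLURM_ARRAY_TASK_ID" \
    'NR == 1 {for (i = 1; i <= NF; i++) if ($i == col) c = i} NR == row + 1 {print $c}' "$metadata_file_local"
}
ProjectID=$(get_column ProjectID)
SequencingRunID=$(get_column SequencingRunID)
SampleID=$(get_column SampleID)
Rawfile_R1=$rawdata_folder/$(get_column Rawfile_R1)    # the real script may prepend the folder elsewhere
Rawfile_R2=$rawdata_folder/$(get_column Rawfile_R2)
file_stem=${ProjectID}_${SequencingRunID}_${SampleID}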



3.2 Step 0 - read quality inspection

Here we only inspect the reads using the fastqc software:

./script_0_quality_inspection.sh $Rawfile_R1 $Rawfile_R2 $work_folder/quality_inspection $threads

The inputs are the two files of raw reads (the content of the variables Rawfile_R1 and Rawfile_R2), the folder to output to (here $work_folder/quality_inspection) and the number of threads to use ($threads). The folder you specify as the third argument will be filled with one HTML file for each fastq file. You can inspect these files to see what the reads look like (number of read pairs, length, quality, adapters, etc.).
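
For reference, what this wrapper does essentially boils down to a single fastqc call, roughly like this (a sketch; the real script runs fastqc from an apptainer container and may add more options):

fastqc --threads $threads --outdir $work_folder/quality_inspection $Rawfile_R1 $Rawfile_R2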

It may be a good idea to run only this step on all reads before you do anything more. This means you first comment out all later steps in mainscript.sh, i.e. scroll down and comment out the lines calling the later scripts.

You may also omit this step, since its output is not required for any other step.



3.3 Step 1 - read filtering and trimming

Here we use the software bbduk to

  • Trim away adapter sequences
  • Filter out too poor reads
  • Trim off ends of reads if these are of too low quality

The code:

./script_1_filter_trim.sh $Rawfile_R1 $Rawfile_R2 $work_folder/fastq $file_stem $threads

is much like that in step 0, and you name the output folder in the third argument ($work_folder/fastq). This folder will now be filled with fastq-files having names following the pattern ProjectID_SequencingRunID_SampleID, followed by a proper extension. These three pieces of information are collected from the corresponding columns of the metadata file. When pasted together like this we refer to it as a file stem. Each sample then has a unique file stem. Note that for each array job there will be a corresponding file stem, and this identifies which sample is processed by each array job.

There will typically be three files for each sample:

  • The R1 reads surviving the filtering.
  • The R2 reads surviving the filtering.
  • The unpaired reads, whose mate (either R1 or R2) did not survive the filtering.

These files are always compressed to save disk space.

Inside script_1_filter_trim.sh you may edit the settings if you like. As it is now, it does not filter very hard, but this has proven to be hard enough in most cases. The Illumina data we get these days are usually of very high quality.

At the end of this script you find alternative code using another tool, trimmomatic, to do a similar job. This will filter much harder. To use it, you simply comment out the first part using bbduk, and comment in the latter part using trimmomatic.

Based on the inspection from step 0 above, you may edit the script script_1_filter_trim.sh. Under Settings you may for example adjust how the reads are trimmed. It looks like this:

K=23            # Set to 21 for low quality data
h_dist=1        # Set to  2 for low quality data
Q_trim=20       # Quality limit for 3' end trimming
Q_discard=20    # Discard reads with average Quality below this
trim_left=0     # Trim this many bases off 5' end
trim_right=150  # Trim bases beyond this position
min_len=30      # Discard reads shorter than this

If you increase the trim_left=0 value to trim_left=5, you trim 5 bases off the start of the reads. The trim_right=150 indicates from where trimming starts at the right end. If reads are 150 bases long, no trimming is done here, but with trim_right=145 we would also trim 5 bases off the right end. Notice that if you have reads longer than 150 bases, you must increase this value accordingly!
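
To illustrate roughly how such settings typically end up in a bbduk call, here is a sketch (hypothetical; the in/out file names are assumptions, and the real command in script_1_filter_trim.sh may use other options):

# Sketch only: how the Settings could map to bbduk parameters
# (ref=adapters.fa is a placeholder for an adapter reference file; the translation of
#  trim_left/trim_right into bbduk options is left out here, see the real script)
bbduk.sh in1=$Rawfile_R1 in2=$Rawfile_R2 \
  out1=$work_folder/fastq/${file_stem}_R1.fastq.gz \
  out2=$work_folder/fastq/${file_stem}_R2.fastq.gz \
  outs=$work_folder/fastq/${file_stem}_unpaired.fastq.gz \
  ref=adapters.fa ktrim=r k=$K hdist=$h_dist \
  qtrim=r trimq=$Q_trim maq=$Q_discard minlen=$min_len \
  threads=$threads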



3.4 Step 2 - de-contamination

This step is only required if your data are from some host organism, and you want to remove reads coming from this host. For samples from environments (e.g. sediments) this step is omitted.

Here is how it looks if we want to remove reads from a human host:

./script_2_decontaminate.sh $work_folder/fastq $file_stem $human_index $threads

The software bowtie2 will map all reads (both paired and unpaired) to the host genome, and the software samtools is then used to collect all reads that do not map and write these to corresponding fastq-files in the same folder ($work_folder/fastq in the code above). Thus, after this step you should have the same set of fastq-files as you had after the previous step, but they may have shrunk in size. Inspect the log-files to see how many reads were mapped; this is reported by bowtie2. If a large percentage of the reads map, and are thus lost here, the samples are heavily contaminated.
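
The idea corresponds roughly to the following (a sketch, not the exact code in script_2_decontaminate.sh; the output file names are hypothetical):

# Sketch: keep only read pairs where neither mate maps to the host genome
bowtie2 -p $threads -x $human_index \
  -1 $work_folder/fastq/${file_stem}_R1.fastq.gz \
  -2 $work_folder/fastq/${file_stem}_R2.fastq.gz \
| samtools fastq -f 12 \
    -1 ${file_stem}_host_filtered_R1.fastq.gz \
    -2 ${file_stem}_host_filtered_R2.fastq.gz \
    -s /dev/null -0 /dev/null -
# (the unpaired reads are mapped and filtered in a similar way)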

Step 2 also includes another read quality inspection, using fastqc. This will again produce an HTML file for each of the R1 and R2 fastq-files, in the same folder as the fastq-files. You may then inspect these and compare them to those from step 0, to see if quality has improved and how much data you have lost as paired-end reads (the unpaired reads are still there). You may also omit this part (comment it out) if you feel this inspection is not needed.

As mentioned above, you typically run the steps up to this point in one session, using 10 threads and 50GB memory. Once these steps are done for all samples, you comment them out and start the assembly (step 3).





4 The assembly

This is the step that takes most computational resources. You should typically process all samples through the above steps of the pipeline before you come here and start the assembly.

There are basically two ways ahead from here now

  • Assembly sample-by-sample. This is the simplest and most straightforward approach.
  • Co-assembly of super-samples. This requires an extra step, see below.

We describe how to do both below. In both cases you still use the mainscript.sh and array-jobs, but for co-assembly you need to do some extra processing, as we will see.



4.1 Step 3 - Assembly sample-by-sample

You typically assemble each sample by itself if the samples are quite different from each other. This is often the case if we have samples from outdoor environments. This is also the easiest and most straightforward way of doing the assembly.

Note: We always do a de-replication of the results at the end (see script_5_MAGmaker.sh below), and if the same genome is re-constructed in different samples, this will then be detected. Thus, a sample-by-sample assembly is always meaningful, even when samples are very similar.

Again you simply edit the mainscript.sh, and use one array-job for each sample. Comment out the previous steps (that you should have completed), and comment in step 3:

./script_3_assembly.sh $work_folder/fastq $file_stem $work_folder/assembly $threads

The first two arguments identify the fastq files with reads to assemble: the folder they are in and their file stem. The third argument is again the output folder, here we have chosen $work_folder/assembly. The last argument is the number of threads to use.

The code in script_3_assembly.sh may be edited. It uses metaspades for assembly, and the amount of memory used is by default 250GB. If you need more memory, you must change this inside script_3_assembly.sh in addition to editing the mainscript.sh heading (#SBATCH --mem=).
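
The central call inside script_3_assembly.sh is conceptually something like this (a sketch; the exact options and file names in the real script may differ):

# Sketch: assemble the filtered paired and unpaired reads for one sample
metaspades.py \
  -1 $work_folder/fastq/${file_stem}_R1.fastq.gz \
  -2 $work_folder/fastq/${file_stem}_R2.fastq.gz \
  -s $work_folder/fastq/${file_stem}_unpaired.fastq.gz \
  -t $threads -m 250 \
  -o $work_folder/assembly/tmp_${file_stem}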

This script will output one folder for each sample, inside the folder given as third argument above. These folders all start with tmp_ followed by the file stem. You should delete these folders after the job has finished. They are not deleted by this pipeline, since you may want to inspect some of their results. But they occupy a lot of space (metaspades produces a lot of output!), and must be deleted as soon as you can.

The only result copied from each tmp_ folder is the fasta file with the final contigs. These are stored in the output folder, with the file stem followed by _contigs.fasta for each sample. NB! There should be one such contigs-file for each sample. If some are lacking, please inspect the corresponding log-file to see what went wrong. Assembly is a big job with many steps, and plenty of things can go wrong here. You may then need to re-start the assembly for these samples (edit the --array= argument in the mainscript.sh header, see above). Note that as long as you keep the tmp_ folder, metaspades may be re-started from the last checkpoint, see the spades manual.

Note that when running the assembly, you need to edit the mainscript.sh to

  • Increase from 10 to 20 threads.
  • Increase from 50 to 250 gigabytes of memory (or more).

Finish the assembly for all samples before you proceed.



4.2 Step 3 - Co-assembly

In cases when some samples contain more or less the same organisms, it may be beneficial to merge the reads from several samples into super-samples. This will give you more coverage per genome, and may result in better contigs. Note that if these super-samples become too big, it may not be possible to assemble them on orion. It simply takes too much memory. It will also take a lot more time.

4.2.1 Super-samples

In order to do a co-assembly, your metadata_file (the Excel-file, see above) must contain a column named SuperSampleID. This must contain text indicating which of the original samples should be grouped into super-samples. In the Tutorial here, the SampleID and the SuperSampleID columns look like this:

SampleID  SuperSampleID
sample_1  low_diversity
sample_2  low_diversity
sample_3  low_diversity
sample_4  high_diversity
sample_5  high_diversity
sample_6  high_diversity

This indicates that the 6 samples should be grouped into 2 super-samples, where the first 3 and the last 3 samples should be merged, and the assembly done for these 2 super-samples.

To create the super-samples, use the script script_super_samples.sh. This is not called from mainscript.sh! The reason is that this is done once, not sample-by-sample. In this script you need to edit the Settings part:

work_folder=work                                         # copied from mainscript.sh
metadata_file_local=$work_folder/metadata_Tutorials.txt  # copied from mainscript.sh
fastq_folder=$work_folder/fastq                          # copied from mainscript.sh step 1 or 2
cofastq_folder=$work_folder/cofastq                      # new folder with new fastq-files
metadata_file_coassembly=$work_folder/metadata_Tutorials_coassembly.txt  # new metadata file

The settings of work_folder and metadata_file_local are identical to what you have in mainscript.sh, so copy them from there. Also, you need to specify where the filtered fastq-files from step 1 or 2 above are (fastq_folder); these are the files that will be merged.

Next, you need to specify the folder to store the super-samples (cofastq_folder), i.e. the new fastq-files. This folder will be created. Finally, you also need a new copy of the metadata file, in metadata_file_coassembly. The reason for the latter is that the SuperSampleID now takes over the role of SampleID, and we should store this under a new name to avoid messing up the original. You may need to re-run the early steps later, and then you need the original metadata file again.

You run this by

sbatch script_super_samples.sh

The entire merging is done in R. This may seem strange, but the file.append() function in R is super fast and can actually append compressed files. Avoiding the need to explicitly de-compress/compress saves a lot of time!

4.2.2 The assembly

Once you have the super-samples, you need to edit the mainscript.sh slightly before you proceed:

  • The array-jobs must now be one for each super-sample, not for each sample. Change this in the mainscript.sh header.
  • The metadata_file_local must now be the new metadata-file created by script_super_samples.sh. Edit this in the Settings of mainscript.sh.

The call to the step 3 script in mainscript.sh is identical to what we did for the sample-by-sample assembly above, just with (potentially) other folder names:

./script_3_assembly.sh $work_folder/cofastq $file_stem $work_folder/coassembly $threads

Notice we changed the folder name in the first argument, to use the reads in $work_folder/cofastq this time. This must be the same folder you used as cofastq_folder in script_super_samples.sh. We also output to a folder named $work_folder/coassembly here.

You may have to increase the memory from the default 250G to 500G or even more, if the super-samples are huge. Again, consider using sigma2 instead of orion if the super-samples are very large.

Notice that in the last step below, you now continue to work on the coassembly of the super-samples, with one array job for each super-sample.





5 Last step with mainscript.sh

This is the last part where you use the mainscript.sh solution. This means you process everything sample-by-sample, or super-sample-by-super-sample if you used co-assembly, see above.



5.1 Step 4 - binning

Binning means clustering contigs that are believed to come from the same genome. Thus, a bin is a collection of contigs, and is an estimate of a genome. The input to this step is

  • The contigs you got for each sample or super-sample above.
  • The reads for each sample/super-sample that were used as input to the assembly.

You comment out all steps in the mainscript.sh, except for the last:

./script_4_binning.sh $work_folder/assembly $work_folder/fastq $file_stem $work_folder/binning $threads

First you specify the folder with the contigs, next the folder with all the reads, then the file stem, output folder and threads as before.

Notice that if we did a co-assembly as described above, we must use $work_folder/coassembly as the first argument and $work_folder/cofastq as the second.

For binning we use three different software tools:

  • metabat2
  • maxbin2
  • concoct

This means we make three sets of bins from all contigs. Binning is a step where there is no clear ‘first choice’ software, and the results we get from different tools may sometimes vary a lot. This is why we use three tools here, and consider all their output.

Then, once these are done, we use the software DAS_Tool to make an ensemble binning. This essentially means we take all bins from the three tools above, and try to reduce them to one set of bins. A score for each bin is computed based on some reference single-copy genes. Bins are then ranked by this score, the best first. If the best bin is above a given score-threshold it is accepted as a final bin. Contigs belonging to this bin are then removed from all other bins, and all remaining bins are scored again. This selection is repeated as long as the best remaining bin is above the score-threshold. Final bin files with _sub in their names have been reduced through this process, i.e. they have lost some contigs compared to the original binning. This process may reduce the contamination in overly large bins.
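
For the curious, the DAS_Tool step corresponds roughly to a call like this (a sketch; the contig-to-bin tables are produced from the three binning results earlier in the script, all file names here are hypothetical, and the option syntax depends on the DAS_Tool version):

# Sketch: ensemble binning with DAS_Tool from the three binning results
DAS_Tool \
  -i metabat2_contigs2bin.tsv,maxbin2_contigs2bin.tsv,concoct_contigs2bin.tsv \
  -l metabat2,maxbin2,concoct \
  -c ${file_stem}_contigs.fasta \
  -o ${file_stem}_DASTool \
  --score_threshold 0.5 \
  --write_bins -t $threads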

5.1.1 Settings

If you inspect the script_4_binning.sh you find the Settings section:

maxbin_markerset=107  #40    # consider using 40 for samples with many 'new' taxa
concoct_Kmer_length=4
concoct_read_length=150

You may want to use the smaller (40) marker set for maxbin in cases where you have data with many ‘new’ taxa, e.g. environmental samples. The concoct settings you may also edit if reads are longer/shorter than 150 bases.

Only contigs of at least 1000 bases are ever included in a bin; for metabat2 this threshold is actually hardcoded to 1500 bases.

The final output will be in the folder you give as the fourth argument ($work_folder/binning in the command line above). Each sample will be processed in a subfolder under this named by its file_stem. After completion, you find the raw bins in this subfolder, but the final bins from DAS_Tool are all moved to the output folder (e.g. $work_folder/binning/). In the subfolder you also find a summary table from DAS_Tool that lists some statistics for each final bin, in case you are interested in this. All other files in this subfolder are deleted.

Upon completion the output folder will (hopefully) contain some fasta files with the bins. In the file names you find some numbers indicating the bin. Note that these numbers are just random labels, bin number 1 from one software does not necessarily correspond to bin 1 from the other.

Note that there may be zero bins! All three tools have some requirements for making a bin, and the DAS_Tool has a score-threshold, and if the contigs are too poor there will be no bins.

Note that this step does not require a lot of resources, and you should go back to using 10 threads and only 50GB of memory, as in the first steps above. It will, however, take some time since there are 7 different software tools involved here (seqkit, bowtie2, samtools, metabat2, maxbin2, concoct, DAS_Tool) in addition to some R code.





6 Practical advice so far



6.1 Split the pipeline into three

You may run all steps in the pipeline in one go, but in most cases we have found it convenient to split it into three separate parts:

  1. Run steps 0-1-2 in one go (step 2 only for host data) for all samples. These steps are fairly small, and may all be run with 10 threads and 50GB memory. Each array-job should take less than an hour, unless data sets are huge. Simply comment out steps 3 and 4, and sbatch mainscript.sh.
  2. Then run only step 3 for all samples. This requires more threads and more memory. Instead of wasting these resources on all steps, you now run this step separately with the extended resources, i.e. you edit the mainscript.sh prior to starting this. Comment out all except step 3, save and sbatch again. Remember, never more than 6 array jobs at a time on orion!
  3. Finally run step 4 for all samples, with 10 threads and memory as in part 1) again. Comment out all except step 4.



6.2 Clean your work_folder

The work_folder set in the mainscript.sh will fill up with a lot of folders/files. It is a good idea to clean this to avoid filling up your disk quota. Delete all the tmp_ folders under $work_folder/assembly/ (or $work_folder/coassembly/). In the binning folder you may also delete the files that are not bin-files, but most of this should already have been deleted by the binning script code.

The files you always keep are:

  • The filtered reads under $work_folder/fastq/ and/or $work_folder/cofastq/. You will need these below.
  • The contigs under $work_folder/assembly/. These are smallish fasta-files.
  • The bins under $work_folder/binning/. These are all fasta-files ending with .fasta.





7 The MAGmaker

The basic output from the steps above are the binned contigs found in some folder ($work_folder/binning/ from mainscript.sh). From now on we work on the entire set of bins, and there is no sample-by-sample processing anymore. Thus, we no longer use the mainscript.sh.

In the script script_5_MAGmaker.sh we now do the following processing of all bins:

  • Quality assessment using checkm2
  • De-replication using dRep
  • Taxonomy assignments using gtdbtk



7.1 The script_5_MAGmaker.sh

You need to edit the Settings in this script:

work_folder=work
bins_folder=$work_folder/binning      # folder with all bins from script_4_binning.sh
out_folder=$work_folder/MAGs          # all output will end here
min_completeness=75                   # minimum percent completeness required for a MAG (default 75)
max_contamination=25                  # maximum percent contamination allowed for a MAG (default 25)
threads=10

You of course need to specify where your bins are, in bins_folder. The out_folder is where your MAG results will end up. Inside this folder three new folders will be created: one for checkm2, one for dRep, and one for gtdbtk.

The min_completeness and max_contamination thresholds are the requirements used to say that a bin qualifies to be a MAG. The default for dRep is 75% and 25%, respectively, but you may change this here. Note that checkm2 will output quality results for all bins, regardless of these thresholds, but the final MAGs must meet these requirements.

Using 10 threads is sufficient for this step, and the two last settings should never be changed as long as you are on orion.
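
Inside script_5_MAGmaker.sh the three tools are invoked roughly along these lines (a sketch only; the exact options, extensions and intermediate file names, like the genome-info table passed to dRep, are assumptions, and newer gtdbtk versions may require additional database/ANI-screen options):

# Sketch: quality assessment, de-replication and taxonomy assignment
checkm2 predict --input $bins_folder -x fasta \
  --output-directory $out_folder/checkm --threads $threads

# genome_info.csv: completeness/contamination per bin, derived from the checkm2 report
dRep dereplicate $out_folder/dRep -g $bins_folder/*.fasta \
  --genomeInfo genome_info.csv \
  -comp $min_completeness -con $max_contamination -p $threads

gtdbtk classify_wf --genome_dir $out_folder --out_dir $out_folder/gtdbtk \
  -x fasta --cpus $threads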

You run this script in the usual way by:

sbatch script_5_MAGmaker.sh

NOTE: The script will by default allocate 75 gigabytes of memory:

#SBATCH --mem=75G                   # Adjust if oom kill

If you have many bins this may not be enough and your job will be killed by an Out-Of-Memory (OOM) error. Look for this in the log-file if it crashes, and then increase the memory allocation and re-start.



7.2 The checkm2 output

We run checkm2 explicitly here for quality assessment of all bins. The output will be found in $out_folder/checkm/. The results from checkm2 are needed by dRep, which is why we run checkm2 first.

You may look into the file $out_folder/checkm/quality_report.tsv to inspect the results for all bins. This is a tab-delimited text file, and easy to read into R.



7.3 The dRep output

The output from dRep is found in $out_folder/dRep/. The fasta-files with the actual MAGs are first stored in $out_folder/dRep/dereplicated_genomes/, but they will be moved up to $out_folder once dRep has finished. Thus, you find your MAGs in $out_folder. There may, of course, be zero MAGs if none of the bins meet the completeness and contamination criteria used!

In $out_folder/dRep/data_tables/ you find several comma-separated text files with table results that you easily read into R. The file Cdb.csv gives you information about how the bins were clustered. Ideally, a ‘good’ bin should be ‘found’ by all software tools, and then cluster here. Note that this file lists results for all bins, not only those that meet the criteria to become MAGs. The file Widb.csv lists all bins and some score given to them by dRep.

Note that the cluster labels have two integers, e.g. 2_1. The first number indicates a primary cluster, i.e. clusters 2_1 and 2_2 are distinct, but since they share the same primary number, they are quite similar to each other. You may find such MAGs have very similar taxonomy (see below).

Some of this information you also find in the final table $out_folder/MAG_table.tsv, see below.



7.4 The gtdbtk output

We run gtdbtk with the MAGs as input, to get some taxonomy for them. This uses the taxonomy of the GTDB (Genome Taxonomy Database), which is not identical to the NCBI Taxonomy.

From the output in the folder $out_folder/gtdbtk/ we are typically interested in the summary files. Look for the files gtdbtk.bac120.summary.tsv and gtdbtk.ar53.summary.tsv. There are separate files for bacteria and archaea, and one or both may be lacking if no such organisms were annotated. It is typically the first two columns of these files we are interested in:

  • The name of the MAG (user_genome), which is simply the fasta file name without extension
  • The assigned taxonomy (classification)

The taxonomy is a string that may look like this:

"d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Sulfurovaceae;g__Sulfurovum;s__Sulfurovum sp000010345"

In this case the classification is all the way down to species, but it may of course be shorter.



7.5 Summary table

After these software tools have been run, some R code produces a summary table in the $out_folder/MAG_table.tsv. This lists one row for each MAG, and with columns listing some information for each MAG obtained from the folders $out_folder/checkm/, $out_folder/dRep/ and $out_folder/gtdbtk/. You may of course add to this table by reading more of the output in these folders.





8 MAG coverages

Once some MAGs have been obtained it is natural to ask how abundant they are in the different samples. This is quantified by the coverage in each sample.



8.1 The script_6_MAGcoverage.sh

The coverage for each MAG in each sample is computed by this script. The basic input you need here is the reads for all samples and the MAG fasta files. The reads from each sample are then mapped to the MAGs, and coverages are computed, one value for each MAG in each sample. This is output as a matrix, much like the read-count matrices you obtain from 16S sequencing, but the numbers are coverages instead of read counts.

In the Settings of this script you recognize some folder names used previously, i.e. $fastq_folder should be the reads folder (see mainscript.sh) and the $MAGs_folder should contain the MAGs (see script_5_MAGmaker.sh). You specify your $out_folder for the results of this step.

This script must be run three times, where you comment in/out different parts of the script each time.

8.1.1 PART 1

Comment in PART 1 of the script, and comment out PART 2 and 3. Also, this part must be run once and only once, so in the heading of the script you comment out the array-jobs:

# #SBATCH --array=1-6       # one job per sample

Then you simply

sbatch script_6_MAGcoverage.sh

This first part just collects all contigs from all MAGs and makes a bowtie2 index of them. Since we use three different tools for binning, the same contig may be found in two or more bins, and this is sorted out here.
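
In essence, PART 1 does something like the following (a sketch; the real script may resolve duplicated contigs and name the index differently):

# Sketch: pool contigs from all MAGs, drop duplicated contig names, build a bowtie2 index
cat $MAGs_folder/*.fasta | seqkit rmdup --by-name > $out_folder/all_MAG_contigs.fasta
bowtie2-build --threads $threads $out_folder/all_MAG_contigs.fasta $out_folder/all_MAG_contigs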

8.1.2 PART 2

Comment in PART 2 and comment out PART 1 and 3. Here you must use array-jobs! You now use one array job for each sample, just as you did in mainscript.sh. The reads from each sample are then mapped to the same bowtie2 index, containing all contigs from all bins. This takes time, and this is why we split it by using array-jobs.

In the heading you comment in the use of array-job, like this:

#SBATCH --array=1-6       # one job per sample

Remember to edit the number of samples; here the code assumes 6 samples. Once done, sbatch as before.

The log-files produced by this script are important! This is why this script will always output the log-files to a folder named coverage_log inside the folder where you have your scripts. Never change this! The reason is that the total number of reads mapped is written by bowtie2 to these log-files. This information is collected by the script and appears in the final output, see below.

8.1.3 PART 3

Once all the array-jobs are done, comment out PART 1 and 2 and comment in PART 3. Again you comment out the use of array-jobs; this part should be run once and only once, just like PART 1.

In this part the script reads all results so far, and outputs two text files to the $out_folder:

  • MAG_coverages.tsv which is a tab-delimited text file. The first column contains a MAG_id text, and the remaining columns, one for each sample, contain the coverages in the various samples. Read it into R like any other tab-delimited file.
  • MAG_mapping_fraction.tsv which is another tab-delimited file with one value for each sample. This indicates how large fraction of the reads for that sample mapped to the contigs.

The coverages in MAG_coverages.tsv you may now use directly as a measure of ‘abundance’ for each MAG in each sample. You should transform them into relative values, i.e. such that all values for a sample (column) sum to 1.0.

What is the point of the MAG_mapping_fraction.tsv? Well, the MAGs are in general not the only genomes in your samples, and only some fraction of the total amount of reads will map to the MAGs. Thus, the coverages do not reflect the total abundance of your MAGs. The relative values you compute from MAG_coverages.tsv reflect relative abundance among the MAGs. If you want to say something about how abundant the MAGs are in the sample as a whole, you must multiply the relative abundances by the mapping fractions. As an example, say a MAG has a relative abundance of 0.2 in some sample, based on the coverages (it has 20% of the sum of coverages for that sample). But, for this sample the mapping fraction is 0.5. Then, this MAG actually has an abundance of 0.1 (0.2 * 0.5) in the total sample.

Actually, this simple adjustment assumes that the reads we did not map come from genomes of the same length as our MAGs, i.e. if 50% of the reads map this corresponds to 50% of the coverage. This is in most cases an OK assumption to make, but you realize there are some uncertainties here.
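
If you want to do this adjustment in R, it could look something like this (a sketch; check the actual layout of the two files first, the column handling below is an assumption):

# Sketch: relative abundances among the MAGs, then adjusted by the per-sample mapping fraction
cvr <- read.delim("MAG_coverages.tsv")          # first column: MAG_id, remaining columns: samples
X <- as.matrix(cvr[, -1])
rownames(X) <- cvr[[1]]
rel.abd <- sweep(X, 2, colSums(X), "/")         # columns now sum to 1.0
frac <- read.delim("MAG_mapping_fraction.tsv")  # one mapping fraction per sample
# assuming one column per sample, in the same order as in MAG_coverages.tsv:
tot.abd <- sweep(rel.abd, 2, as.numeric(frac[1, ]), "*")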





9 MAG annotations

This last script is used for getting some functional annotation of the MAGs. We use the software DRAM for this, and this is the only step where we use conda instead of apptainer containers. The reason is the (silly?) way that the locations of the databases are implemented in DRAM.

9.1 The script_7_MAGannotations.sh

As for the coverages, this script must be run in several parts, in this case just two. The first part uses array-jobs, while the last part is run once all array-jobs are done.

9.1.1 PART 1

Comment in PART 1 and comment out PART 2.

In PART 1 we use array-jobs, one job per MAG. Notice this: one per MAG, not one per sample! Make certain the heading specifies this:

#SBATCH --array=1-101%10        # one job per MAG

Here there are obviously 101 MAGs; edit this accordingly.

The main job in annotation is to predict genes for each MAG and then search these genes against some databases of known functional genes. This takes a long time. In DRAM you can enter a series of input MAGs, but they will then be treated sequentially, i.e. one by one. This is slow. We therefore split this into array-jobs instead, and annotate one MAG in each job, in parallel.
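
Each array job then runs something along the lines of this sketch (hypothetical; check the option names against the installed DRAM version, and the way the real script picks the MAG for each job may differ):

# Sketch: each array job annotates one MAG with DRAM
mag=$(ls $MAGs_folder/*.fasta | sed -n "${SLURM_ARRAY_TASK_ID}p")
DRAM.py annotate -i $mag -o $out_folder/DRAM$SLURM_ARRAY_TASK_ID --threads $threads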

The output from this part you will find in folders $out_folder/DRAM1, $out_folder/DRAM2, $out_folder/DRAM3…etc.

9.1.2 PART 2

Comment out PART 1 and comment in PART 2. Also comment out the array-jobs in the heading, this part should be run once and only once.

This part of the script simply collects all results from the $out_folder/DRAM1, $out_folder/DRAM2, $out_folder/DRAM3… folders and writes them to some files in the $out_folder. These files are the same as you would get if you ran DRAM sequentially. Finally, the distill step of DRAM is run on all these results, with output to the folder $out_folder/distills/. In here you find some files with annotation information, see the DRAM manual for more details. Also, the file $out_folder/annotations.tsv is a tab-delimited file that contains all annotation results in case you want to explore all details, not only the distilled results.