Here is a ‘pipeline’ and some recommendations concerning the steps needed to reconstruct Metagenome Assembled Genomes (MAGs) from Illumina shotgun metagenome data. It is made for running on the orion HPC cluster, using mostly apptainer containers downloaded from galaxyproject.org.
All the scripts described below are found on orion under /mnt/users/larssn/MIDIV/Whole_genome_sequencing/Illumina/MAGmachine/.
The code you find under this folder must not be edited in this folder!
You should always copy the code to your own folder first, and then you may edit. Think of the code here as a template that you give your personal touch once you have your own copy.
Make a folder somewhere under your HOME folder. Move into this folder (use cd). Then copy the scripts:
cp /mnt/users/larssn/MIDIV/Whole_genome_sequencing/Illumina/MAGmachine/*.sh .
You may not need all scripts for all projects, but it is probably wise to copy all anyway.
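Putting it together, a minimal sketch (the folder name MAGmachine under your HOME is just an assumption, use any name you like):
mkdir -p ~/MAGmachine
cd ~/MAGmachine
cp /mnt/users/larssn/MIDIV/Whole_genome_sequencing/Illumina/MAGmachine/*.sh .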
In this pipeline all software tools used are available as apptainer containers. These containers are currently found under /mnt/users/larssn/MIDIV/Whole_genome_sequencing/apptainer/ as .sif files. This means we may move this entire lot to saga on the sigma2 facilities, and run the same pipeline there.
Before you start processing the data you need to prepare a file with some basic information about each sample. We refer to this as the metadata file.
We have our local way of organizing data, and these scripts are designed for this organization. Inside the SequencingRunID folder (the folder where the raw fastq files for a sequencing run are stored) we expect to find an Excel-file in addition to all the fastq files. This is the metadata file. The content of this metadata file must follow some rules:
- There must be columns named Rawfile_R1 and Rawfile_R2 in the metadata file, and these must list the fastq file names for each sample. NOTE: No paths, just the names of the files, e.g. SRR353639_R1.fq.gz.
- There must be a column SampleID with a single text that uniquely identifies each sample. This will be used as part of file names, therefore avoid special characters like /. Use only (english) letters, numbers or underscore (_)!
- There must be a column SequencingRunID with the name of the folder, as explained above. This will be the same text in every row.
- There must be a column ProjectID with the name of the project, as explained above. This will be the same text in every row. Again, this will also be used as part of file names, so avoid special characters. Use only (english) letters, numbers and underscore!

Here is an example of how the metadata file may look, at least parts of it:
Note that this file may, and should, contain many more columns than the 5 we mentioned above. In this picture we see a column named RefID as well, and there are many more, containing data about the data. But, for this pipeline only the 5 columns mentioned specifically above (ProjectID, SequencingRunID, SampleID, Rawfile_R1, Rawfile_R2) are required.
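If you want to verify the metadata file before you start, here is a minimal R sketch (the file name is the one used in this Tutorial, replace with your own):
required <- c("ProjectID", "SequencingRunID", "SampleID", "Rawfile_R1", "Rawfile_R2")
tb <- readxl::read_excel("metadata_Tutorials.xlsx")
setdiff(required, colnames(tb))   # should print character(0), i.e. no missing columns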
The entire procedure consists of four basic steps that are done sample-by-sample, followed by some additional steps working on the collected results from all samples.
The first part we describe here is a processing of the raw reads which is always done sample-by-sample. This part requires comparatively little memory and time, mainly due to parallel computing using array jobs. Thus, you always start by running all samples through these steps.
mainscript.sh
This part of the pipeline is started from the shell-script named mainscript.sh. Thus, you always start a job by
sbatch mainscript.sh
from your terminal, standing in the folder where mainscript.sh is stored (see above). This script will then call on the numbered scripts (script_0_fastqc.sh, script_1_filter_trim.sh, etc.) in the same folder as we go along.
Typically, the mainscript.sh
is the only file you edit.
You may change the other scripts as well, if you know what you are
doing. We typically use RStudio to edit files, but any editor will do.
There is no R code you run explicitly in this pipeline, but the shell
scripts use R in some cases.
This pipeline uses array jobs. This means we run the
processing of several samples in parallel. The header of
mainscript.sh
looks something like this:
#!/bin/sh
#SBATCH --array=1-12%6 # The samples to process
#SBATCH --ntasks=10 # Threads. Use 20 for assembly, 10 otherwise
#SBATCH --nodes=1 # Never change this
#SBATCH --job-name=MAGmachine # Give your job a name
#SBATCH --mem=50G # Use 250G or more for assembly, 50 otherwise
#SBATCH --output=MAGmachine_%j_%a.log # Inspect logfile when something crashes, then delete it
#SBATCH --partition=smallmem,hugemem # We rarely use smallmem these days...
#SBATCH --time=24:00:00 # If your job takes more than 24 hours, consider sigma2!
You need to edit the --array line, indicating the number of array jobs used here. In the code above it says #SBATCH --array=1-12%6, meaning we want to process samples 1, 2, ..., 12, but never more than 6 at a time. These jobs will then be started in parallel, which is of course much faster than running them one by one. Thus, you need to specify the number of samples you have. Note that you do not need to process all samples in one go like this; you may, for example, just process samples 1, 3 and 10 by:
#SBATCH --array=1,3,10
This is typically something we do when we, for some reason, need to re-compute something for these samples.
Note that each array-job produces its own log-file, named MAGmachine_xxx_yyy.log, where xxx and yyy are integers. If a job crashes, inspect the corresponding log-file to see what went wrong. Delete such log-files after you are done.
Note also that not all array jobs may be started right away; some may be waiting in the queue (pending). This depends on the available resources. Especially the assembly (step 3) requires a lot of resources and may wait in the queue for a long time.
IMPORTANT The orion HPC is a smallish system. Never start many array jobs that will run for a long time, since this will block the system for others. There is a guideline:
Thus, since we are nice people we stick to this.
In the header you also specify how many threads to use and how much memory to reserve:
#SBATCH --ntasks=10 # Threads. Use 20 for assembly, 10 otherwise
#SBATCH --mem=50G # Use 250G or more for assembly, <100 otherwise
Use 10 threads, except for the assembly step, where you may increase to 20. More threads means faster processing, but this levels out and there is little to gain from using more than 20 threads. All steps except the assembly require relatively little memory. Here we allocate 50GB, which is usually plenty. For assembly you need to reserve at least 250GB, which is what metaspades uses by default. We will come back to this below.
The first chunk of code in mainscript.sh
is called
Settings. Here are the lines:
work_folder=$SCRATCH/work # All output (a lot!) will end up here
rawdata_folder=/mnt/smb/KBM_MIDIVlab/Rawdata/Whole_genome_sequencing/Illumina/<folder_name> # Raw fastq files are here
metadata_file=$rawdata_folder/metadata_Tutorials.xlsx # Excel file in rawdata_folder/
metadata_file_local=$work_folder/metadata_Tutorials.txt # Local copy in plain text format will be written here
threads=10
# human_index=/mnt/users/larssn/MIDIV/Whole_genome_sequencing/hostgenomes/bowtie2/human_genome
# salmon_index=/mnt/users/larssn/MIDIV/Whole_genome_sequencing/hostgenomes/bowtie2/salmon_genome
# mouse_index=/mnt/users/larssn/MIDIV/Whole_genome_sequencing/hostgenomes/bowtie2/mouse_genome
You need to specify your work_folder, which is where your output will end up. Since this pipeline produces a lot of output, this requires a lot of storage space. Use either a folder under your $SCRATCH for this, or some project folder where you have plenty of gigabytes available. Make a folder like this:
mkdir $SCRATCH/work
where you replace work
with whatever you want to name
this folder. You can visit or list the content of this folder, just like
any other folder.
The rawdata_folder is where your raw data are stored. This is typically some folder under /mnt/smb/KBM_MIDIVlab/Rawdata/Whole_genome_sequencing/Illumina/. The metadata_file is the required Excel file in that folder, with the proper columns following the MiDiv standard from summer 2022, see the section above about this. The file named in metadata_file_local is a text-file version that will be created by the pipeline. This typically has the same name as metadata_file but with .xlsx replaced by .txt.
The threads
must always have the same value as in
#SBATCH --ntasks=10
in the header. Remember, if you change
to 20, edit both places.
Three lines are commented out here. They are the locations of bowtie2 indexes, used in step 2 below. But step 2 is only used when the data are from some host, here either human, salmon or mouse. If you are going to run step 2, and have data from humans, comment in the line specifying the human_index. With data from environments (e.g. sediments) you may omit step 2 below, and none of these indexes are needed.
Below we now describe the first steps called from
mainscript.sh
.
After the Settings you first find a chunk of R code for reading the metadata file:
module load R/4.3.3
Rscript -e "a <- commandArgs(trailingOnly = T)
if(!file.exists(a[2])){
cat('***********************************************************************************\n')
cat('Reading excel-file ', a[1], ' with metadata...\nwriting to text-file ', a[2], '\n')
cat('***********************************************************************************\n')
tb <- readxl::read_excel(a[1])
write.table(tb, file = a[2], sep = '\\t', quote = F, row.names = F)
}" $metadata_file $metadata_file_local
Here the Excel-file specified in metadata_file
is read,
and a text copy of it is written to your work_folder
. This
file is never over-written, and if you for some reason want to replace
it you must first delete the old one and then re-run this code.
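Thus, to force a re-read of the Excel file, delete the text copy first (a minimal sketch, assuming the file names from the Settings above):
rm $SCRATCH/work/metadata_Tutorials.txt   # then sbatch mainscript.sh again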
After this you find a couple of code chunks with awk code for reading the metadata. Never comment out this code!
Here we only inspect the reads using the fastqc
software:
./script_0_quality_inspection.sh $Rawfile_R1 $Rawfile_R2 $work_folder/quality_inspection $threads
The input is the two files of raw reads (the contents of variables Rawfile_R1 and Rawfile_R2), the folder to output to (here $work_folder/quality_inspection) and the number of threads to use ($threads). The folder you specify as the third argument will be filled with one HTML file for each fastq file. You can inspect these files to see what the reads look like (number of read pairs, length, quality, adapters etc.).
It may be a good idea to run only this step on all reads before you do anything more. This means you first comment out all later steps in mainscript.sh, i.e. scroll down and comment out all later lines.
You may also omit this step, since its output is not required for any other step.
Here we use the software bbduk to quality-trim the reads, remove adapters, and discard reads that are too short or of too low quality. The code:
./script_1_filter_trim.sh $Rawfile_R1 $Rawfile_R2 $work_folder/fastq $file_stem $threads
is much like that in step 0, and you name the output folder in the third argument ($work_folder/fastq). This folder will now be filled with fastq-files having names following the pattern ProjectID_SequencingRunID_SampleID, followed by a proper extension. These three pieces of information are collected from the corresponding columns of the metadata file. When pasted together like this we refer to it as a file stem. Each sample then has a unique file stem. Note that for each array job there will be a corresponding file stem, and this identifies which sample is processed by each array job.
There will typically be three files for each sample: the R1 and R2 files with the read pairs that passed the filtering, plus a file with unpaired reads (reads whose mate was discarded). These files are always compressed to save disk space.
Inside the script_1_filter_trim.sh
you may edit the
settings if you like. As it is now, it will not filter very hard, but it
has proven to be hard enough in most cases. The Illumina data we get
these days are usually of very high quality.
At the end of this script you find alternative code for using another software, trimmomatic, to do a similar job. This will filter much harder. You then simply comment out the first part using bbduk, and comment in the latter part using trimmomatic.
Based on the inspection from step 0 above, you may edit the script script_1_filter_trim.sh. Under Settings you may for example trim the reads. It looks like this:
K=23 # Set to 21 for low quality data
h_dist=1 # Set to 2 for low quality data
Q_trim=20 # Quality limit for 3' end trimming
Q_discard=20 # Discard reads with average Quality below this
trim_left=0 # Trim this many bases off 5' end
trim_right=150 # Trim bases beyond this position
min_len=30 # Discard reads shorter than this
If you increase the trim_left=0 value to trim_left=5, you trim 5 bases off the start of the reads. The trim_right=150 indicates from where trimming starts at the right end. If reads are 150 bases long, no trimming is done here, but with trim_right=145 we would also trim 5 bases off the right end. Notice that if you have reads longer than 150 bases, you must increase this value accordingly!
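As an illustration, here is a minimal sketch of these settings for hypothetical 2x250 bp reads (the values are assumptions, adjust to your own data):
trim_left=0      # no trimming off the 5' end
trim_right=250   # matches the 250 bp read length, so no positional 3' trimming
min_len=30       # unchanged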
This step is only required if your data are from some host organism, and you want to remove reads coming from this host. For samples from environments (e.g. sediments) this step is omitted.
Here is how it looks like if we want to remove reads from a human host:
./script_2_decontaminate.sh $work_folder/fastq $file_stem $human_index $threads
The software bowtie2 will map all reads (both paired and unpaired) to the host genome, and the software samtools is then used to collect all reads that do not map and write these to corresponding fastq-files in the same folder ($work_folder/fastq in the code above). Thus, after this step you should have the same set of fastq-files as you had after the previous step, but they may have shrunk in size. Inspect the log-files to see how many reads were mapped; this is reported by bowtie2. If a large percentage of the reads map, and are then lost here, the samples are very contaminated.
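A quick way to get an overview is to grep the bowtie2 summary line out of the array-job log-files (a minimal sketch, assuming the log-file naming from the mainscript.sh header above):
grep -H "overall alignment rate" MAGmachine_*_*.log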
Step 2 also includes another read quality inspection, using fastqc. This will again produce an HTML file for each of the R1 and R2 fastq-files, in the same folder as the fastq-files. You may then inspect these and compare them to those from step 0, to see if the quality has improved and how much data you have lost as paired-end reads (the unpaired reads are still there). You may also omit this inspection (comment it out) if you feel it is not needed.
As mentioned above, you typically run the steps down to here in one session, using 10 threads and 50GB memory. Once these steps are done for all samples, you comment out these steps and start the assembly (step 3).
This is the step that takes most computational resources. You should typically process all samples through the above steps of the pipeline before you come here and start the assembly.
There are basically two ways ahead from here: assembling each sample by itself, or co-assembling super-samples where reads from several samples have been merged. We describe how to do both below. In both cases you still use the mainscript.sh and array-jobs, but for co-assembly you need to do some extra processing, as we will see.
You typically assemble each sample by itself if the samples are quite different from each other. This is often the case if we have samples from outdoor environments. This is also the easiest and most straightforward way of doing the assembly.
Note: We always do a de-replication of the results at the
end (see script_5_MAGmaker.sh
below), and if the same
genome is re-constructed in different samples, this will then be
detected. Thus, a sample-by-sample assembly is always meaningful, even
when samples are very similar.
Again you simply edit the mainscript.sh
, and use one
array-job for each sample. Comment out the previous steps (that you
should have completed), and comment in step 3:
./script_3_assembly.sh $work_folder/fastq $file_stem $work_folder/assembly $threads
The first two arguments identify the fastq files with reads to assemble: the folder they are in and their file stem. The third argument is again the output folder, here we have chosen $work_folder/assembly. The last argument is the number of threads to use.
The code in script_3_assembly.sh
may be edited. It uses
metaspades
for assembly, and the amount of memory used is
by default 250GB. If you need more memory, you must change this inside
script_3_assembly.sh
in addition to editing the
mainscript.sh
heading (#SBATCH --mem=
).
This script will output one folder for each sample, inside the folder given as the third argument above. These folders all start with tmp_ followed by the file stem. You should delete these folders after the job is finished. They are not deleted by this pipeline, since you may want to inspect some of their results. But they occupy a lot of space (metaspades produces a lot of output!), and must be deleted as soon as you can.
The only result copied from each tmp_ folder is the fasta file with the final contigs. These are stored in the output folder, with the file stem followed by _contigs.fasta for each sample. NB! There should be one such contigs-file for each sample. If some are lacking, please inspect the corresponding log-file to see what went wrong. Assembly is a big job with many steps, and plenty of things could go wrong here. You may then need to re-start the assembly for these samples (edit the --array= argument in the mainscript.sh header, see above). Note that as long as you keep the tmp_ folder, metaspades may be re-started from the last checkpoint, see the spades manual.
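A quick way to check that all assemblies finished is to count the contigs-files (a minimal sketch, assuming work_folder=$SCRATCH/work and the output folder used above):
ls $SCRATCH/work/assembly/*_contigs.fasta | wc -l   # should equal the number of samples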
Note that when running the assembly, you need to edit the mainscript.sh header to increase the resources: use 20 threads (--ntasks=20, and threads=20 in the Settings) and at least 250GB of memory (--mem=250G), as explained above. Finish the assembly for all samples before you proceed.
In cases when some samples contain more or less the same organisms, it may be beneficial to merge the reads from several samples into super-samples. This will give you more coverage per genome, and may result in better contigs. Note that if these super-samples become too big, it may not be possible to assemble them on orion. It simply takes too much memory. It will also take a lot more time.
In order to do a co-assembly, your metadata_file
(the
excel-file, see above) must contain a column named
SuperSampleID
. This must contain text indicating which of
the original samples should be grouped into super-samples. In the
Tutorial here, the SampleID
and the
SuperSampleID
columns look like this:
SampleID SuperSampleID
sample_1 low_diversity
sample_2 low_diversity
sample_3 low_diversity
sample_4 high_diversity
sample_5 high_diversity
sample_6 high_diversity
This indicates that the 6 samples should be grouped into 2 super-samples, where the first 3 and the last 3 samples should be merged, and the assembly done for these 2 super-samples.
To create the super-samples, use the script
script_super_samples.sh
. This is not called from
mainscript.sh
! The reason is this is done once, not
sample-by-sample. In this script you need to edit the Settings part:
work_folder=work # copied from mainscript.sh
metadata_file_local=$work_folder/metadata_Tutorials.txt # copied from mainscript.sh
fastq_folder=$work_folder/fastq # copied from mainscript.sh step 1 or 2
cofastq_folder=$work_folder/cofastq # new folder with new fastq-files
metadata_file_coassembly=$work_folder/metadata_Tutorials_coassembly.txt # new metadata file
The setting of work_folder and metadata_file_local is identical to what you have in mainscript.sh, so copy from there. You also need to specify where the fastq-files to be merged are (fastq_folder); these are the files produced by step 1 or 2 above.
Next, you need to specify the folder to store the super-samples
(cofastq_folder
), i.e. the new fastq-files. This folder
will be created. Finally, you also need a new copy of the metadata file,
in metadata_file_coassembly
. The reason for the latter is
that here the SuperSampleID
is now taking over the role as
SampleID
, and we should store it under a new name to avoid
messing up the original. You may need to re-run the early steps later,
and then you need the original metadata file again.
You run this by
sbatch script_super_samples.sh
The entire merging is done in R. This may seem strange, but the file.append() function in R is super fast and can actually append compressed files. Avoiding the need to explicitly de-compress/compress saves a lot of time!
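For illustration, here is a minimal R sketch of the idea (hypothetical file names; the actual merging is done by script_super_samples.sh):
files <- c("fastq/sample_1_R1.fastq.gz", "fastq/sample_2_R1.fastq.gz", "fastq/sample_3_R1.fastq.gz")
target <- "cofastq/low_diversity_R1.fastq.gz"
dir.create("cofastq", showWarnings = FALSE)
file.copy(files[1], target)        # start the super-sample file
file.append(target, files[-1])     # concatenated gzip streams are still a valid gzip file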
Once you have the super-samples, you need to edit the mainscript.sh slightly before you proceed:
- The --array= setting in the mainscript.sh header must now match the number of super-samples, with one array job per super-sample.
- metadata_file_local must now be the new metadata-file created by script_super_samples.sh. Edit this in the Settings of mainscript.sh.
- The call to the step 3 script in mainscript.sh is identical to what we did for the sample-by-sample assembly above, just with (potentially) other folder names:
./script_3_assembly.sh $work_folder/cofastq $file_stem $work_folder/coassembly $threads
Notice we changed the folder name in the first argument, to use the reads in $work_folder/cofastq this time. This must be the same folder you used as cofastq_folder in script_super_samples.sh. We also output to a folder named $work_folder/coassembly here.
You may have to increase the memory from the default 250G to 500G or even more, if the super-samples are huge. Again, consider using sigma2 instead of orion if the super-samples are very large.
Notice that in the last step below, you now continue to work on the coassembly of the super-samples, with one array job for each super-sample.
mainscript.sh
This is the last part where you use the mainscript.sh
solution. This means you process everything sample-by-sample, or
super-sample-by-super-sample if you used co-assembly, see above.
Binning means clustering the contigs that are believed to come from the same genome. Thus, a bin is a collection of contigs, and is an estimate of a genome. The input to this step is the contigs from the assembly (step 3) and the filtered reads (from step 1 or 2).
You comment out all steps in the mainscript.sh
, except
for the last:
./script_4_binning.sh $work_folder/assembly $work_folder/fastq $file_stem $work_folder/binning $threads
First you specify the folder with the contigs, next the folder with all the reads, then the file stem, output folder and threads as before.
Notice that if we did a co-assembly as described above, we must use $work_folder/coassembly as the first argument and $work_folder/cofastq as the second.
For binning we use three different software tools:
- metabat2
- maxbin2
- concoct
This means we make three sets of bins from all contigs. Binning is a step where there is no clear ‘first choice’ software, and the results we get from different tools may sometimes vary a lot. This is why we use three tools here, and consider all their output.
Then, once these are done, we use the software DAS_Tool to make an ensemble binning. This essentially means we take all bins from the three tools above, and try to reduce them to one set of bins. A score for each bin is computed based on some reference single-copy genes. Bins are then ranked by this score, the best first. If the best bin is above a given score-threshold it is accepted as a final bin. Contigs belonging to this bin are then removed from all other bins, and all remaining bins are scored again. This selection is then repeated as long as the best remaining bin is above the score-threshold. Final bin files with _sub in their names have been reduced through this process, i.e. lost some contigs compared to the original binning. This process may reduce the contamination in overly large bins.
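To make the idea concrete, here is a conceptual R sketch of this greedy selection (not the actual DAS_Tool code; bins is assumed to be a list of contig-name vectors, and score() some scoring function based on single-copy genes):
select_bins <- function(bins, score, threshold) {
  selected <- list()
  repeat {
    s <- sapply(bins, score)                         # score all remaining bins
    if (length(s) == 0 || max(s) < threshold) break  # stop when the best bin is below the threshold
    best <- which.max(s)
    selected <- c(selected, bins[best])              # accept the best bin
    used <- bins[[best]]
    bins <- lapply(bins[-best], setdiff, y = used)   # remove its contigs from all other bins
    bins <- bins[sapply(bins, length) > 0]           # drop bins that became empty
  }
  selected
}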
If you inspect the script_4_binning.sh
you find the
Settings section:
maxbin_markerset=107 #40 # consider using 40 for samples with many 'new' taxa
concoct_Kmer_length=4
concoct_read_length=150
You may want to use the smaller (40) marker set for maxbin in cases where you have data with many ‘new’ taxa, e.g. environmental samples. The concoct settings you may also edit if your reads are longer or shorter than 150 bases.
Only contigs of at least 1000 bases are ever included in a bin; for metabat2 this threshold is actually hardcoded to 1500 bases.
The final output will be in the folder you give as the fourth
argument ($work_folder/binning
in the command line above).
Each sample will be processed in a subfolder under this named by its
file_stem
. After completion, you find the raw bins in this
subfolder, but the final bins from DAS_Tool
are all moved
to the output folder (e.g. $work_folder/binning/
). In the
subfolder you also find a summary table from DAS_Tool
that
lists some statistics for each final bin, in case you are interested in
this. All other files in this subfolder are deleted.
Upon completion the output folder will (hopefully) contain some fasta files with the bins. In the file names you find some numbers indicating the bin. Note that these numbers are just random labels, bin number 1 from one software does not necessarily correspond to bin 1 from the other.
Note that there may be zero bins! All three tools have some
requirements for making a bin, and the DAS_Tool
has a
score-threshold, and if the contigs are too poor there will be no
bins.
Note that this step does not require a lot of resources, and you
should go back to using 10 threads and only 50GB of memory, as in the
first steps above. It will, however, take some time since there are 7
different software tools involved here (seqkit
,
bowtie2
, samtools
, metabat2
,
maxbin2
, concoct
, DAS_Tool
) in
addition to some R code.
You may run all steps in the pipeline in one go, but in most cases we have found it convenient to split it into three separate parts:
- The read processing (steps 0, 1 and 2), started by sbatch mainscript.sh.
- The assembly (step 3). Edit mainscript.sh prior to starting this: comment out all except step 3, save, and sbatch again. Remember, never more than 6 array jobs at a time on orion!
- The binning (step 4), again editing mainscript.sh before you sbatch.
work_folder
The work_folder set in the mainscript.sh will fill up with a lot of folders and files. It is a good idea to clean this up to avoid filling your disk quota. Delete all the tmp_ folders under $work_folder/assembly/ (or $work_folder/coassembly/). In the binning folder you may also delete the files that are not bin-files, but most of this should already have been deleted by the binning script.
The files you always keep are:
- The fastq-files in $work_folder/fastq/ and/or $work_folder/cofastq/. You will need these below.
- The contigs in $work_folder/assembly/ (or $work_folder/coassembly/). These are smallish fasta-files.
- The bins in $work_folder/binning/. These are all fasta-files ending with .fasta.
The basic output from the steps above are the binned contigs found in
some folder ($work_folder/binning/
from
mainscript.sh
). From now on we work on the entire set of
bins, and there is no sample-by-sample processing anymore. Thus, we no
longer use the mainscript.sh
.
In the script script_5_MAGmaker.sh we now do the following processing of all bins:
- checkm2 (quality assessment of all bins)
- dRep (de-replication; bins meeting the quality thresholds become MAGs)
- gtdbtk (taxonomic classification of the MAGs)
script_5_MAGmaker.sh
You need to edit the Settings in this script:
work_folder=work
bins_folder=$work_folder/binning # folder with all bins from script_4_binning.sh
out_folder=$work_folder/MAGs # all output will end here
min_completeness=75 # minimum percent completeness required for MAG (default 75)
max_contamination=25 # maximum percent contamination required for MAG (default 25)
threads=10
You need of course to specify where your bins are, in bins_folder. The out_folder is where your MAG results will end up. Inside this folder three new folders will be created: one for checkm2, one for dRep, and one for gtdbtk.
The min_completeness
and max_contamination
thresholds are the requirements used to say that a bin qualifies to be a
MAG. The default for dRep
is 75% and 25%, respectively, but
you may change this here. Note that checkm2
will output
quality results for all bins, regardless of these thresholds, but the
final MAGs must meet these requirements.
The 10 threads is sufficient for this step, and the two last settings should never be changed as long as you are on orion.
You run this script in the usual way by:
sbatch script_5_MAGmaker.sh
NOTE: The script will by default allocate 75 gigabytes of memory:
#SBATCH --mem=75G # Adjust if oom kill
If you have many bins this may not be enough and your job will be killed by an Out-Of-Memory (OOM) error. Look for this in the log-file if it crashes, and then increase the memory allocation and re-start.
checkm2 output
We run checkm2 for quality assessment of all bins explicitly here. The output will be found in $out_folder/checkm/. The results from checkm2 are needed by dRep, and we need to run this first.
You may look into the file
$out_folder/checkm/quality_report.tsv
to inspect the
results for all bins. This is a tab-delimited text file, and easy to
read into R.
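For example, a minimal R sketch (the path assumes you read it from your work_folder with out_folder=MAGs, and the column names are the checkm2 defaults):
qc <- read.delim("MAGs/checkm/quality_report.tsv")
head(qc[, c("Name", "Completeness", "Contamination")])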
dRep output
The output from dRep is found in $out_folder/dRep/. The fasta-files with the actual MAGs are first stored in $out_folder/dRep/dereplicated_genomes/, but they will be moved up to $out_folder once dRep has finished. Thus, you find your MAGs in $out_folder. There may, of course, be zero MAGs if none of the bins meet the completeness and contamination criteria used!
In $out_folder/dRep/data_tables/ you find several comma-separated text files with table results that you can easily read into R. The file Cdb.csv gives you information about how the bins were clustered. Ideally, a ‘good’ bin should be ‘found’ by all software tools, and these will then cluster here. Note that this file lists results for all bins, not only those that meet the criteria to become MAGs.
The file Widb.csv lists all bins and some score given to them by dRep.
Note that the cluster labels have two integers,
e.g. 2_1
. The first number indicates a primary
cluster, i.e. clusters 2_1
and 2_2
are
distinct, but since they share the same primary number, they are quite
similar to each other. You may find such MAGs have very similar taxonomy
(see below).
Some of this information you also find in the final table
$out_folder/MAG_table.tsv
, see below.
gtdbtk output
We run gtdbtk on the MAGs to get some taxonomy for them. This uses the taxonomy of the GTDB (Genome Taxonomy Database), which is not identical to the NCBI Taxonomy.
From the output in the folder $out_folder/gtdbtk/ we are typically interested in the summary files. Look for the files gtdbtk.bac120.summary.tsv and gtdbtk.ar53.summary.tsv. There are separate files for bacteria and archaea, and one or both may be lacking if no such organisms were annotated. It is typically the first two columns of these files we are interested in:
- The MAG name (column user_genome), which is simply the fasta file name without the extension
- The taxonomy (column classification)
)The taxonomy is a string that may look like this:
"d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Sulfurovaceae;g__Sulfurovum;s__Sulfurovum sp000010345"
In this case the classification is all the way down to species, but it may of course be shorter.
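If you want the ranks as separate values in R, here is a minimal sketch (splitting on ';' and stripping the rank prefixes):
tax <- "d__Bacteria;p__Campylobacterota;c__Campylobacteria;o__Campylobacterales;f__Sulfurovaceae;g__Sulfurovum;s__Sulfurovum sp000010345"
ranks <- strsplit(tax, ";")[[1]]
setNames(sub("^[a-z]__", "", ranks),
         c("domain", "phylum", "class", "order", "family", "genus", "species"))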
After these software tools have been run, some R code produces a
summary table in the $out_folder/MAG_table.tsv
. This lists
one row for each MAG, and with columns listing some information for each
MAG obtained from the folders $out_folder/checkm/
,
$out_folder/dRep/
and $out_folder/gtdbtk/
. You
may of course add to this table by reading more of the output in these
folders.
Once some MAGs have been obtained it is natural to ask how abundant they are in the different samples. This is quantified by the coverage in each sample.
script_6_MAGcoverage.sh
The coverage for each MAG in each sample is computed by this script. The basic input you need here is the reads for all samples and the MAG fasta files. Then reads from each sample are mapped to the MAGs, and coverages are computed, one value for each MAG in each sample. This is output as a matrix, much like the read count matrices you obtain from 16S sequencing, but the numbers are coverages instead of read counts.
In the Settings of this script you recognize some folder names used
previously, i.e. $fastq_folder
should be the reads folder
(see mainscript.sh
) and the $MAGs_folder
should contain the MAGs (see script_5_MAGmaker.sh
). You
specify your $out_folder
for the results of this step.
This script must be run three times, where you comment in/out different parts of the script each time.
Comment in the PART 1 of the script, and comment out PART 2 and 3. Also, this part must be run once and only once, and in the heading of the script you comment out the array-jobs:
# #SBATCH --array=1-6 # one job per sample
Then you simply
sbatch script_6_MAGcoverage.sh
This first part just collects all contigs from all MAGs and makes a bowtie2 index of them. Since we use three different tools for binning, the same contig may be found in two or more bins, and this is sorted out here.
Comment in PART 2 and comment out PART 1 and 3. Here you must use array-jobs! You now use one array job for each sample, just as you did in mainscript.sh. The reads from each sample are then mapped to the same bowtie2 index, containing all contigs from all bins. This takes time, and this is why we split it up using array-jobs.
In the heading you comment in the use of array-job, like this:
#SBATCH --array=1-6 # one job per sample
Remember to edit the number of samples; here the code assumes 6 samples. Once done, sbatch as before.
The log-files produced by this script are important! This is why this script will always output the log-files to a folder named coverage_log inside the folder where you have your scripts. Never change this! The reason is that the total number of reads mapped is written by bowtie2 to these log files. This information is collected by this script and appears in the final output, see below.
Once all the array-jobs are done, comment out PART 1 and 2 and comment in PART 3. Again you comment out the use of array-jobs, this part should be run once and only once, just like PART 1.
In this part the script reads all results so far, and outputs two text files to the $out_folder:
- MAG_coverages.tsv, which is a tab-delimited text file. The first column contains a MAG_id text, and the remaining columns, one for each sample, contain the coverages in the various samples. Read it into R like any other tab-delimited file.
- MAG_mapping_fraction.tsv, which is another tab-delimited file with one value for each sample. This indicates how large a fraction of the reads for that sample mapped to the contigs.

The coverages in MAG_coverages.tsv you may now use directly as a measure of ‘abundance’ for each MAG in each sample. You should transform them into relative values, i.e. such that all values for a sample (column) sum to 1.0.
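A minimal R sketch of this transformation (the file path is an assumption, point it to your own out_folder):
cov <- read.delim("coverage/MAG_coverages.tsv", row.names = 1)
rel <- sweep(cov, 2, colSums(cov), "/")   # each sample (column) now sums to 1.0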
What is the point of the MAG_mapping_fraction.tsv? Well, the MAGs are in general not the only genomes in your samples, and only some fraction of the total amount of reads will map to the MAGs. Thus, the coverages do not reflect the total abundance of your MAGs. The relative values you compute from MAG_coverages.tsv reflect relative abundance among the MAGs. If you want to say something about how abundant the MAGs are in the sample as a whole, you must multiply the relative abundances by the mapping fractions. As an example, say a MAG has a relative abundance of 0.2 in some sample, based on the coverages (it has 20% of the sum of coverages for that sample). But for this sample the mapping fraction is 0.5. Then this MAG actually has an abundance of 0.1 (0.2 * 0.5) in the total sample.
Actually, this simple adjustment assumes that the reads we did not map come from genomes of the same length as our MAGs, i.e. if 50% of the reads map this corresponds to 50% of the coverage. This is in most cases an OK assumption to make, but you realize there are some uncertainties here.
This last script is used for getting some functional annotation of the MAGs. We use the software DRAM for this, and this is the only step where we use conda instead of apptainer containers. The reason is the (silly?) way that the location of the databases is implemented in DRAM.
script_7_MAGannotations.sh
As for the coverages, this script must be run several times, or actually just two times. The first part uses array-jobs, while the last part is run once all array-jobs are done.
Comment in PART 1 and comment out PART 2.
In PART 1 we use array-jobs, one job per MAG. Notice this: one per MAG, not one per sample! Make certain the heading specifies this:
#SBATCH --array=1-101%10 # one job per MAG
Here it is obviously 101 MAGs, edit this accordingly.
The main job in the annotation is to predict genes for each MAG and then search with these genes against some databases with known functional genes. This takes a long time. In DRAM you can enter a series of input MAGs, but they will then be treated sequentially, i.e. one by one. This is slow. We therefore instead split this into array-jobs, and annotate one MAG in each job, in parallel.
The output from this part you will find in folders
$out_folder/DRAM1
, $out_folder/DRAM2
,
$out_folder/DRAM3
…etc.
Comment out PART 1 and comment in PART 2. Also comment out the array-jobs in the heading, this part should be run once and only once.
This part of the script simply collects all the results from the $out_folder/DRAM1, $out_folder/DRAM2, $out_folder/DRAM3 … folders and writes them to some files in the $out_folder. These files are the same as you would get if you ran DRAM sequentially. Finally, the distill step of DRAM is run on all these results, and the output goes to the folder $out_folder/distills/. In here you find some files with annotation information, see the DRAM manual for more details. Also, the file $out_folder/annotations.tsv is a tab-delimited file that contains all annotation results in case you want to explore all details, not only the distilled results.