Here are some ‘pipelines’ and recommendations concerning the initial processing of Illumina data from amplicon sequencing. By amplicon sequencing we refer to data where a marker gene has been amplified, and then sequenced, typically the 16S rRNA gene.
The scripts and data we refer to here can be copied from the folder
/mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/
when
you are on orion.
You must always copy the code to your own folder first, and then you may edit. Think of the code here as a template that you give your personal touch once you have your own copy.
It is also recommended that you copy all the
apptainer containers you use (from
/mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/
) to
your own project folder, and edit all paths to use these local copies.
In this way your code is reproducible forever after. The container
versions in
/mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/
will
change over time, and we do not store older versions.
With Illumina paired-end sequencing the first step we do is typically a de-multiplexing. At MiDiv we add some extra barcode sequences to the R1 reads in order to have more samples per sequencing run. De-multiplexing means we first split the reads into different files according to this barcode information.
Here is a short video illustrating the steps explained below.
The MIDIV_DEMULTIPLEX.sh script
This is the shell script you need. Move to the folder where you want to do the analysis, and copy this script to that folder. Here is how you do it on orion:
cp /mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/MIDIV_DEMULTIPLEX.sh .
Open the script in RStudio. The first lines of code (after the SLURM heading and introductory text) should look something like this:
rawdata_folder="/mnt/smb/KBM_MIDIVlab/Rawdata/Amplicon_sequencing/Illumina/20220609_ampliconTutorial/"
metadata_file=$rawdata_folder/metadata_Tutorials.xlsx
fastq_folder=SRA_ready
tmp_folder=tmp
vsearch_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/vsearch:2.28.1--h6a68c12_0.sif"
fastqc_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/fastqc:0.12.1--hdfd78af_0.sif"
threads=10
The lines where you need to edit are those under the ### Settings in the script above. The rawdata_folder must contain the proper path to the raw data folder with the raw fastq files you are going to process. This is typically found under /mnt/smb/KBM_MIDIVlab/Rawdata/Amplicon_sequencing/Illumina/. The name of this folder is what we refer to as the SequencingRunID. In the template script above this is 20220609_ampliconTutorial, and you replace this with your folder name.
Inside this folder, there must be an Excel file (with .xlsx extension) containing the metadata. This is a spreadsheet with one row for each sample, and a number of columns. We have a rather specific format for this file; see /mnt/smb/KBM_MIDIVlab/Rawdata/ for a template file with more information on this. In metadata_file you enter the name of this file, which is then expected to be found in the rawdata_folder.
The output will be in the folder you specify in
fastq_folder
. This folder will be created if it does not
already exist. The tmp_folder
is just used for some
temporary output, and should be deleted by you after you are done.
Finally, it is specified how to run the vsearch and fastqc software on this system. Both are used at the end of this script to make some assessment of the reads. The text in vsearch_app specifies how to run the vsearch software on this system, and similarly for fastqc_app. Here we use apptainer containers. If you have installed these software tools locally (on your own PC), you just replace this text with the path to their executables. You may omit the use of these tools; they are not part of the de-multiplexing itself. Just comment out the corresponding parts at the end of this script.
The demultiplexing itself is done by some R code:
library(tidyverse)
library(readxl)
library(midiv)
args <- commandArgs(trailingOnly = T)
metadata.file <- args[1]
rawdata.folder <- args[2]
fastq.folder <- args[3]
# Read the metadata and construct the fastq file names for each sample
metadata.tbl <- read_excel(metadata.file) %>%
  mutate(filename = str_c(ProjectID, '_', SequencingRunID, '_', SampleID, '_R1.fastq.gz')) %>%
  mutate(filename2 = str_c(ProjectID, '_', SequencingRunID, '_', SampleID, '_R2.fastq.gz'))
# De-multiplex the raw fastq files into one file pair per sample
dmltplx.tbl <- demultiplex(metadata.tbl,
                           in.folder = rawdata.folder,
                           out.folder = fastq.folder)
# Add the per-sample results and write the metadata as a tab-delimited text file
metadata.tbl <- full_join(metadata.tbl, dmltplx.tbl, by = c('ProjectID', 'SequencingRunID', 'SampleID'))
write_delim(metadata.tbl, delim = '\t', file = str_replace(basename(metadata.file), 'xlsx', 'txt'))
In the script this is given as input to Rscript -e, with three input arguments specifying the metadata file, the rawdata folder and the output fastq folder.
Notice you need some R packages. The tidyverse
,
readxl
and microseq
you install from CRAN in
the standard way. The last one, midiv
, is our own small
package. This is only found on GitHub, and you install it by running
if(!require(devtools)) install.packages("devtools")
devtools::install_github("larssnip/midiv")
directly in the Console window in RStudio. NB! This R package changes from time to time; make certain you have the latest version by repeating the above installation before you start.
NOTE 1: To read compressed fastq/fasta files
(readFastq()
and readFasta()
) you also need
the R-package R.utils
found at CRAN.
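As a tiny example, this is how you might read one of the compressed, de-multiplexed fastq files in R; the file name here is hypothetical, following the naming scheme used by the de-multiplexing:
# Minimal sketch; the file name is hypothetical and follows the
# ProjectID_SequencingRunID_SampleID_R1.fastq.gz naming scheme
library(microseq)    # readFastq() needs R.utils installed to read .gz files
reads <- readFastq("SRA_ready/PRJ1_20220609_ampliconTutorial_S01_R1.fastq.gz")
head(reads)          # a table with Header, Sequence and Quality columns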
NOTE 2: Make certain you have the midiv
R package
version 2.2.0 or later, since this is the version where primers are
trimmed during de-multiplexing. We strongly recommend this, since this
will make the resulting fastq-files ready for upload to SRA later (hence
the name SRA_ready
for the output folder).
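You can check which version you have installed directly in the R Console:
packageVersion("midiv")              # should print 2.2.0 or later
packageVersion("midiv") >= "2.2.0"   # TRUE if primer trimming is included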
To submit it to the SLURM queue you simply
sbatch MIDIV_DEMULTIPLEX.sh
in your Terminal and the job should appear listed in the queue under
the name demultiplex
.
The script is set up to use 10 threads and 50GB memory. You hardly need to change this, regardless of data set size, since more files will just take longer time, not more memory.
The folder you specified as fastq_folder will be gradually filled up with new fastq files. These will get names concatenating the ProjectID, SequencingRunID and SampleID, separated by "_", for each sample in the metadata.file. There will be 2 fastq files for each sample (R1 and R2 reads).
In addition, the fastq_folder
is also filled by reports
from the fastqc
software, one HTML-file for each fastq
file. Inspect these to see the lengths and quality of the reads in the
various de-multiplexed files. You may then delete these HTML-files.
After the de-multiplexing, the script will also output a plot, as a PDF-file in the work folder. This gives you some information about read and amplicon lengths you may need later. Here is an example:
The two upper panels show histograms of the length of the merged reads and the length of the overlapping regions. The lower panel shows how the 10% quantile of the read quality scores distributes across the last bases of both R1 and R2 reads. You typically want to trim the 3’ ends of the reads. Here you can see how much you may trim (upper panels) and how much bad quality you remove by trimming. Note that you typically trim the R2 reads more than the R1 reads, since they typically have poorer quality.
Finally, the metadata table gets a new column called
n_readpairs
, containing the number of read pairs for each
sample. This table is then written to a local text file having the same
name as the original, but replacing the extension .xlsx
with .txt
. This text file is what you use as input to the
downstream pipelines.
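If you want a quick look at this table in R, something like this should work (here assuming the tutorial metadata, giving the text file metadata_Tutorials.txt):
# Small sketch, assuming the metadata text file produced for the tutorial data
library(tidyverse)
metadata.tbl <- read_delim("metadata_Tutorials.txt", delim = "\t")
metadata.tbl %>% select(SampleID, n_readpairs)   # read pairs per sample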
In the folder
/mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/deprecated/
you find an R script version of the shell script above
(MIDIV_DEMULTIPLEX.R
). This is no longer updated on orion
due to problems running apptainer
from inside R (which is
also an apptainer
container). If you want to do the
de-multiplexing on a Windows computer you may copy and edit this R
script, but you are on your own! We make no guarantees on how it will
work.
After de-multiplexing we process the reads for each sample in order to find the OTUs (centroid sequences) and their read counts.
There are several tools for doing this job. We first describe the
VSEARCH
tool, and then some alternatives to this in later
sections. The VSEARCH
software is open source and free,
developed mainly by people in Oslo. You may install this on your own
(Windows) computer as well, if you like. Below we focus on how to use
this on orion.
Here is a short video illustrating the steps explained below.
PIPELINE_VSEARCH.sh
The VSEARCH
pipeline contains two slightly different
procedures (identity clustering and denoising) for finding the centroids
and their corresponding read counts, and both are used here, producing
two sets of outputs.
The shell script you need is found in
/mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/PIPELINE_VSEARCH.sh
.
Copy this in the same way you copied the de-multiplexing script above.
Open it in RStudio. After the SLURM heading and the introductory text,
you should find the section
##############
### Settings
###
vsearch_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/vsearch:2.28.1--h6a68c12_0.sif"
metadata_file=metadata_Tutorials.txt # The sample table
fastq_folder=SRA_ready # Folder with (input) fastq files
tmp_folder=tmp_vsearch # Temporary output folder
threads=10 # Must match --ntasks in heading
trim_right_R1=20 # Trim off 3' ends before merging
trim_right_R2=60 # Trim off 3' ends before merging
min_read_length=200 # Minimum length of R1 or R2 read after trimming, discard those shorter
max_error_probability=0.01 # Average error prob tolerated, P=0.01 corresponds to Q=20
min_size=2 # Minimum copy number of centroid sequence
OTU_identity=0.97 # identity threshold
unoise_alpha=2.0 # UNOISE model parameter
This is where you may edit.
The vsearch_app
is the path to the executable. Here we
use an apptainer
container on orion.
The metadata_file must show the path and name of the text version of the metadata. This should have been produced by the de-multiplexing script, see above. Also, the fastq_folder is the same as described for the de-multiplexing. This should now contain 2 fastq files for each sample.
The tmp_folder
is just a folder to store temporary files
during processing. This folder should be deleted in the end, when
everything is done. It may be nice to inspect its content if something
crashes, or to see that files are created/deleted during processing.
The threads is the number of threads used by the software, and can safely be kept at 10 on orion.
The trim_right_R1 and trim_right_R2 set how many bases to trim off the 3’ end of R1 and R2 reads before merging. During de-multiplexing you should get the file trimming_guide.jpg. From this you know approximately how much the R1 and R2 reads overlap. If the overlap is much longer than 25, you can afford to trim off some of the ends, and still have a long overlap. The reason we trim is that reads are always of poorest quality towards the end. By trimming this away, more reads tend to merge. Note that trimming too hard may result in many merged reads, but also many badly merged reads. This is something you typically edit!
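As a quick sanity check you can do the arithmetic yourself: every base you trim off R1 or R2 shortens the overlap by one base. A small sketch, where the overlap value is made up and should be replaced by what you read from the trimming guide:
# Rough check of remaining overlap after trimming (made-up overlap value)
overlap_before <- 120    # approximate overlap read from the upper panels
trim_right_R1  <- 20
trim_right_R2  <- 60
overlap_after  <- overlap_before - trim_right_R1 - trim_right_R2
overlap_after            # 40, still well above 25, so this trimming is fine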
The max_error_probability is a threshold for read quality filtering. Reads with average error probability above this are discarded. A quality score of 20 corresponds to an error probability of 0.01, and this seems like a natural choice. Set this lower to filter harder, or higher to filter more leniently. This replaces the maxEE parameter used in dada2 (which is read length dependent). You rarely edit this.
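The relation between a quality score Q and the error probability P is P = 10^(-Q/10), which you can verify in R:
# Phred quality score Q versus error probability P
Q <- 20
P <- 10^(-Q / 10)           # 0.01, the default max_error_probability
Q.back <- -10 * log10(P)    # 20 again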
The min_size
is the minimum number of copies a centroid sequence can have. This should be 2 or more. Singletons
(reads only seen once) cannot be trusted. In fact, the UNOISE procedure
(see below) implicitly sets this to 8! Note that this is for finding the
centroids only. All reads may be assigned to an OTU in the end. You
rarely edit this.
The OTU_identity
is primarily used for assigning reads
in the end. Regardless of how the centroids were found (clustering at
0.97 or denoising) the final assignment of reads to centroids use this
identity threshold. This you may very well edit.
The unoise_alpha
is the \(\alpha\) parameter in the UNOISE denoising
algorithm. This adjusts how steeply the copy number should drop for ‘children’ that belong to a centroid ‘mother’ sequence. See the UNOISE
paper for the details on this. You rarely edit this.
Edit this part of the shell script, and save it. Run it by:
sbatch PIPELINE_VSEARCH.sh
directly from the Terminal.
The very last step in PIPELINE_VSEARCH.sh
is a call to R
in order to update the metadata file. This requires the
midiv
R package again, see the section on
MIDIV_DEMULTIPLEX.sh
above.
First, the data for each sample is trimmed, merged and filtered. Note
that no primer-removal is done here now! This is different from earlier
versions. The primer-removal should now be part of the de-multiplexing
step. The results are found in fasta files, one for each sample, named by the SampleID and with extension .fasta. These are found in the tmp_folder.
Next, all reads from all samples are collected in one big file
(all.fasta
), and then all unique sequences with minimum
min_size
copies are found
(all_derep_minsize.fasta
). These files are also in
tmp_folder
. All these files may be deleted once the
pipeline is finished.
Next, the standard OTU clustering at 97% identity is done, including chimera filtering, as well as the UNOISE denoising algorithm.
Finally some R code is run to update the metadata table.
From the 97% identity (or whatever identity you choose) clustering we get the two files:
centroids_vsearch_OTU.fasta
readcounts_vsearch_OTU.tsv
From the denoising, the corresponding files are named
centroids_vsearch_ZOTU.fasta
readcounts_vsearch_ZOTU.tsv
since the centroids are often referred to as ZOTUs in this case.
In addition, the metadata table is updated with some new columns:
- n_vsearch_merged lists the number of merged reads for each sample
- n_vsearch_reads_OTU is the final number of reads assigned to the 97% OTUs for each sample
- n_vsearch_reads_ZOTU is the final number of reads assigned to the Z-OTUs for each sample
The updated metadata table is stored as a
metadata.tbl.RData
file in the tmp_folder
. It
does not overwrite the original metadata text file!
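If you want to inspect the updated table in R, you can load this file; a small sketch, assuming the saved object is named metadata.tbl and the temporary folder is the tmp_vsearch from the Settings above:
# Load the updated metadata table (the object name metadata.tbl is an assumption)
library(tidyverse)
load(file.path("tmp_vsearch", "metadata.tbl.RData"))
metadata.tbl %>%
  select(SampleID, n_vsearch_merged, n_vsearch_reads_OTU, n_vsearch_reads_ZOTU)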
Note that in both the OTU
and ZOTU
result
files, the OTUs inside are always named OTU1
,
OTU2
,…etc regardless of which file you look into. Thus,
even the ZOTU
results use the term OTU
. This
is to make life easier when it comes to data analysis later. We always
tag the OTUs/ZOTUs/ASVs or whatever people call them as OTUs!
In the folder
/mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/deprecated/
you find the R script VSEARCH_PIPELINE.R
for running the
VSEARCH
pipeline from R. In case you want to run
VSEARCH
locally on a Windows computer, this is a way to
proceed. We no longer update this, and it will probably not work
properly unless you edit parts of it.
In the same folder you will also find the R script
USEARCH_PIPELINE.R
. This is very similar to the VSEARCH pipeline, but uses the ‘original’ USEARCH
software instead. Again, this is no longer updated on orion.
The dada2 package is a popular alternative for processing
amplicon data. It runs entirely in R, and you need the
dada2
R package installed. This you find at Bioconductor,
see http://www.bioconductor.org/packages/release/bioc/html/dada2.html.
Here is also the command needed to install this package:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("dada2")
Run this directly in the Console window of RStudio. You should probably run the newest available version of R to make this pain free.
Copy the file
/mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/PIPELINE_DADA2.R
to your working folder. This contains all the R code for running the
dada2
pipeline on the same input as using
VSEARCH
above. On orion, you should also copy the shell
script run_dada2.sh
.
Here is a short video illustrating the steps explained below.
The settings for dada2
are slightly different from the
VSEARCH
above, but we have tried to make them as similar as
possible:
###############
### Settings
###
metadata.file <- "metadata_Tutorials.txt" # File produced by MIDIV_DEMULTIPLEX.sh
fastq.folder <- "SRA_ready" # Folder with de-multiplexed fastq-files, one pair for each sample
tmp.folder <- "tmp_dada2" # Folder for temporary output, delete this after you are done
trim.right <- c(20,60) # Trim off 3' end before merging
max.error.probability <- c(0.01, 0.02) # Average error prob tolerated, P=0.01 corresponds to Q = 20
threads <- 10 # Must be the same as --ntasks in run_dada2.sh. On Windows set to FALSE
The folders are as for VSEARCH
.
The trim.right is here a vector with 2 values; for VSEARCH these were two separate variables. The first value is how much to trim the R1 reads, the second how much to trim the R2 reads.
The max.error.probability values you specify here are not used directly by dada2, but we ‘translate’ them into the corresponding parameter (maxEE) inside the script. The maxEE is simply these probabilities times the read lengths. Again, the first value is for R1, the second for R2 reads.
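As an illustration of this translation (the read lengths here are just examples; your lengths after trimming will differ):
# maxEE = average error probability times read length (example lengths only)
max.error.probability <- c(0.01, 0.02)   # R1, R2
read.length <- c(280, 240)               # hypothetical read lengths after trimming
maxEE <- max.error.probability * read.length
maxEE                                    # 2.8 for R1, 4.8 for R2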
Notice we have a larger threshold for the R2 reads here! In theory
both values should be the same (e.g. 0.01) but we have found this may be
too restrictive for R2 reads, hence the larger (0.02) value here. A
major difference between dada2
and VSEARCH
is
that here the R1 and R2 reads are filtered and denoised separately, and
towards the end of the procedure they are merged. Since R2 reads are
usually of poorer quality than R1 reads (and merged reads) this may
require a ‘softer’ quality threshold for the R2 reads in order to not
lose too many read pairs.
Note also that no primer-removal is done here now. This should now be
done in the de-multiplexing step, make certain you used the
midiv
R package version 2.2.0 or later when you
demultiplexed your data above.
The dada2
procedure is slow, and some steps take a very
long time. For this reason, we store many intermediate results here, in
the tmp.folder
, such that you may go back and re-compute
parts without repeating all steps.
The estimation of the parameters in the error models is very
time-consuming. The same is the case for the denoising itself. Thus, if
you should want to go back and look at the results of these steps in
more detail, the pipeline stores these intermediate results in the files
err.obj_R1.RData
and dada.obj_R1.RData
in the
tmp.folder
, and similar for R2 reads. If you want to plot
the results of the error modelling, you run the following R code:
load(file.path(tmp.folder, "err.obj_R1.RData"))
plotErrors(err.obj_R1)
load(file.path(tmp.folder, "err.obj_R2.RData"))
plotErrors(err.obj_R2)
where tmp.folder
has the content you specified in the
script. Thus, you should not immediately delete the
tmp.folder
in case you would like to inspect these
results.
The script PIPELINE_DADA2.R
may be run directly in
RStudio, but it is very slow and when on orion we recommend a standard
sbatch
to SLURM. Copy the shell script
/mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/run_dada2.sh
to your working folder. It should look something like:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --mem=50G # Larger data sets may need more
#SBATCH --partition=smallmem,hugemem
#SBATCH --time=24:00:00 # Larger data sets may need more
#SBATCH --job-name=dada2
#SBATCH --output=dada2_%j.log
module load R/4.3.1
Rscript PIPELINE_DADA2.R
Note that in the script you set the number of threads. This should, as always, correspond to the --ntasks set in the shell script heading. Since dada2 is so slow, you may very well need to extend the wall time to more than 24 hours, as set here. Also, the memory allocation of 50GB may be too small for very large data sets.
If you run this on your own Windows PC (no shell script), the
multithreading no longer works properly, according to the
dada2
manual. In this case, set
threads <- FALSE
instead of an integer. Needless to say, this runs even slower than on orion with 10 threads.
You now run the pipeline by
sbatch run_dada2.sh
which will in turn invoke the R script.
The output is again a set of centroids and a readcount table. There
is only one set of each, and there is no need to add either
OTU
or ZOTU
to the file names. The
dada2
people like to call the clusters ASV
(instead of OTU
or ZOTU
), but in our output
they are all named OTU
s. Again, this is to make life easier
when we come to the data analysis. An OTU
is an
OTU
is an OTU
regardless of procedure used
here!
Also, the metadata file is updated with some per-sample statistics as
for the other pipelines. This is also stored as
metadata.tbl.RData
in the tmp.folder
.
A third alternative for finding the OTUs is a clustering with SWARM. This is also some kind of denoising, clustering sequences that only differ in 1 base. To learn the actual theory, read
the published article.
The SWARM software is not purpose-made for clustering of 16S amplicon data, and for this reason we need
- VSEARCH for the initial trimming/filtering/merging of reads
- SWARM to cluster sequences
Here is a short video illustrating the steps explained below.
Copy
/mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/PIPELINE_SWARM.sh
to your work folder and edit. The Settings look like this:
##############
### Settings
###
swarm_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/swarm:3.1.3--h9f5acd7_1.sif"
vsearch_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/vsearch:2.28.1--h6a68c12_0.sif"
metadata_file=metadata_Tutorials.txt # The sample table
fastq_folder=SRA_ready # Folder with (input) fastq files
tmp_folder=tmp_swarm # Temporary output folder
threads=10 # Must match --ntasks in heading
trim_right_R1=20 # Trim off 3' ends before merging
trim_right_R2=60 # Trim off 3' ends before merging
max_error_probability=0.01 # Average error prob tolerated, P=0.01 corresponds to Q=20
min_size=2 # Minimum centroid copy number
module load R/4.3.3
Notice we need both VSEARCH
and SWARM
, both
as containers here.
The variables are the same as for the VSEARCH
pipeline
explained above.
Notice R is used a lot in this script, and you need the packages
tidyverse
, microseq
and magrittr
.
Make certain you have installed these.
The variable min_size
must be mentioned. This will now
set the minimum copy number threshold for any centroid, just as in
VSEARCH
. The SWARM
does not currently have
this implemented, and in this script we use R to ‘prune’ the centroids
for this requirement. The SWARM still has a tendency to produce too many clusters, and you may set the min_size to larger values to reduce this.
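As an illustration of what such pruning amounts to, here is a minimal sketch in R, assuming the centroid headers carry vsearch/usearch-style ';size=N' abundance annotations; the actual code in the script may differ, and the output file name below is just an example:
# Minimal sketch of pruning centroids by copy number (assumes headers like "OTU1;size=57")
library(tidyverse)
library(microseq)
min_size <- 2
centroids.tbl <- readFasta("centroids_swarm.fasta") %>%
  mutate(size = as.numeric(str_extract(Header, "(?<=size=)[0-9]+"))) %>%
  filter(size >= min_size) %>%
  select(Header, Sequence)
writeFasta(centroids.tbl, "centroids_swarm_pruned.fasta")   # example output name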
Note there is no primer-removal here now. This should now be part of
the de-multiplexing step above. Make certain you used midiv
R package 2.2.0 or later when you did the de-multiplexing.
First, the reads for all samples are trimmed, filtered, merged and
de-replicated just as for VSEARCH
. The final de-replication
is, however, different. The min_size
is not used, and all
reads, including singletons, are kept. This is needed for the
SWARM
clustering.
Then the clusters are found using SWARM
, and this
produces a first version of the centroids_swarm.fasta
output file. This also outputs the file uclust.txt
in the
tmp_folder
. This is essential for pruning the centroids and
computing the read count table below.
The SWARM
tends to find a lot of clusters! We
have, however, found that clusters where the centroid is a singleton should be discarded, regardless of how many members the cluster has. This is currently implemented in the script as a chunk of R code. However, we are in dialog with the makers of SWARM in order to implement this into the software itself. This would remove the need for this R code, and speed things up considerably.
After pruning the centroids_swarm.fasta, chimera detection using VSEARCH is performed. This results in the
final version of centroids_swarm.fasta
.
Finally, the read count table is produced, again using a chunk of R
code. This outputs the readcount_swarm.tsv
file with the
read count table, similar to the other pipelines.
Also, similar statistics on the number of reads at various steps
in the procedure are added to the metadata table, and stored in
metadata.tbl.RData
in the tmp_folder
as for
the other pipelines.
The output from all the pipelines above is similar:
- a fasta file with the centroid sequences
- a read count table (.tsv text file).
The next step in processing these data is typically to do a taxonomic classification of the OTUs based on the centroid sequences. We have solutions for this, described in this document.
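For the downstream analysis in R, the centroids and the read count table from any of the pipelines can be read in, for example like this (here using the file names from the VSEARCH OTU output above):
# Read the centroids and the read count table for downstream analysis
library(tidyverse)
library(microseq)
centroids.tbl <- readFasta("centroids_vsearch_OTU.fasta")
readcounts.tbl <- read_delim("readcounts_vsearch_OTU.tsv", delim = "\t")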