1 Introduction

Here are some ‘pipelines’ and recommendations concerning the initial processing of Illumina data from amplicon sequencing. By amplicon sequencing we refer to data where a marker gene has been amplified, and then sequenced, typically the 16S rRNA gene.

The scripts and data we refer to here can be copied from the folder /mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/ when you are on orion.

You must always copy the code to your own folder first, and then you may edit. Think of the code here as a template that you give your personal touch once you have your own copy.

It is also recommended that you copy all the apptainer containers you use (from /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/) to your own project folder, and edit all paths to use these local copies. In this way your code is reproducible forever after. The container versions in /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/ will change over time, and we do not store older versions.


2 Illumina paired-end de-multiplexing

From Illumina paired-end sequencing, the first step we do is typically a de-multiplexing. At MiDiv we add some extra barcode sequences to the R1 reads in order to have more samples per sequencing run. De-multiplexing means we first split the reads into different files according to this barcode information.

Here is a short video illustrating the steps explained below.

2.1 The MIDIV_DEMULTIPLEX.sh script

This is the shell script you need. Move to the folder where you want to do the analysis, and copy this script to that folder. Here is how you do it on orion:

cp /mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/MIDIV_DEMULTIPLEX.sh .

Open the script in RStudio. The first lines of code (after the SLURM heading and introductory text) should look something like this:

rawdata_folder="/mnt/smb/KBM_MIDIVlab/Rawdata/Amplicon_sequencing/Illumina/20220609_ampliconTutorial/"
metadata_file=$rawdata_folder/metadata_Tutorials.xlsx
fastq_folder=SRA_ready
tmp_folder=tmp
vsearch_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/vsearch:2.28.1--h6a68c12_0.sif"
fastqc_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/fastqc:0.12.1--hdfd78af_0.sif"
threads=10

The lines where you need to edit are those under ### Settings in the script above. The rawdata_folder must contain the proper path to the raw data folder with the raw fastq files you are going to process. This is typically found under /mnt/smb/KBM_MIDIVlab/Rawdata/Amplicon_sequencing/Illumina/. The name of this folder is what we refer to as the SequencingRunID. In the template script above this is 20220609_ampliconTutorial; replace this with your folder name.

Inside this folder, there must be an excel file (.xlsx extension) containing the metadata. This is a spreadsheet with one row for each sample, and a number of columns. We have a rather specific format for this file; see /mnt/smb/KBM_MIDIVlab/Rawdata/ for a template file with more information. In metadata_file you enter the name of this file, which is then expected to be found in the rawdata_folder.
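
If you want a quick sanity check before running the script: the de-multiplexing R code shown further below expects at least the columns ProjectID, SequencingRunID and SampleID. A minimal check in the Console could look like this (adjust the file path to your own metadata file):

library(readxl)
metadata.tbl <- read_excel("metadata_Tutorials.xlsx")       # use the full path to your own metadata file
setdiff(c("ProjectID", "SequencingRunID", "SampleID"),
        colnames(metadata.tbl))                             # should return character(0)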

The output will be in the folder you specify in fastq_folder. This folder will be created if it does not already exist. The tmp_folder is just used for some temporary output, and should be deleted by you after you are done.

Finally, it is specified how to run the software vsearch and fastqc on this system. Both are used at the end of this script to make some assessment of the reads. The text in vsearch_app specifies how to run the vsearch software on this system, and similarly for fastqc_app. Here we use apptainer containers. If you have installed these software tools locally (on your own PC), you just replace this text with the path to their executables. You may also omit the use of these tools, since they are not part of the de-multiplexing itself; just comment out the corresponding parts at the end of this script.

The demultiplexing itself is done by some R code:

library(tidyverse)
library(readxl)
library(midiv)                           # our own package, installed from GitHub (see below)
args <- commandArgs(trailingOnly = TRUE)
metadata.file <- args[1]                 # the excel metadata file
rawdata.folder <- args[2]                # the folder with the raw fastq files
fastq.folder <- args[3]                  # the output folder for the de-multiplexed fastq files
metadata.tbl <- read_excel(metadata.file) %>%
  mutate(filename = str_c(ProjectID, '_', SequencingRunID, '_', SampleID, '_R1.fastq.gz')) %>%
  mutate(filename2 = str_c(ProjectID, '_', SequencingRunID, '_', SampleID, '_R2.fastq.gz'))
dmltplx.tbl <- demultiplex(metadata.tbl,
                           in.folder = rawdata.folder,
                           out.folder = fastq.folder)
metadata.tbl <- full_join(metadata.tbl, dmltplx.tbl, by = c('ProjectID', 'SequencingRunID', 'SampleID'))
write_delim(metadata.tbl, delim = '\\t', file = str_replace(basename(metadata.file), 'xlsx', 'txt'))

In the script this code is given as input to Rscript -e, with three input arguments specifying the metadata file, the rawdata folder and the output fastq folder.

Notice you need some R packages. The tidyverse, readxl and microseq packages you install from CRAN in the standard way. The last one, midiv, is our own small package. It is only found on GitHub, and you install it by running

if(!require(devtools)) install.packages("devtools")
devtools::install_github("larssnip/midiv")

directly in the Console window in RStudio. NB! This R package changes from time to time; make certain you have the latest version by repeating the above installation before you start.

NOTE 1: To read compressed fastq/fasta files (readFastq() and readFasta()) you also need the R-package R.utils found at CRAN.

NOTE 2: Make certain you have the midiv R package version 2.2.0 or later, since this is the version where primers are trimmed during de-multiplexing. We strongly recommend this, since this will make the resulting fastq-files ready for upload to SRA later (hence the name SRA_ready for the output folder).
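
A quick way to check this, directly in the Console, could look like this:

# Re-install from GitHub if the installed midiv package is older than 2.2.0
if (packageVersion("midiv") < "2.2.0") {
  devtools::install_github("larssnip/midiv")
}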

2.2 Running the script

To load it into the SLURM queue you simply run

sbatch MIDIV_DEMULTIPLEX.sh

in your Terminal and the job should appear listed in the queue under the name demultiplex.

The script is set up to use 10 threads and 50GB memory. You hardly need to change this, regardless of data set size, since more files will just take longer time, not more memory.

2.3 The output

The folder you specified as fastq_folder will be gradually filled up with new fastq-files. These will get names concatenating the

  • ProjectID
  • SequencingRunID
  • SampleID

separated by "_", for each sample in the metadata.file. There will be 2 fastq files for each sample (R1 and R2 reads).

In addition, the fastq_folder is also filled with reports from the fastqc software, one HTML file for each fastq file. Inspect these to see the lengths and quality of the reads in the various de-multiplexed files. You may then delete these HTML files.

After the de-multiplexing, the script will also output a plot to a file in the work folder (trimming_guide.jpg). This gives you some information about read and amplicon lengths that you may need later. Here is an example:

Example of figure in the file trimming_guide.jpg

The two upper panels show histograms of the lengths of the merged reads and the lengths of the overlapping regions. The lower panel shows how the 10% quantile of the read quality scores distributes across the last bases of both R1 and R2 reads. You typically want to trim the 3’ ends of the reads. Here you can see how much you may trim (upper panels) and how much bad quality you remove by trimming. Note that you typically trim the R2 reads more than the R1 reads, since they usually have poorer quality.

Finally, the metadata table gets a new column called n_readpairs, containing the number of read pairs for each sample. This table is then written to a local text file having the same name as the original, but with the extension .xlsx replaced by .txt. This text file is what you use as input to the downstream pipelines.
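
For illustration, this tab-separated text file can be read back into R like this (using the tutorial file name):

library(tidyverse)
metadata.tbl <- read_delim("metadata_Tutorials.txt", delim = "\t")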

2.4 Deprecated R script

In the folder /mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/deprecated/ you find an R script version of the shell script above (MIDIV_DEMULTIPLEX.R). This is no longer updated on orion due to problems running apptainer from inside R (which is also an apptainer container). If you want to do the de-multiplexing on a Windows computer you may copy and edit this R script, but you are on your own! We make no guarantees on how it will work.




3 The VSEARCH pipeline

After de-multiplexing we process the reads for each sample in order to

  • Find the unique variants of the 16S sequences we have sequenced. We refer to these as centroids.
  • Compute how many reads belong to each centroid in each sample. We refer to this as the readcount table.

There are several tools for doing this job. We first describe the VSEARCH tool, and then some alternatives to this in later sections. The VSEARCH software is open source and free, developed mainly by people in Oslo. You may install this on your own (Windows) computer as well, if you like. Below we focus on how to use this on orion.

Here is a short video illustrating the steps explained below.

3.1 The PIPELINE_VSEARCH.sh

The VSEARCH pipeline contains two slightly different procedures (identity clustering and denoising) for finding the centroids and their corresponding read counts, and both are used here, producing two sets of outputs.

The shell script you need is found in /mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/PIPELINE_VSEARCH.sh. Copy this in the same way you copied the de-multiplexing script above. Open it in RStudio. After the SLURM heading and the introductory text, you should find the section

##############
### Settings
###
vsearch_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/vsearch:2.28.1--h6a68c12_0.sif"
metadata_file=metadata_Tutorials.txt     # The sample table
fastq_folder=SRA_ready                   # Folder with (input) fastq files
tmp_folder=tmp_vsearch                   # Temporary output folder
threads=10                               # Must match --ntasks in heading
trim_right_R1=20                         # Trim off 3' ends before merging
trim_right_R2=60                         # Trim off 3' ends before merging
min_read_length=200                      # Minimum length of R1 or R2 read after trimming, discard those shorter
max_error_probability=0.01               # Average error prob tolerated, P=0.01 corresponds to Q=20 
min_size=2                               # Minimum copy number of centroid sequence
OTU_identity=0.97                        # identity threshold
unoise_alpha=2.0                         # UNOISE model parameter

This is where you may edit.

The vsearch_app specifies how to run the vsearch executable. Here we use an apptainer container on orion.

The metadata_file must contain the path and name of the text version of the metadata. This should have been produced by the de-multiplexing script, see above. Also, the fastq_folder is the same as described for the de-multiplexing. This should now contain 2 fastq files for each sample.

The tmp_folder is just a folder to store temporary files during processing. This folder should be deleted in the end, when everything is done. It may be nice to inspect its content if something crashes, or to see that files are created/deleted during processing.

The threads is the number of threads to use during processing, and can safely be kept at 10 on orion.

The trim_right_R1 and trim_right_R2 set how many bases to trim off the 3’ end of the R1 and R2 reads before merging. During de-multiplexing you should have gotten the file trimming_guide.jpg. From this you know approximately how much the R1 and R2 reads overlap. If the overlap is much longer than 25, you can afford to trim off some of the ends and still have a long overlap. The reason we trim is that reads are always of poorest quality towards the end. By trimming this away, more reads tend to merge. Note that trimming too hard may result in many merged reads, but also many badly merged reads. This is something you typically edit!
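
As a back-of-the-envelope example (the overlap value here is hypothetical; read it off your own trimming_guide.jpg): trimming both reads shortens the overlap by the sum of the two trim values.

median.overlap <- 120                               # typical overlap from the upper right panel (assumed)
trim_right_R1 <- 20
trim_right_R2 <- 60
median.overlap - trim_right_R1 - trim_right_R2      # 40 bases of overlap left after trimming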

The max_error_probability is a threshold for read quality filtering. Reads with an average error probability above this are discarded. A quality score of 20 corresponds to an error probability of 0.01, and this seems like a natural choice. Set this lower to filter harder, or higher to filter more leniently. This replaces the maxee parameter used in dada2 (which is read-length dependent). You rarely edit this.
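
The relation between a Phred quality score Q and the error probability P is P = 10^(-Q/10), so the default value corresponds to Q = 20:

Q <- 20
10^(-Q / 10)      # 0.01, i.e. the default max_error_probability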

The min_size is the minimum number of copies a centroid sequence can have. This should be 2 or more. Singletons (reads only seen once) cannot be trusted. In fact, the UNOISE procedure (see below) implicitly sets this to 8! Note that this is for finding the centroids only; all reads may be assigned to an OTU in the end. You rarely edit this.

The OTU_identity is primarily used for assigning reads in the end. Regardless of how the centroids were found (clustering at 0.97 or denoising), the final assignment of reads to centroids uses this identity threshold. This you may very well edit.

The unoise_alpha is the \(\alpha\) parameter in the UNOISE denoising algorithm. This adjusts how ‘steeply’ the copy number should drop for ‘children’ that belong to a centroid ‘mother’ sequence. See the UNOISE paper for the details on this. You rarely edit this.
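
For reference, the threshold described in the UNOISE paper is \(\beta(d) = 1/2^{\alpha d + 1}\): a sequence with \(d\) differences to a more abundant centroid is treated as noise if its abundance relative to that centroid is at most \(\beta(d)\). A small illustration:

unoise_alpha <- 2.0
beta <- function(d, alpha = unoise_alpha) 1 / 2^(alpha * d + 1)
beta(1:3)      # 0.125 0.03125 0.0078125, the allowed abundance skew at distance 1, 2 and 3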

Edit this part of the shell script, and save it. Run it by:

sbatch PIPELINE_VSEARCH.sh

directly from the Terminal.

The very last step in PIPELINE_VSEARCH.sh is a call to R in order to update the metadata file. This requires the midiv R package again, see the section on MIDIV_DEMULTIPLEX.sh above.

3.2 More on the script content

First, the data for each sample are trimmed, merged and filtered. Note that no primer-removal is done here now! This is different from earlier versions; the primer-removal should now be part of the de-multiplexing step. The results are found in fasta files, one for each sample, named by the SampleID and with the extension .fasta. These are found in the tmp_folder.

Next, all reads from all samples are collected in one big file (all.fasta), and then all unique sequences with at least min_size copies are found (all_derep_minsize.fasta). These files are also in the tmp_folder. All these files may be deleted once the pipeline is finished.

Next, the standard OTU clustering at 97% identity is done, including chimera filtering, as well as the UNOISE denoising algorithm.

Finally some R code is run to update the metadata table.

3.3 Output

From the 97% identity (or whatever identity you choose) clustering we get the two files:

  • centroids_vsearch_OTU.fasta
  • readcounts_vsearch_OTU.tsv

From the denoising, the corresponding files are named

  • centroids_vsearch_ZOTU.fasta
  • readcounts_vsearch_ZOTU.tsv

since the centroids are often referred to as ZOTUs in this case.

In addition, the metadata table is updated with some new columns:

  • The column n_vsearch_merged lists the number of merged reads for each sample
  • The column n_vsearch_reads_OTU is the final number of reads assigned to the 97% OTUs for each sample
  • The column n_vsearch_reads_ZOTU is the final number of reads assigned to the Z-OTUs for each sample

The updated metadata table is stored as the file metadata.tbl.RData in the tmp_folder. It does not overwrite the original metadata text file!
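
If you want to inspect this table without re-running anything, you can load it directly in R. A minimal sketch, assuming the stored object is named metadata.tbl and using the tmp_folder from the settings above:

load(file.path("tmp_vsearch", "metadata.tbl.RData"))      # creates the object metadata.tbl (assumed name)
head(metadata.tbl)                                        # should include the new n_vsearch_* columns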

Note that in both the OTU and ZOTU result files, the OTUs inside are always named OTU1, OTU2,…etc regardless of which file you look into. Thus, even the ZOTU results use the term OTU. This is to make life easier when it comes to data analysis later. We always tag the OTUs/ZOTUs/ASVs or whatever people call them as OTUs!

3.4 Deprecated R scripts

In the folder /mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/deprecated/ you find the R script VSEARCH_PIPELINE.R for running the VSEARCH pipeline from R. In case you want to run VSEARCH locally on a Windows computer, this is a way to proceed. We no longer update this, and it will probably not work properly unless you edit parts of it.

In the same folder you will also find the R script USEARCH_PIPELINE.R. This is very similar to the VSEARCH pipeline but uses the ‘original’ USEARCH software instead. Again, this is no longer updated on orion.




4 The DADA2 pipeline

The dada2 pipeline is a popular alternative for processing amplicon data. It runs entirely in R, and you need the dada2 R package installed. You find this at Bioconductor, see http://www.bioconductor.org/packages/release/bioc/html/dada2.html. Here are also the commands needed to install this package:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("dada2")

Run this directly in the Console window of RStudio. You should probably run the newest available version of R to make this pain free.

Copy the file /mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/PIPELINE_DADA2.R to your working folder. This contains all the R code for running the dada2 pipeline on the same input as using VSEARCH above. On orion, you should also copy the shell script run_dada2.sh.

Here is a short video illustrating the steps explained below.

4.1 Settings

The settings for dada2 are slightly different from the VSEARCH above, but we have tried to make them as similar as possible:

###############
### Settings
###
metadata.file <- "metadata_Tutorials.txt"      # File produced by MIDIV_DEMULTIPLEX.sh
fastq.folder <- "SRA_ready"                    # Folder with de-multiplexed fastq-files, one pair for each sample
tmp.folder <- "tmp_dada2"                      # Folder for temporary output, delete this after you are done
trim.right <- c(20,60)                         # Trim off 3' end before merging
max.error.probability <- c(0.01, 0.02)         # Average error prob tolerated, P=0.01 corresponds to Q = 20 
threads <- 10                                  # Must be the same as --ntasks in the shell script heading. On Windows, set to FALSE

The folders are as for VSEARCH.

The trim.right is here a vector with 2 values; for VSEARCH these were two separate variables. The first value is how much to trim the R1 reads, the second how much to trim the R2 reads.

The max.error.probability values you specify here are not used directly by dada2, but we ‘translate’ them into the corresponding parameter (maxEE) inside the script. The maxEE values are simply these probabilities times the read lengths. Again, the first value is for the R1 reads, the second for the R2 reads.
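
As a small worked example of this translation (the read lengths here are hypothetical, and the exact call inside PIPELINE_DADA2.R may differ):

max.error.probability <- c(0.01, 0.02)
read.lengths <- c(280, 240)              # R1 and R2 lengths after trimming (assumed)
max.error.probability * read.lengths     # c(2.8, 4.8), typically passed as maxEE to dada2::filterAndTrim()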

Notice we have a larger threshold for the R2 reads here! In theory both values should be the same (e.g. 0.01), but we have found this may be too restrictive for R2 reads, hence the larger (0.02) value here. A major difference between dada2 and VSEARCH is that here the R1 and R2 reads are filtered and denoised separately, and only towards the end of the procedure are they merged. Since R2 reads are usually of poorer quality than R1 reads (and merged reads), this may require a ‘softer’ quality threshold for the R2 reads in order not to lose too many read pairs.

Note also that no primer-removal is done here now. This should now be done in the de-multiplexing step; make certain you used the midiv R package version 2.2.0 or later when you de-multiplexed your data above.

4.2 Stored intermediate results

The dada2 procedure is slow, and some steps take a very long time. For this reason, we store many intermediate results here, in the tmp.folder, such that you may go back and re-compute parts without repeating all steps.

The estimation of the parameters in the error models is very time-consuming. The same is the case for the denoising itself. Thus, if you should want to go back and look at the results of these steps in more detail, the pipeline stores these intermediate results in the files err.obj_R1.RData and dada.obj_R1.RData in the tmp.folder, and similar for R2 reads. If you want to plot the results of the error modelling, you run the following R code:

load(file.path(tmp.folder, "err.obj_R1.RData"))
plotErrors(err.obj_R1)
load(file.path(tmp.folder, "err.obj_R2.RData"))
plotErrors(err.obj_R2)

where tmp.folder has the value you specified in the script. Thus, you should not immediately delete the tmp.folder in case you would like to inspect these results.
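
The denoising results can be inspected in the same way, assuming the stored object names follow the file names:

load(file.path(tmp.folder, "dada.obj_R1.RData"))
dada.obj_R1      # prints a summary of the denoised sequence variants for the R1 reads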

4.3 Running the script

The script PIPELINE_DADA2.R may be run directly in RStudio, but it is very slow and when on orion we recommend a standard sbatch to SLURM. Copy the shell script /mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/run_dada2.sh to your working folder. It should look something like:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --mem=50G                      # Larger data sets may need more
#SBATCH --partition=smallmem,hugemem
#SBATCH --time=24:00:00                # Larger data sets may need more
#SBATCH --job-name=dada2
#SBATCH --output=dada2_%j.log


module load R/4.3.1
Rscript PIPELINE_DADA2.R

Note that in the R script you set the number of threads. This should, as always, correspond to the --ntasks set in the shell script heading. Since dada2 is so slow, you may very well need to extend the wall time to more than the 24 hours set here. Also, the memory allocation of 50GB may be too small for very large data sets.

If you run this on your own Windows PC (no shell script), the multithreading no longer works properly, according to the dada2 manual. In this case, set

threads <- FALSE

instead of an integer. Needless to say, this runs even slower than on orion with 10 threads.

You now run the pipeline by

sbatch run_dada2.sh

which will in turn invoke the R script.

4.4 Output

The output is again a set of centroids and a readcount table. There is only one set of each, and there is no need to add either OTU or ZOTU to the file names. The dada2 people like to call the clusters ASVs (instead of OTUs or ZOTUs), but in our output they are all named OTUs. Again, this is to make life easier when we come to the data analysis. An OTU is an OTU is an OTU, regardless of the procedure used here!

Also, the metadata file is updated with some per-sample statistics as for the other pipelines. This is also stored as metadata.tbl.RData in the tmp.folder.




5 The SWARM pipeline

A third alternative for finding the OTUs is clustering with SWARM. This is also a kind of denoising, clustering sequences that differ by only 1 base. To learn the actual theory, read the published article.

The SWARM software is not purpose-made for clustering of 16S amplicon data, and for this reason we need to

  • Use VSEARCH for the initial trimming/filtering/merging of reads
  • Use SWARM to cluster sequences
  • Use R to prune the clusters and compute the read count table

Here is a short video illustrating the steps explained below.

5.1 The pipeline script

Copy /mnt/users/larssn/MIDIV/Amplicon_sequencing/Illumina/PIPELINE_SWARM.sh to your work folder and edit. The Settings look like this:

##############
### Settings
###
swarm_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/swarm:3.1.3--h9f5acd7_1.sif"
vsearch_app="apptainer exec /mnt/users/larssn/MIDIV/Amplicon_sequencing/apptainer/vsearch:2.28.1--h6a68c12_0.sif"
metadata_file=metadata_Tutorials.txt           # The sample table
fastq_folder=SRA_ready                         # Folder with (input) fastq files
tmp_folder=tmp_swarm                           # Temporary output folder
threads=10                                     # Must match --ntasks in heading
trim_right_R1=20                               # Trim off 3' ends before merging
trim_right_R2=60                               # Trim off 3' ends before merging
max_error_probability=0.01                     # Average error prob tolerated, P=0.01 corresponds to Q=20 
min_size=2                                     # Minimum centroid copy number
module load R/4.3.3

Notice we need both VSEARCH and SWARM, both as containers here.

The variables are the same as for the VSEARCH pipeline explained above.

Notice R is used a lot in this script, and you need the packages tidyverse, microseq and magrittr. Make certain you have installed these.

The variable min_size must be mentioned. This now sets the minimum copy number threshold for any centroid, just as in VSEARCH. SWARM does not currently have this implemented, and in this script we use R to ‘prune’ the centroids for this requirement. SWARM still has a tendency to produce too many clusters, and you may set min_size to larger values to reduce this.
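
For illustration only (this is not the actual code in PIPELINE_SWARM.sh), such a pruning could be sketched like this, assuming each centroid header carries a ';size=N' copy-number annotation:

library(tidyverse)
library(microseq)
min_size <- 2
readFasta("centroids_swarm.fasta") %>%
  mutate(size = as.numeric(str_extract(Header, "(?<=size=)[0-9]+"))) %>%
  filter(size >= min_size) %>%
  writeFasta(out.file = "centroids_swarm_pruned.fasta")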

Note that there is no primer-removal here now. This should now be part of the de-multiplexing step above. Make certain you used the midiv R package version 2.2.0 or later when you did the de-multiplexing.

5.2 The computations and output

First, the reads for all samples are trimmed, filtered, merged and de-replicated just as for VSEARCH. The final de-replication is, however, different: the min_size is not used, and all reads, including singletons, are kept. This is needed for the SWARM clustering.

Then the clusters are found using SWARM, and this produces a first version of the centroids_swarm.fasta output file. This also outputs the file uclust.txt in the tmp_folder. This is essential for pruning the centroids and computing the read count table below.

SWARM tends to find a lot of clusters! We have, however, found that clusters where the centroid is a singleton should be discarded, regardless of how many members the cluster has. This is currently implemented in the script as a chunk of R code. However, we are in dialogue with the makers of SWARM about implementing this in the software itself. This would remove the need for this R code, and speed things up considerably.

After pruning, chimera detection is performed on centroids_swarm.fasta using VSEARCH. This results in the final version of centroids_swarm.fasta.

Finally, the read count table is produced, again using a chunk of R code. This outputs the readcount_swarm.tsv file with the read count table, similar to the other pipelines.

Also, similar statistics on the number of reads at various steps in the procedure are added to the metadata table, which is stored in metadata.tbl.RData in the tmp_folder, as for the other pipelines.




6 Further processing and analysis

The output from all the pipelines above is similar:

  • A fasta file with centroids, i.e. the representative sequence for each OTU (same as ASV/ZOTU).
  • A table with read counts for each OTU in each sample (.tsv text file).

The next step in processing these data is typically to do a taxonomic classification of the OTUs based on the centroid sequences. We have solutions for this, described in this document.
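
For orientation, reading these two files into R for downstream analysis could look like this (file names taken from the VSEARCH OTU output above; adjust for the ZOTU, dada2 or SWARM outputs):

library(tidyverse)
library(microseq)
centroids.tbl <- readFasta("centroids_vsearch_OTU.fasta")                 # tibble with Header and Sequence columns
readcounts.tbl <- read_delim("readcounts_vsearch_OTU.tsv", delim = "\t")  # the read count table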