1 Learning goals

  • Be familiar with the basic biases we face with metabarcoding data.
  • Learn how to process raw data to obtain a data structure for further analysis
    • using the vsearch software.
    • using the dada2 R package.
  • Understand the main differences between abundance clustering and denoising.
  • Learn how we can perform taxonomic classification of metabarcoding data.



1.1 Software used in this module





2 Metabarcoding

We are again faced with data from sequencing of a microbial community, but in this setting we no longer sequence whole genomes. Instead, we design primers and use PCR (Polymerase Chain Reaction) to amplify certain particular regions of the genomes, typically a marker gene. The most used marker is the 16S rRNA gene, which is found in all prokaryotes and is quite conserved, but other markers may be used. This all comes down to which primers we use.

We call these amplified pieces of the genomes amplicons. We then sequence these amplicons directly, since they are already quite short fragments. Thus, the fragments we sequence do not come from random positions on the genomes, but from specific regions (genes) that we target with our primers. This leads to data of a different nature, and most of the processing and subsequent analysis differs slightly from what we have seen before.

Why do metabarcoding? The first answer is to estimate community composition. We did this also in Module 9, using tools like kraken2 and bracken on shotgun data. However, shotgun data is not always the best approach for estimating composition. Shotgun sequencing requires more resources, both in sequencing and in data processing. In many cases it is overkill to do shotgun sequencing if we are only interested in community composition; we would rather spend our resources sequencing many samples instead of shotgun sequencing a few. Second, from shotgun data we only recognize taxa we have seen before, sometimes referred to as closed-reference assignment. If we investigate new environments and expect to detect organisms we have not seen before, the metabarcoding approach is usually better, since it allows open-reference assignment (clustering). We can actually detect that we have found something new!

If we have a marker gene that we know is found in all organisms, we try to sequence only this gene. If this gene is slightly different in different organisms, we may try to group or cluster the sequences. The number of groups should reflect the number of organisms. The abundance of each organism should correspond to the number of reads we have from each group. And this does not rely on any recognition of previously seen taxa. For this reason it is typical to use metabarcoding first when digging into ‘unknown terrain’, i.e. microbial communities where we expect to find a number of new and unknown organisms. After we have mapped who is there, we may do shotgun sequencing of the full genomes, to learn more about the genes of the most abundant organisms.



2.2 Biases

Let us consider amplicon sequencing of the 16S rRNA gene, which is by far the most common marker gene. This is how it should ideally work:

We have some microbial community with \(M\) taxa, and composition \(C_1, C_2,...,C_M\). Each taxon has a distinct 16S sequence. Still, all 16S genes in all taxa have some shorter regions in common, where they are identical. We design primers that match these identical regions, and amplify the same pieces of the 16S gene from all taxa. Then we sequence without any sequencing errors. The reads we get should then be of \(M\) distinct variants, and the read count for variant \(i\) should be exactly the fraction \(C_i\) of the total read counts.
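To make this idealized picture concrete, here is a tiny R sketch (all numbers made up) where the read counts are exactly proportional to the composition:

M <- 4                                  # number of taxa
C <- c(0.50, 0.30, 0.15, 0.05)          # community composition, sums to 1
n_reads <- 10000                        # total number of reads sequenced
read_counts <- n_reads * C              # reads per distinct 16S variant, no errors or biases
read_counts / sum(read_counts)          # recovers the composition C exactly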

Well, the world is not perfect. In reality we have to deal with errors at several steps. There are more or less random errors, like sequencing errors, but also biases, i.e. systematic deviations. Common biases are:

  • Primer bias. The short regions in the 16S gene that ‘all have in common’ do not really exist. No region is identical across all taxa, even if some ‘universal’ primers will match a large majority. Thus, the primers will not amplify the 16S gene from all organisms, and some taxa contribute no reads to our data set. They may be in the community, but are never picked up by our amplification+sequencing approach.
  • 16S copy number variations. Some organisms may have 1 copy of this gene in their genome while others may have up to 15-16. This means that some genomes contribute many more reads than others even if they are equally abundant in terms of number of genomes.
  • PCR effects. Amplicon length and GC-content will affect the amplification efficiency, and thus how many reads we get. Another artefact is known as chimera formation. This results in the start of the 16S gene from one taxon being spliced to the end from another taxon, making the read a hybrid that represents neither of the two.

It should be noted that biases like this are not always a problem, even if they are present. The point is that if you want to compare changes in results due to some treatment etc., and the biases are the same in all cases, you may still make valid conclusions about whether changes take place or not. We should, however, realize that we have biased data, and be very careful about how widely we conclude.

2.3 Primers

Let us have a short look at some PCR primers used for 16S amplification. A primer pair is two shorter DNA sequences that will bind to denatured DNA and make a PCR copy of the region between where they bind. Thus, a primer pair always consists of a forward and a reverse primer. For a primer pair to amplify something from a certain genome, the forward primer must match exactly at some part of the genome, the reverse primer must match exactly at another part rather close by, and the region between (and including) these two matching regions is amplified. The question is whether both primers actually match something.

Here are two primer pairs used in a recent study of metabarcoding arctic environments: Comparison of Two 16S rRNA Primers (V3–V4 and V4–V5) for Studies of Arctic Microbial Communities. These are supposed to amplify two slightly different regions of the 16S gene, as suggested by the names I gave them here:

V3V4 <- c("F" = "CCTACGGGNGGCWGCAG", "R" = "GACTACHVGGGTATCTAATCC")
V4V5 <- c("F" = "GTGYCAGCMGCCGCGGTAA", "R" = "CCGYCAATTYMTTTRAGTTT")

The F is the forward and R is the reverse primer in each pair.

Notice these primers contain not only A, C, G and T, but also what we call IUPAC ambiguity symbols, see https://www.bioinformatics.org/sms/iupac.html. For example, the symbol W means A or T. Thus, the primer sequences above actually represent several distinct sequences, and this makes them more universal. In the microseq package there are functions to convert from IUPAC code to a regular expression, and back again. Use iupac2regex() on the primers to get the regular expression text you may search with later:

library(microseq)
V3V4_F.regex <- iupac2regex(V3V4[1])
print(V3V4_F.regex)
## [1] "CCTACGGG[ACGT]GGC[AT]GCAG"

How many distinct sequences does this actually represent? Well, we can see one position with 4 variants and one with 2, i.e. 4*2=8 different combinations.
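We may also let R do this counting for us. Here is a small sketch, assuming the V3V4 object and the microseq package from above, that multiplies the number of alternatives at each position of the regular expression:

library(tidyverse)
library(microseq)

iupac2regex(V3V4[1]) %>%
  str_extract_all("\\[[A-Z]+\\]|[A-Z]") %>%   # split the regex into single bases and [..] groups
  unlist() %>%
  str_remove_all("\\[|\\]") %>%               # keep only the alternative bases at each position
  str_length() %>%                            # number of alternatives at each position
  prod()                                      # 4*2 = 8 for the V3V4 forward primer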

The regular expression text above we can now use to search a 16S DNA sequence for a match. To search with the reverse primer as well, we must first reverse-complement it and then use iupac2regex(). The reverse primer is designed to match on the negative DNA strand, and as long as we only have the positive 16S sequence we must reverse-complement the reverse primers in order to see if they match:

V3V4_R.regex <- reverseComplement(V3V4[2]) %>% 
  iupac2regex()
print(V3V4_R.regex)
## [1] "GGATTAGATACCC[CGT][AGT]GTAGTC"

This one obviously has 9 combinations, and together the two primers have 8*9=72 combinations.

To see if the primer pair actually has a match inside some 16S gene, we now form a regular expression of these two:

V3V4.regex <- str_c(V3V4_F.regex, ".+?", V3V4_R.regex)
print(V3V4.regex)
## [1] "CCTACGGG[ACGT]GGC[AT]GCAG.+?GGATTAGATACCC[CGT][AGT]GTAGTC"

The ".+?" in the middle means ‘any symbols matched number of times until first occurrence of’. Thus, by using this text as a pattern in a text search we now look for region where we

  • Have a match to the forward primer regular expression…
  • Followed by any number of symbols…
  • Until the first match of reverse primer regular expression.

Here we allow any number of bases between the forward and reverse primer. In reality the PCR reaction will only work for rather short fragments, and we could have set an upper limit here; we are not interested in matches thousands of bases long. At the other end, we could also see cases where the reverse primer starts to match before the forward primer has ended its match. Such cases would amplify, but are not picked up by our search above. The primers above are designed to match inside the 16S gene, and both super-long and super-short amplicons are unlikely here, so we stick to our regular expression above.
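As a small aside, here is a made-up example of the difference between the non-greedy ".+?" we use and a greedy ".+":

library(tidyverse)

x <- "AACCGGTTBB-some-more-text-BB"
str_extract(x, "AA.+?BB")   # stops at the first "BB"
str_extract(x, "AA.+BB")    # greedy version, extends to the last "BB"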

A drawback with such a search is that it requires exact matches. We know that even if the 16S sequence has a mismatch or two to the primer, it will still amplify, but more poorly. Still, an exact search gives us an OK estimate of the efficiency of the primers.

2.3.1 Exercise - primer bias

Let us search with the two primer pairs from above in some 16S sequences to see

  • How often do they actually match?
  • How are these matches distributed among taxa?
  • How long are the amplified regions?

We need some 16S sequences to search in. The SILVA database is a public repository of Small Sub Unit (SSU) sequences that we may download. The ‘small sub unit’ means the 16S gene in prokaryotes and the 18S in eukaryotes. The full collection consists of around 2.2 million sequences, and working with this big file is slow. In the file COURSES/BIN310/module10/SILVA_138.1_SSURef_tax_silva_100K.fasta you find a random subset of 100 000 sequences we will use here.

Make an R script where you read in the SILVA sequences. The sequences are stored as RNA instead of DNA! This means a U in all places where DNA has a T. First, mutate() the Sequence column to replace "U" with "T" using str_replace_all(). Next, extract the superkingdom information from the Header column and put it into a new column. Next, filter out all eukaryotes from the data, i.e. remove entries where the superkingdom is "Eukaryota". Then compute the length of all sequences and make a histogram of the sequence lengths. How long are 16S sequences typically?

To search with both the primer pairs from above among these sequences, first use the code above to create the regular expressions, and then use str_extract() to search in the SILVA sequences for matches, using one primer pair at a time. This will extract the amplicon regions, or be NA if no match was found. Store these extracted sequences (amplicons) in two new columns (a V3V4 and a V4V5 column) in the table you got from reading the fasta file above.

How large fraction of the total number of sequences did the primers match?

Split the matches between the superkingdoms Bacteria and Archaea. How well do the primers match the Archaea?

Finally, make histograms of the length of the amplicons you got from the two primer pairs.

2.3.2 Exercise solution

library(tidyverse)
library(microseq)

# Code copied from above
V3V4 <- c("F" = "CCTACGGGNGGCWGCAG", "R" = "GACTACHVGGGTATCTAATCC")
V4V5 <- c("F" = "GTGYCAGCMGCCGCGGTAA", "R" = "CCGYCAATTYMTTTRAGTTT")
V3V4_F.regex <- iupac2regex(V3V4[1])
V3V4_R.regex <- reverseComplement(V3V4[2]) %>%
  iupac2regex()
V3V4.regex <- str_c(V3V4_F.regex, ".+?", V3V4_R.regex)

# Reading silva sequences, and plotting length histogram
silva.tbl <- readFasta("SILVA_138.1_SSURef_tax_silva_100K.fasta") %>%
  mutate(Sequence = str_replace_all(Sequence, "U", "T")) %>%
  mutate(superkingdom = word(Header, 2)) %>%
  mutate(superkingdom = word(superkingdom, 1, sep = ";")) %>%
  filter(superkingdom != "Eukaryota") %>%
  mutate(length = str_length(Sequence))
ggplot(silva.tbl) +
  geom_histogram(aes(x = length), binwidth = 10)

# The other primer pair regular expression
V4V5.regex <- str_c(iupac2regex(V4V5[1]), ".+?", iupac2regex(reverseComplement(V4V5[2])))

# Searching, and putting the resulting amplicons into new column
silva.tbl <- silva.tbl %>%
  mutate(V3V4 = str_extract(Sequence, V3V4.regex)) %>%
  mutate(V4V5 = str_extract(Sequence, V4V5.regex))
cat("In total the V3V4 primers give", sum(!is.na(silva.tbl$V3V4)), "matches out of", nrow(silva.tbl), "\n")
## In total the V3V4 primers give 78054 matches out of 92281
cat("In total the V4V5 primers give", sum(!is.na(silva.tbl$V4V5)), "matches out of", nrow(silva.tbl), "\n")
## In total the V4V5 primers give 82387 matches out of 92281
# Count the number of matches for each superkingdom
sum.tbl <- silva.tbl %>%
  group_by(superkingdom) %>%
  summarise(n = n(), V3V4_matches = sum(!is.na(V3V4)), V4V5_matches = sum(!is.na(V4V5)))
for(i in 1:nrow(sum.tbl)){
  cat(sum.tbl$superkingdom[i], ": V3V4 has", sum.tbl$V3V4_matches[i], "and V4V5 has",
      sum.tbl$V4V5_matches[i], "out of", sum.tbl$n[i], "possible matches\n")
}
## Archaea : V3V4 has 16 and V4V5 has 2791 out of 3167 possible matches
## Bacteria : V3V4 has 78038 and V4V5 has 79596 out of 89114 possible matches
# Computing the amplicon lengths, and plotting their histograms
silva.tbl %>%
  mutate(V3V4_length = str_length(V3V4)) %>%
  mutate(V4V5_length = str_length(V4V5)) %>%
  pivot_longer(cols = c(V3V4_length, V4V5_length), names_to = "primer", values_to = "amplicon_length") %>%
  ggplot() +
  geom_histogram(aes(x = amplicon_length, fill = primer), binwidth = 10) +
  xlim(300, 500)
## Warning: Removed 24160 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 4 rows containing missing values (`geom_bar()`).

2.3.3 Exercise - 16S copy number

We know that different organisms have a different number of 16S sequences in their genomes. We may sometimes want to correct for this, since an organism with many 16S copies will have more reads in the metabarcoding data even if the organism is not more abundant in terms of number of genomes. Let us see how we get hold of this information.

The rrnDB is a public database where the copy numbers of 16S genes are stored. In $COURSES/BIN310/module10/rrnDB-5.8.tsv you find the download file from the latest version of this database. It is, as the name suggests, a simple tab-delimited text file.

Read this into R and select the columns containing the NCBI tax id, the NCBI scientific name and the 16S copy numbers. Rename columns to tax_id, organism_name and n_16S, respectively. Keep only rows where all three columns have values (drop_na()).

In Module 9 we explored the NCBI Taxonomy database a little. Try to add to the table above new columns listing the tax_id values for the ranks species, genus, family, order, class, phylum and superkingdom for each entry. See in Module 9 how we used read_nodes_dmp() from the microclass R package to read the NCBI Taxonomy. Also, use the functions branch_retrieve() and branch_list2table() from the same package.

Compute the mean 16S copy number for all genera (hint: group_by(genus) and summarise()). What are the minimum, maximum and median of these mean values over all genera?

Here is a code skeleton to start out with:

library(tidyverse)
library(microclass)

# Reading the database file, selecting and discarding incomplete entries
rrndb.tbl <- read_delim(___, delim = "\t") %>% 
  select(___ = `NCBI tax id`, organism_name = ___, n_16S = ___) %>% 
  drop_na()

# The taxonomy of all unique tax_id's in the database
tax.tbl <- tibble(tax_id = unique(___))
nodes.tbl <- ___("/mnt/courses/BIN310/module9/NCBI/nodes.dmp")
tax.tbl <- branch_retrieve(___, ___) %>% 
  branch_list2table() %>% 
  bind_cols(tax.tbl)

# Now we join in all the taxonomy results
rrndb.tbl <- rrndb.tbl %>% 
  left_join(___, by = "tax_id")

# Compute some statistics for genera
genera.tbl <- ___ %>% 
  group_by(___) %>% 
  summarise(___ = mean(___))

2.3.4 Exercise solution

library(tidyverse)
library(microclass)

# Reading the database file, selecting and discarding incomplete entries
rrndb.tbl <- read_delim("/mnt/courses/BIN310/module10/rrnDB-5.8.tsv", delim = "\t") %>%
  select(tax_id = `NCBI tax id`, organism_name = `NCBI scientific name`, n_16S = `16S gene count`) %>%
  drop_na()

# The taxonomy of all unique tax_id's in the database
tax.tbl <- tibble(tax_id = unique(rrndb.tbl$tax_id))
nodes.tbl <- read_nodes_dmp("/mnt/courses/BIN310/module9/NCBI/nodes.dmp")
tax.tbl <- branch_retrieve(tax.tbl$tax_id, nodes.tbl) %>%
  branch_list2table() %>%
  bind_cols(tax.tbl)

# Now we join in all the taxonomy results
rrndb.tbl <- rrndb.tbl %>%
  left_join(tax.tbl, by = "tax_id")

# Compute some statistics for genera
genera.tbl <- rrndb.tbl %>%
  group_by(genus) %>%
  summarise(n_16S = mean(n_16S))
cat("Smallest mean copy number across genera =", min(genera.tbl$n_16S), "\n")
cat("Median mean copy number across genera =", median(genera.tbl$n_16S), "\n")
cat("Maximum copy number across genera =", max(genera.tbl$n_16S), "\n")





3 Metabarcoding data processing

Below we will have a look at two procedures for processing raw metabarcoding data. In both cases we start out with a set of fastq files with metabarcoding reads. From these we try to find the representative sequences, i.e. those that represent the actual organisms in our microbial community. Since reads contain errors, this is not straightforward, and different approaches are used. Each representative sequence forms what we denote an Operational Taxonomic Unit (OTU). Next, the remaining reads become members of some OTU, and the read count for each OTU indicates its abundance in the microbial community. This grouping usually involves several steps of raw read processing, and the grouping itself is done in different ways, as we will see.

Once we have the groups or OTUs, we would also like to do a taxonomic classification, to have some taxa names attached to the various groups, if possible. However, the grouping is the main thing; taxonomy assignment is not really required for many downstream analyses.

Why do we group the reads, and then assign to taxa? Why not assign reads to taxa directly? We have seen tools like kraken2 that would do this in no time. The short answer is: we can! If we study a community where all involved taxa are found in our database, there is really no good argument for the detour of grouping. In such cases many will argue it is better to use (shallow) shotgun sequencing, to avoid the biases involved in amplicon sequencing. The point is that in many (most?) cases our database does not contain all taxa. Amplicon studies are best suited for investigating ‘new’ communities. Then, each new OTU is potentially a new organism, even if it does not match anything in our taxonomy database.

We will look at a couple of tools for amplicon data processing. The first, vsearch, represents the ‘classical’ OTU clustering, and there are several tools out there doing more or less the same thing. The second, dada2, represents a denoising idea and groups all reads whose differences are due to sequencing errors only. There are several other tools in use, but we do not have time to explore them all.

3.1 Demultiplexing

To save money, it is common to sequence several samples in one sequencing run. This is called multiplexing. A sequencing run may produce a huge number of reads, and it is overkill to sequence this much from every sample. Prior to sequencing, a shortish DNA sequence denoted a barcode, is added to all fragments (amplicons). This barcode is identical for all fragments from the same sample, but distinctly different from the barcodes in other samples.

The raw reads from the sequencing machine are all mixed in one huge pool, and in order to sort out which reads are from which sample, we first group them by their barcode. These barcode sequences are usually found at the ends of the reads, e.g. the first or last (or both) 8 bases. Sorting reads into groups by their barcode sequence is called de-multiplexing.

We will not do this step here. Often this is done automatically in the sequencing process, and the fastq-files we get are already de-multiplexed. If not, it is simply a matter of having a list of all barcodes, searching for them among all the reads, and writing the reads matching the same barcode to the same fastq-file. There are tools for this purpose, but it is actually pretty straightforward to write your own (R) code for doing this. Since the barcodes are usually rather short we must require exact matching, and searching for exact matches is simple in all programming languages.
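Here is a minimal sketch of such a de-multiplexing, assuming the barcode is the first 8 bases of each read; the barcode sequences and file names are made up for illustration:

library(tidyverse)
library(microseq)

# Assumed barcodes, one per sample (made-up sequences)
barcodes <- c(sample_A = "ACGTACGT", sample_B = "TTGGCCAA")

# Hypothetical fastq file with reads from all samples mixed together
all_reads.tbl <- readFastq("multiplexed_R1.fastq")

# Write reads with an exact barcode match at the 5' end to one fastq file per sample
for(i in 1:length(barcodes)){
  all_reads.tbl %>%
    filter(str_sub(Sequence, 1, 8) == barcodes[i]) %>%
    writeFastq(out.file = str_c(names(barcodes)[i], "_R1.fastq"))
}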

3.2 Data organization

Before we start computing, let us say a few words about data organization.

Metabarcoding data typically involve data from many samples. We will assume there is one pair of fastq-files for each sample (after de-multiplexing). Along with these files there should always be a file with a table, where the samples are listed in the rows. This table should typically contain a column with a short text identifying each sample, columns with the file names for each sample, plus additional columns with extra information about each sample. If samples are from various locations, time points, treatments etc, this is listed in these additional columns. We often refer to the latter as sample data or meta data.

Here is a very short example, with 2 samples only:

sample_id    R1_file                  R2_file                  Gut_part  Hive  Month
Mar.K2.Ile2  Mar_K2_Ile2_R1.fastq.gz  Mar_K2_Ile2_R2.fastq.gz  Ileum     2     March
Jun.K5.Rec5  Jun_K5_Rec5_R1.fastq.gz  Jun_K5_Rec5_R2.fastq.gz  Rectum    5     June

The first three columns are typically something we would always need, i.e. some sample_id with a text that uniquely identifies each sample, and then the names of the two fastq-files (assuming we have paired-end reads) associated with each sample. The remaining columns we may make use of later when we come to the analysis part. In this course we will refer to such a table as the sample.tbl. This table will accompany the processing and subsequent analysis of the metabarcoding data. You may prepare this file in a spreadsheet like Excel, store it as an .xlsx file, and read it into R using the read_excel() function from the readxl package (see exercise below).

As we go along with the reads processing we will compute two additional tables. The most central one is a table of read count values for each OTU from each sample. Finally, additional information about each OTU is stored in a third table. This could be the representative sequence behind each cluster (Header and Sequence), their taxonomic assignments etc. We may summarise this as

  • We start out with a sample table. This has one row for each sample, and a number of columns with information about each sample. This is something we prepare before we start the processing below.
  • The processing gives us a sequence table, a table with one row for each OTU, where we keep the representative sequences, their assigned taxonomy and other things we may compute.
  • The processing also gives us a read count table containing only numbers. This typically has one row for each OTU and one column for each sample, or vice versa.

3.2.1 Exercise - a sample table

In /mnt/courses/BIN310/module10/ you find the Excel-file bin310_samples.xlsx. Make an R script in your module10 folder, where you

  • Read the excel file. Use the read_excel() function from the readxl R package. Store it in an object named sample.tbl.
  • Inspect the table. Notice the column names, and what they contain. Also note how many samples we have.
  • Write the table to a text file in your module10 folder. Use write_delim() and use tab ("\t") as the delimiter between columns. Name the file sample_table.txt.

We will make use of this file below.

3.2.2 Exercise solution

library(tidyverse)
library(readxl)

sample.tbl <- read_excel("/mnt/courses/BIN310/module10/bin310_samples.xlsx")
View(sample.tbl)
write_delim(sample.tbl, delim = "\t", file = "sample_table.txt")





4 Pulling docker containers

We have so far copied all our container software from the website https://depot.galaxyproject.org/singularity/. What if we look for some software that is not found there? We have previously seen how we may use conda as an alternative. Let us now consider another alternative.

We want a container for the software vsearch. If you inspect the galaxy website, you will find such containers, but what could we do if this was not the case?

In addition to apptainer there is another commonly used approach for building containers, namely docker. The reason we do not use docker in BIN310 is that it requires administrator permissions, which we do not have as ordinary users on an HPC. However, containers built for/by docker also work with apptainer!

Here is a short video showing the steps described below.

To find containers for docker, look up the dockerhub website. Here you search for the software you are looking for. Often you get more than one hit, since different developers may have supplied containers for the same (or similar) software. You can sort the list by most recent updates to get the newest versions. You may still have to decide which docker container to go for. In the case of vsearch we select the one supplied by the user torognes (who we happen to know is the main developer behind vsearch, see the GitHub site for this software). Next, click the Tags tab, and you should get a list of versions. The first one is the latest, and we copy the Docker Pull Command from it (right box).

Then we open a Terminal and paste in the copied command. It should look something like this:

docker pull torognes/vsearch

We must now edit this command to use apptainer instead. This means

  • Replace docker by apptainer
  • Add docker:// to the container address

It should look like

apptainer pull docker://torognes/vsearch

Just hit return, and watch the container being pulled, i.e. downloaded and built inside the folder you are in. After it has finished, you should have a .sif file there with the container! This should now work in the same way as the containers we have downloaded from galaxy.

Beware that containers at dockerhub may be old or not contain exactly what the name suggests. I would in most cases prefer to copy from galaxy, but at least we now know of an alternative route to obtaining the software we need.





5 The vsearch read processing

vsearch was originally an open-source version of the software usearch, but now has several additional facilities. It has been implemented to run extremely fast, making use of parallel processing in a smart way. It offers a long range of options, where a typical command line starts with vsearch followed by a range of sub-commands and options. As usual, you may execute it in the Terminal to read its help text. Also, see the GitHub site linked to at the top of this document.

5.1 Part 1 - from fastq to fasta files

The processing of the raw data involves a number of steps, and we will split it into two separate scripts here. The first script takes us from the raw fastq files to a set of fasta files, one for each sample. Here are the steps we will take:

  1. Read the sample table. Then, for each sample
  2. Merge read-pairs
  3. Filter reads based on quality scores
  4. De-replication

Since we perform the exact same steps for each sample, we repeat the same code for each sample, either in a for-loop or by using array jobs. This time we choose the former. Why? In the processing steps, we create some temporary files. If we use array jobs, these temporary files must have unique names, or be stored in unique folders, in order to prevent one array job from overwriting the temporary files of another while both are working. If we process everything sequentially (looping) we process one sample at a time, and overwriting the files from the previous sample is not a problem as long as the final results are in unique files. Thus, with array jobs we could do the processing faster, but we get more temporary files to handle. Since the processing of a single sample is quite fast anyway with vsearch, we settle for the slower solution using a loop and avoid the hassle with unique temporary files. In cases where you have very many samples, you should perhaps consider using array jobs after all.

We also need to read the text file containing the sample table. We produced such a text file in an exercise above, and you should have the file sample_table.txt inside your module10 folder. In Module 7 we processed many genomes instead of many samples, but there we also had a file where each row contained the information we used. We make a similar solution here, using awk to read such files.

You should have inspected the sample table, and found that the R1 and R2 file names do not have a path. Where are the files located? You find them all in $COURSES/BIN310/module10/fastq/.

Here is a shell script where we only process the first sample:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=30G
#SBATCH --time=01:00:00
#SBATCH --job-name=vsearch1
#SBATCH --output=vsearch1_%j.log

#################
### Settings
###
sample_table_file=sample_table.txt
data_folder=$COURSES/BIN310/module10/fastq
vsearch_exe="apptainer exec $HOME/module10/vsearch_latest.sif vsearch"
tmp_folder=$SCRATCH/vsearch
max_error_probability=0.01
threads=10
if [ ! -d $tmp_folder ]
then
  mkdir $tmp_folder
fi

#############################################
### Using awk to read the sample_table.txt
###
sample_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["sample_id"])}' $sample_table_file)
R1_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["R1_file"])}' $sample_table_file)
R2_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["R2_file"])}' $sample_table_file)
n_rows=$(wc $sample_table_file | awk -vj=1 '{print $j}')

#################################
### Processing reads
###
for i in $(seq 2 2)
do
  sample_id=$(echo $sample_id_column | awk -vj=$i '{print $j}')
  R1_file=$(echo $R1_file_column | awk -vj=$i '{print $j}')
  R2_file=$(echo $R2_file_column | awk -vj=$i '{print $j}')
  
  echo "*************************************************************************"
  echo "VSEARCH merging R1 and R2 reads..."
  $vsearch_exe \
    --fastq_mergepairs $data_folder/$R1_file \
    --reverse $data_folder/$R2_file \
    --fastqout $tmp_folder/merged.fq

  echo "*************************************************************************"
  echo "VSEARCH quality filtering..."
  $vsearch_exe \
    --fastq_filter $tmp_folder/merged.fq \
    --fastq_maxee_rate $max_error_probability \
    --fasta_width 0 \
    --fastaout $tmp_folder/filtered.fa
  rm $tmp_folder/merged.fq

  echo "*************************************************************************"
  echo "VSEARCH dereplicating..."
  $vsearch_exe \
    --derep_fulllength $tmp_folder/filtered.fa \
    --minuniquesize 1 \
    --strand plus \
    --sizeout \
    --relabel $sample_id:unique_read \
    --fasta_width 0 \
    --output $tmp_folder/$sample_id.fa
  rm $tmp_folder/filtered.fa
done

Under Settings we set the name of the file having the sample table, the folder where the reads files are located, and also how to execute vsearch using the container we pulled from dockerhub. Since we start vsearch many times in our code, it is convenient to replace the long command line with a short $vsearch_exe.

We also specify a folder for temporary output. All output from this script is actually temporary in the sense it is only used as input in part 2 below. The max_error_probability, given the value 0.01, we will come back to below.

We then read the file $sample_table_file three times, and collect the columns we need. Inside the loop, we collect the exact pieces of information we need from these columns. Note the loop starts at 2, not 1. This is because when reading with awk, the column headers are the first row, and what we usually think of as row i is actually row i+1 in this case.

The first step where we use vsearch is to merge read pairs. The two reads in a pair come from each end of the genomic fragment we sequence, as always. These fragments are now amplicons that we have copied out of the 16S gene. By designing the primers to match at various locations, we get amplicons of certain lengths. We saw this above in an exercise where we did an in silico amplification. These amplicons are typically designed to be less than two times the read lengths, i.e. when we sequence from each end there is a region in the middle where the two reads overlap. Due to this, we can align the two reads, and merge them into one longer read spanning the entire amplicon. Notice that the output from this step is a fastq-file, i.e. we still have quality scores for the merged sequence.

In the next step we filter the reads based on the quality scores. The option --fastq_maxee_rate makes use of the $max_error_probability threshold. We know from before how to convert quality scores to error probabilities (Module 2). Having this probability for each base, we can compute the average over the read, and if this exceeds $max_error_probability we discard the read. Note that an error probability of 0.01 corresponds to a quality score of 20. You may want to increase this threshold if too many reads are discarded. Note that after filtering we store the reads in fasta format, since we no longer need the quality scores.
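To make the criterion concrete, here is a tiny R sketch of the same computation (not how vsearch does it internally), using a made-up Phred+33 quality string:

quality <- "IIIIIIIIIIIIIIIIIIII55555"   # made-up quality string for one read
Q <- utf8ToInt(quality) - 33             # quality scores
p <- 10^(-Q / 10)                        # error probability per base
mean(p)                                  # average error probability for the read
mean(p) <= 0.01                          # keep the read? (the --fastq_maxee_rate idea)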

Finally, reads are de-replicated. This simply means that instead of storing many copies of the exact same read, we store just one copy, and add the size information to its Header line (fasta format). We also use --relabel to add the $sample_id text to the Header line of each read. This is very important! Later we mix all reads from all samples, and then we need this information in order to keep track of which read came from which sample.

Earlier in BIN310 we saw data from shotgun sequencing. There, it almost never happens that two reads are identical, and a de-replication would seem like a waste of time. But for metabarcoding data, all reads come from the same region of the same gene, and for this reason reads are very often identical.
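Conceptually, the de-replication is nothing more than counting identical sequences. Here is a minimal sketch of the same idea in R, assuming a hypothetical fasta file of filtered reads and a made-up sample_id:

library(tidyverse)
library(microseq)

derep.tbl <- readFasta("filtered.fa") %>%                 # hypothetical input file
  count(Sequence, name = "size") %>%                      # copy number of each unique read
  arrange(desc(size)) %>%
  mutate(Header = str_c("sample_x:unique_read", row_number(), ";size=", size)) %>%
  select(Header, Sequence)
# writeFasta(derep.tbl, "dereplicated.fa")                # same type of information as the vsearch output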



5.2 Some words about merging of paired end reads

We know that the R1- and R2-reads are listed as pairs in the fastq files, i.e. read number 1 in both files belong together, etc. When we design the primers we use to amplify our amplicons, we must test how long the amplicons will typically be. In the case of the data we use here, the amplicons are mostly in the range 410-420 bases. Then we sequence with reads of length 250 from each side (Illumina MiSeq). This means we should have an overlap region of at least 80 bases (250 + 250 - 420). Thus, the last 80 bases of the R1-read should be the same sequence as the reverse-complement of the last 80 bases of the R2-read.

Figure 1. Paired-end reads of amplicons will overlap.

We also know there are sequencing errors, and we cannot expect these overlapping sequences to be identical each time. We also know that sequence quality is poorer towards the end of both reads, exactly where we try to match them! Thus, in some cases the merging will be done in a wrong way, and in some cases not at all. These reads are then lost from the downstream analysis. There are many options you may give the fastq_mergepairs subcommand in order to tune the merging. When you run the script above and inspect the log-file, you will find that vsearch gives us some output on how this turned out. Here is a small example:

Merging reads 100%
     28331  Pairs
     25734  Merged (90.8%)
      2597  Not merged (9.2%)

Pairs that failed merging due to various reasons:
        27  multiple potential alignments
      2543  too many differences
        27  alignment score too low, or score drop to high

Statistics of all reads:
    250.00  Mean read length

Statistics of merged reads:
    410.83  Mean fragment length
      1.67  Standard deviation of fragment length
      0.99  Mean expected error in forward sequences
      3.84  Mean expected error in reverse sequences
      1.88  Mean expected error in merged sequences
      0.57  Mean observed errors in merged region of forward sequences
      2.17  Mean observed errors in merged region of reverse sequences
      2.74  Mean observed errors in merged region

We see that around 90% of the read pairs merged here, and that failure to merge was mainly due to too many differences in the overlap region. One obvious way to improve merging may be to trim the reads at the 3’ end, i.e. simply trim the reads at their ‘bad’ end. Since we have plenty of overlapping bases here, we can afford to do this.

Important! Before you trim, make certain you know approximately the length of the overlap region. If you trim too hard, this becomes too short, and nothing merges!

When we merge the R1 and R2 reads, what does the quality sequence look like? In the overlapping region there must be some mix of the quality sequences from both reads. We will not dig into this here, but I recommend reading this article if you want to learn more about it.

When we look at the dada2 software below, we will return to this merging, since this is perhaps where vsearch and dada2 differ the most.

5.2.1 Exercise - inspect the processed reads for the first sample

Make the shell script using the code above, and sbatch it. You should get a fasta file with the processed reads for the first sample in sample_table.txt.

Make an R script where you read in this fasta file, and find answers to the following questions:

  • How many unique reads were found in this sample?
  • What is the largest copy number (size) of any read?
  • How many reads in total survived the processing?
  • Make a histogram of the unique read lengths. What is a typical amplicon length in this case?

5.2.2 Exercise solution

library(tidyverse)
library(microseq)
first.tbl <- readFasta("/mnt/SCRATCH/larssn/vsearch/sample_1A.fa") %>%   # edit path
  mutate(length = str_length(Sequence)) %>%
  mutate(size = as.numeric(word(Header, -1, sep = "="))) %>%
  arrange(desc(size))
cat("There are", nrow(first.tbl), "unique reads in this sample\n")
cat("The largest copy number is", first.tbl$size[1], "\n")
cat("The total number of reads is", sum(first.tbl$size), "\n")
fig <- ggplot(first.tbl) +
  geom_histogram(aes(x = length), binwidth = 1) +
  labs(x = "Amplicon length", y = "Number of unique reads")
print(fig)
cat("The average read length is", sum(first.tbl$length * first.tbl$size) / sum(first.tbl$size), "\n")

Note that the ‘typical’ amplicon length is the same as the ‘typical’ read length after merging of the read pairs. Also, note that to get a correct average read length you cannot in general just take the mean length of the unique reads! You have to weight this by how often each read occurs, i.e. the size. Imagine you have 12 children and measure their length. One is 100, one is 110 and ten of them are 120. There are 3 unique lengths (100, 110, 120), but the mean of all 12 children is not 110! The length 120 has ‘size’ 10 while the other two have ‘size’ 1…
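In R terms, using the numbers from the children example:

length <- c(100, 110, 120)          # the unique lengths
size <- c(1, 1, 10)                 # how many children have each length
mean(length)                        # 110, ignores the sizes
sum(length * size) / sum(size)      # 117.5, the correct mean over all 12 children
weighted.mean(length, size)         # same result, using the built-in function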

5.2.3 Exercise - trimming reads

Add code to the shell script from above, to trim both R1 and R2 reads by 10 bases at their 3’ ends. See the vsearch help on Trimming and filtering for the correct subcommand and options to do this. NB! You need to trim each fastq file separately. Note that this means you need to store the output as two new fastq files, typically in the tmp_folder folder.

Why would we trim the 3’ end of the reads like this? We will discuss this below, after we have used the dada2 to also process the reads.

5.2.4 Exercise solution

Here we only show the code that is added/changed, the rest is above:

#################
### Settings
###
trim_right_R1=10
trim_right_R2=10

#################################
### Processing reads
###
for i in $(seq 2 2)
do
  sample_id=$(echo $sample_id_column | awk -vj=$i '{print $j}')
  R1_file=$(echo $R1_file_column | awk -vj=$i '{print $j}')
  R2_file=$(echo $R2_file_column | awk -vj=$i '{print $j}')

  # The new code for trimming at 3' end (right end)...
  echo "*************************************************************************"
  echo "VSEARCH trimming R1 reads before merging..."
  $vsearch_exe \
    --fastq_filter $data_folder/$R1_file \
    --fastq_stripright $trim_right_R1 \
    --fastq_minlen 0 \
    --fastqout $tmp_folder/trimmed_R1.fq

  echo "*************************************************************************"
  echo "VSEARCH trimming R2 reads before merging..."
  $vsearch_exe \
    --fastq_filter $data_folder/$R2_file \
    --fastq_stripright $trim_right_R2 \
    --fastq_minlen 0 \
    --fastqout $tmp_folder/trimmed_R2.fq

  # And we need to change the merging slightly, now using trimmed_R1.fq and trimmed_R2.fq as input
  echo "*************************************************************************"
  echo "VSEARCH merging R1 and R2 reads..."
  $vsearch_exe \
    --fastq_mergepairs $tmp_folder/trimmed_R1.fq \
    --reverse $tmp_folder/trimmed_R2.fq \
    --fastqout $tmp_folder/merged.fq
  rm $tmp_folder/trimmed_R1.fq
  rm $tmp_folder/trimmed_R2.fq

  #...rest as before
done

5.2.5 Exercise - process all samples

Finally, extend the last script by processing all samples instead of only the first. You should then have a fasta file with processed reads for each sample listed in the sample table.

5.2.6 Exercise solution

Well, this is just a matter of extending the for-loop…

for i in $(seq 2 $n_rows)

Notice that we already have the shell variable n_rows from earlier in the script. This contains the number of rows (lines) in the sample table file (note the shell code we used to read this).

The final part 1 script is the one from the exercises, i.e. where we include the trimming of the 3’ end, and loop over all samples. You need to run this in order to have the files we will work more on in the next section.



5.3 Part 2 - sequence clustering and read counts

We now continue with the fasta files we produced for each sample above.

The following code you may either add to the previous script, to get it all in one single (big) shell script, or you may split them and make two separate scripts that you run in sequence. Here we use the latter solution, having a separate shell script for part 2. Here is a template for this second script:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=30G
#SBATCH --time=01:00:00
#SBATCH --job-name=vsearch2
#SBATCH --output=vsearch2_%j.log

##############
### Settings
###
min_size=2                           # Minimum copy number of centroid sequence
OTU_identity=0.97                    # identity threshold
tmp_folder=$SCRATCH/vsearch          # assume fasta files from part 1 is in here
vsearch_exe="apptainer exec $HOME/module10/vsearch_latest.sif vsearch"
out_folder=$HOME/module10/vsearch    # final results end here
threads=10
if [ ! -d $out_folder ]
then
  mkdir $out_folder
fi

echo "*************************************************************************"
echo "Reads from all samples into one file..."
if [ -f $tmp_folder/all.fasta ]
then
  rm $tmp_folder/all.fasta
fi
cat $tmp_folder/*.fa >> $tmp_folder/all.fasta

echo "*************************************************************************"
echo "VSEARCH dereplicating all sequences..."
$vsearch_exe \
 --derep_fulllength $tmp_folder/all.fasta \
 --minuniquesize $min_size \
 --strand plus \
 --sizein \
 --sizeout \
 --relabel_sha1 \
 --fasta_width 0 \
 --output $tmp_folder/all_derep_minsize.fasta

echo "*************************************************************************"
echo "VSEARCH OTU clustering..."
$vsearch_exe \
 --cluster_size $tmp_folder/all_derep_minsize.fasta \
 --id $OTU_identity \
 --strand plus \
 --sizein \
 --sizeout \
 --relabel OTU \
 --threads $threads \
 --centroids $out_folder/vsearch_centroids.fasta

echo "*************************************************************************"
echo "VSEARCH chimera filtering..."
$vsearch_exe \
 --uchime_denovo $out_folder/vsearch_centroids.fasta \
 --sizein \
 --nonchimeras $tmp_folder/nonchimeras.fasta
mv $tmp_folder/nonchimeras.fasta $out_folder/vsearch_centroids.fasta

echo "*************************************************************************"
echo "VSEARCH assigning reads to OTU's..."
$vsearch_exe \
 --usearch_global $tmp_folder/all.fasta \
 --db $out_folder/vsearch_centroids.fasta \
 --id $OTU_identity \
 --strand plus \
 --sizein \
 --threads $threads \
 --otutabout $out_folder/vsearch_readcounts.txt

After the settings we start by writing all reads from all samples to one large fasta file. Reads from all samples should be considered in the subsequent clustering. The same organisms may be seen in several samples, and using all reads gives us higher read counts for each OTU, which makes the clustering easier. Needless to say, this file may become huge if we have many samples! Luckily vsearch is fast, but memory may be a limitation if you run it on your local computer.

The next step is de-replication of all reads. Here we set the minimum copy number to 2 (the min_size variable). The idea is that reads we only see once in this huge set most likely contain sequencing errors. We do not discard them, but such reads should never be allowed to form a separate cluster. They may later be assigned to some cluster, though. You may increase min_size to even larger values.

Then comes the clustering of all de-replicated sequences. This is the crucial step. Here we decide which reads belong together and form the abundance of each cluster. We typically refer to such clusters as Operational Taxonomic Units (OTUs). In principle we think of them all as taxa, but we may not have a name for them all, and OTU indicates this. The OTU_identity threshold sets the size (‘radius’) of all clusters. The centroid sequence of a cluster is the sequence with the largest copy number in the cluster. This is the representative sequence for that cluster. All other members must then have an identity above OTU_identity to this centroid sequence. The output from the clustering is a fasta file with these centroid sequences (here named vsearch_centroids.fasta).

Then we do a chimera filtering. What is a chimera? This is an artefact of the PCR amplification process we do before sequencing. During this process you may get amplicons that are mixtures of two original amplicons, i.e. the first part is from one organism and the second part is from another. Such reads should be discarded. Here we output the non-chimera sequences to a temporary file, and then move this to overwrite the file vsearch_centroids.fasta. The latter should now be without chimeras.

Finally, we assign all reads to the OTUs we found above. We use the fasta file with all reads (all.fasta) that we created above, i.e. all reads are now considered, also those occurring only once, or those that are chimeras. The point is that each read is compared to the centroids, and if it is more than OTU_identity similar to some centroid, it becomes a member of that OTU. In this step we rely on each read having the $sample_id text in its Header line. Thus, the line --relabel $sample_id:unique_read in the last part of the first script above is crucial! Without it, it would now be impossible to know which sample the various reads come from. For this reason, you cannot use the de-replicated reads in this assignment, since then this information is lost.

The output from the assignment is a table of read counts. The first column lists all OTUs, i.e. the text you find in the Header lines of the centroids, and then there is one column for each sample. The numbers are read counts, i.e. how many reads each OTU has in each sample. Each row is an OTU, each column a sample. This is saved in the file vsearch_readcounts.txt as a tab-delimited text file.

5.3.1 Exercise - the sequence table

Read the file vsearch_centroids.fasta into R and name the object sequence.tbl. Add a new column clustered_size and fill it with the size numbers you find inside the Header texts (size=xxx). Then, remove this size information from the Headers, leaving only the OTU identifier. Sort the table by clustered_size, largest values first. For this particular data set we know there are exactly 27 species in the microbial community. How does this compare to the number of clusters/OTUs/taxa we got here? What might be the reasons for any discrepancy?

5.3.2 Exercise solution

library(tidyverse)
library(microseq)

sequence.tbl <- readFasta("~/module10/vsearch/vsearch_centroids.fasta") %>%   # edit path
  mutate(clustered_size = str_remove(Header, "OTU[0-9]+;size=")) %>%
  mutate(clustered_size = as.numeric(clustered_size)) %>%
  mutate(Header = str_remove(Header, ";size=[0-9]+")) %>%
  arrange(desc(clustered_size))
View(sequence.tbl)

We find there are 25 OTUs. The reason we miss two (there should be 27) could be that two of these taxa have too low abundance to be detected. Another possible explanation is that some are too similar, and fall inside the same OTU, given the 97% identity ‘radius’.

5.3.3 Exercise - centroid sizes

Read the file $tmp_folder/all_derep_minsize.fasta into R and store as demulti.tbl. Mutate in a new column called centroid_size and fill this with the size information in the Header texts, just as you did above.

We want to add this centroid_size information to the sequence.tbl from the previous exercise. Try to do this. Hints: Select away the Header column from demulti.tbl, then join the two tables by Sequence, and make certain you use left_join() or right_join() such that you only end up with the sequences in the original sequence.tbl from the previous exercise (the demulti.tbl will typically have many more rows).

Then, make a plot of centroid_size versus clustered_size and use the OTU-names instead of markers (use geom_text()). The plot should look something like this:

Why are clustered_size’s always larger than centroid_size’s?

5.3.4 Exercise solution

Assume you run the code in the previous exercise first.

library(microseq)

demulti.tbl <- readFasta("/mnt/SCRATCH/larssn/vsearch/all_derep_minsize.fasta") %>%  # edit path
  mutate(centroid_size = str_remove(Header, ".+;size=")) %>%
  mutate(centroid_size = as.numeric(centroid_size)) %>%
  arrange(desc(centroid_size))
sequence.tbl <- demulti.tbl %>%
  select(-Header) %>%
  right_join(sequence.tbl, by = "Sequence")
ggplot(sequence.tbl) +
  geom_text(aes(x = centroid_size, y = clustered_size, label = Header)) +
  labs(x = "Centroid sizes", y = "Clustered sizes")

The centroid_size is the number of copies of the centroid sequence. The clustered_size is the sum of such sizes for all sequences clustered together in the same OTU, and this is why it must be at least as large as the centroid_size. Note that some OTU’s have a rather large clustered size, but still a very small centroid size. This means many different sequences have been clustered.

5.3.5 Exercise - the read count table

Read the file vsearch_readcounts.txt into R, and name it readcount.tbl. Rename the first column to OTU only. Next, compute the total read count for each OTU, and store this as a new column named total_size. Arrange the readcount.tbl by this column, largest values first. Add the total_size data to the sequence.tbl from above, again by joining the readcount.tbl with the sequence.tbl. What is the difference between the centroid_size, the clustered_size and the total_size for an OTU?

Hint: The function rowSums() will also work on a table, as long as all columns you sum across are numeric.

5.3.6 Exercise solution

Assume you run the code in the previous exercise first.

readcount.tbl <- read_delim("~/module10/vsearch/vsearch_readcounts.txt", delim = "\t") %>%  # edit path
  rename(OTU = `#OTU ID`) %>%
  mutate(total_size = rowSums(.[,-1])) %>%
  arrange(desc(total_size))
sequence.tbl <- readcount.tbl %>%
  select(OTU, total_size) %>%
  right_join(sequence.tbl, by = c("OTU" = "Header"))

The total_size is even bigger than the clustered_size because we now assigned all reads to the OTUs. During clustering we only worked with the de-replicated reads that have a minimum size (the min_size).

5.3.7 Exercise - updating the sample table

In the original sample table, we would like to add a new column named vsearch_readpairs, containing the total read count for each sample (not each OTU).

Read the table into R, and from the readcount.tbl compute the total read counts for each sample, and add this to the sample table. Then, save this extended version of the sample table to the file using the same file name.

Hint: The statement summarise(across(everything(), sum)) may be useful here. The summarise() allows you to use a function across an entire column (here we use the function sum()). If you want to use this function across several columns, use it inside across(). Inside this you need to specify which columns, and everything() indicates all columns. You may in general select columns in the same way as in select().

5.3.8 Exercise solution

library(tidyverse)

sample.tbl <- read_delim("~/module10/sample_table.txt")                                     # edit path
readcount.tbl <- read_delim("~/module10/vsearch/vsearch_readcounts.txt", delim = "\t") %>%  # edit path
  rename(OTU = `#OTU ID`)
sum.tbl <- readcount.tbl %>%
  select(-OTU) %>%
  summarise(across(everything(), sum)) %>%
  pivot_longer(cols = everything(), names_to = "sample_id", values_to = "vsearch_readpairs")
sample.tbl <- left_join(sample.tbl, sum.tbl, by = "sample_id")
write_delim(sample.tbl, delim = "\t", file = "sample_table.txt")





6 The dada2 pipeline

This is an alternative to the vsearch processing we did above.

The dada2 R package is one of the most difficult to install, and for this reason the IT people have already installed it for us. But, as we will see below, we need to load a certain module when running it from a shell script. Installing dada2 on your local computer is also possible, but in my experience this is not straightforward either.

There is a dada2 tutorial on the web that you may have a look at. In the code below I have sampled from this, but still made things slightly different in order to be more similar to what we did with vsearch.

6.1 Part 1 - mainly filtering

Below we will build one single R script with the code for processing reads with dada2, but we will build it step by step. We also make some temporary investigations below, with code you should not have in this dada2 script, but in some separate R script. Pay attention to this as you read on.

The first part of our dada2 R script contains some settings, reading the sample table and then some filtering/trimming of reads. We may also inspect the quality of the raw data. Note that this is now an R script, not a shell script, but we try to give it a similar structure. Put the following code into an R script and save it in your module10 folder as dada2.R:

library(tidyverse)
library(dada2)
library(microseq)

#############
### Settings
###
sample_table_file <- "sample_table.txt"
data_folder <- "/mnt/courses/BIN310/module10/fastq"
out_folder <- "dada2"
tmp_folder <- "/mnt/SCRATCH/larssn/dada2"           # edit this
threads <- 10
trim_right <- c(10, 10)                             # trim reads at 3' end as we did with vsearch
max_expected_error <- c(2.5, 2.5)                   # EE = expected errors
if(!dir.exists(out_folder)) dir.create(out_folder)  # creates out_folder if not already existing

################################
### Reading the sample table
###
sample.tbl <- read_delim(sample_table_file, delim = "\t")

############################
### Filter and trim reads
###
filt_R1_file <- file.path(tmp_folder, str_c("filtered_", sample.tbl$R1_file))
filt_R2_file <- file.path(tmp_folder, str_c("filtered_", sample.tbl$R2_file))
names(filt_R1_file) <- sample.tbl$sample_id
names(filt_R2_file) <- sample.tbl$sample_id
flt.tbl <- filterAndTrim(file.path(data_folder, sample.tbl$R1_file), filt_R1_file,
                         file.path(data_folder, sample.tbl$R2_file), filt_R2_file,
                         trimRight = trim_right,
                         maxEE = max_expected_error,
                         compress = F,
                         multithread = threads)

Notice that instead of looping over samples, as we did with vsearch, dada2 will handle all samples in one go. This means we also get a new set of fastq files stored in our tmp_folder. The names of these files (in filt_R1_file and filt_R2_file) are given as input to filterAndTrim(). Note that we added names to these vectors! The vector filt_R1_file contains 15 texts (file names). By using the names() function, we add another text as a ‘label’ to each of these vector elements. These labels are the sample_ids. This is not mandatory, but by doing this all samples will be labelled by their sample_id below. If we did not do this, the file names themselves would be used as sample identifiers. Since we have sample_ids in our sample table we should use them.

We need to say something about the max_expected_error and the maxEE option in dada2. With vsearch we specified the max_error_probability when filtering. This is the tolerated average error probability of a read: from the quality scores we compute the per-base error probabilities and average these over all bases in the read. Here we instead specify the expected error, i.e. the expected number of errors in a read. Instead of averaging the per-base error probabilities, we sum them (equivalently, the average error probability multiplied by the read length). This sum must not exceed the value we set in max_expected_error. Think about this: a sum gets larger the more elements you sum, since all probabilities are positive. Thus, here we ‘punish’ long reads! If reads are trimmed to be super short, they will all pass this threshold. This would not happen in vsearch, where we average, regardless of how long the reads are. Here we use a max_expected_error of 2.5, which corresponds to an average error probability of 0.01 if the reads are around 250 bases long. It is a mystery to me why dada2 uses a sum threshold instead of considering the average. A small calculation illustrating the difference is shown below.
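To make the difference concrete, here is a small back-of-the-envelope calculation with a made-up read, showing how the same quality scores translate into an average error probability versus an expected error sum:

# Sketch: average error probability vs expected errors for a made-up read
Q <- rep(30, 250)            # a hypothetical read of 250 bases, all with quality score 30
p <- 10^(-Q / 10)            # per-base error probabilities from the Phred scores
mean(p)                      # average error probability: 0.001 (what vsearch thresholds)
sum(p)                       # expected errors: 0.25 (what dada2's maxEE thresholds)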

Save the script above in a file named dada2.R in your module10 folder.

Since this is an R script, we may in principle run it directly in RStudio. Don’t do this! The reason is that we do not have many threads or much memory available in RStudio, and dada2 is slooooow! Therefore, we make a shell script that does nothing but start the R script. This allows us to sbatch it to SLURM and run it on a computing node with the required number of threads and memory. Here is the simple script:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=50G
#SBATCH --time=05:00:00
#SBATCH --job-name=dada2
#SBATCH --output=dada2_%j.log

##################################
### Starting the dada2 R script
###
module load GCC/11.3.0
Rscript $HOME/module10/dada2.R

Save this as run_dada2.sh in your module10 folder.

Notice we module load GCC/11.3.0, and we never load R! This is a special arrangement we have to make on the BIN310 servers in order to make dada2 work. If you add a usual module load R to your script, dada2 will not be found! As an ordinary user on orion you would install the dada2 package yourself, and then load R in the usual way when you use it. For now we will have to stick to this special solution in order to make dada2 run on the cn-18 or cn-19 servers.

Note that inside RStudio you can library(dada2) as usual; this extra loading of GCC is only required in the shell script.

Start the dada2 script by simply

sbatch run_dada2.sh

6.1.1 Exercise - read quality plot

The dada2 package has a function for plotting read quality scores from fastq files. Let us plot this for the first R2 file, before and after the filtering above.

The function you use is plotQualityProfile(). You need to give it as input a vector with the file names of the fastq files you would like to plot. Use the first R2 file (conditionA_sample1_R2.fastq.gz), and the corresponding file after filtering. Remember to add the correct paths.

Look up the help file for this function to get some explanations of the colored lines on the plot.

6.1.2 Exercise solution

library(tidyverse)
library(dada2)

# The folders specified in the script above
data_folder <- "/mnt/courses/BIN310/module10/fastq"
tmp_folder <- "/mnt/SCRATCH/larssn/dada2"            # edit this

# The sample table and the names of the filtered files, as in the script above
sample.tbl <- read_delim("sample_table.txt", delim = "\t")
filt_R2_file <- file.path(tmp_folder, str_c("filtered_", sample.tbl$R2_file))

# Putting files into a vector, and plotting
fq.files <- c(file.path(data_folder, sample.tbl$R2_file[1]),
              filt_R2_file[1])
fig <- plotQualityProfile(fq.files)
print(fig)

Put this into a separate script, since we do not want this plotting each time we run the main dada2.R script from earlier.



6.2 Part 2 - the denoising

The main idea behind dada2 is to estimate the level of sequencing error, and then use this when grouping the sequences. Only sequences whose differences can be explained by sequencing errors should end up in the same group; variants that differ more than errors can account for are split into separate groups. A cartoon of this reasoning is sketched below.
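Here is a small cartoon of that reasoning, with made-up numbers. This is not the actual dada2 algorithm, but it gives the flavour of how an error model can be used to judge whether a rare sequence variant is real or just noise:

# Cartoon of the denoising idea (made-up numbers, not the actual dada2 algorithm)
abundant_reads <- 5000     # reads supporting an abundant sequence variant
p_err <- 1e-4              # assumed probability, from an error model, that a read from
                           # the abundant variant is mis-read into one specific rare variant
observed <- 12             # reads actually observed for that rare variant
lambda <- abundant_reads * p_err
# probability of seeing at least `observed` such reads by sequencing error alone (Poisson)
ppois(observed - 1, lambda, lower.tail = FALSE)
# a tiny probability suggests the rare variant is a real sequence, not just noise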

First, we need to estimate (=learn) the error rates, i.e. how likely each substitution from one nucleotide to another is, as a function of the quality score. This step will in general take some time, and we therefore store the resulting objects in an RData file that we may load later without re-computing them. Add the following to the dada2.R script:

#################################
### Estimating error rate models
###
err.obj_R1 <- learnErrors(filt_R1_file, multithread = threads)
err.obj_R2 <- learnErrors(filt_R2_file, multithread = threads)
save(err.obj_R1, err.obj_R2, file = file.path(tmp_folder, "error.objects.RData"))

Note that we train an error rate model for the R1 and R2 reads separately, resulting in two different error objects.

The next step is to de-replicate, which is also done separately for R1 and R2 reads. Add to the dada2.R script:

###########################
### De-replicating reads
###
derep.obj_R1 <- derepFastq(filt_R1_file, verbose = T)
derep.obj_R2 <- derepFastq(filt_R2_file, verbose = T)
save(derep.obj_R1, derep.obj_R2, file = file.path(tmp_folder, "derep.objects.RData"))

Again we save the resulting objects in an RData file in case we want to inspect them later.

Finally, the denoising, which is the very essence of the dada2 pipeline. Add to the dada2.R script:

################
### Denoising
###
dada.obj_R1 <- dada(derep.obj_R1, err.obj_R1, multithread = threads)
dada.obj_R2 <- dada(derep.obj_R2, err.obj_R2, multithread = threads)
save(dada.obj_R1, dada.obj_R2, file = file.path(tmp_folder, "dada.objects.RData"))

The three steps above are the time-consuming ones when running dada2. With our small example data set they do not take very long, but with hundreds of samples, and many reads, this takes hours (or days). This is why we choose to store all the resulting objects in the tmp_folder.

Add the code above to the dada2.R script from earlier, and sbatch again. This will take some minutes to complete. Verify you have the three .RData files in your tmp_folder.
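A quick way to verify this, e.g. from the R console in RStudio, is to list the files in the tmp_folder (same path as assumed in the settings above):

# Quick check that the three RData files were written to the tmp_folder
tmp_folder <- "/mnt/SCRATCH/larssn/dada2"   # edit this
list.files(tmp_folder, pattern = "\\.RData$")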

6.2.1 Exercise - plotting errors

The function plotErrors() may be used to plot the error structure estimated by dada2. Load the error objects from above (use load(<RData-file>)) and plot each of the two objects with plotErrors(). Read the help text to understand what is displayed.

6.2.2 Exercise solution

library(tidyverse)
library(dada2)

# From dada2 script
tmp_folder <- "/mnt/SCRATCH/larssn/dada2"

# loading RData file and plotting
load(file.path(tmp_folder, "error.objects.RData"))
r1.fig <- plotErrors(err.obj_R1)
print(r1.fig)
r2.fig <- plotErrors(err.obj_R2)
print(r2.fig)

Again, put this into a separate script, since we do not want this plotting each time we run the dada2.R script.



6.3 Part 3 - merging of reads

Up to this point all processing has been done for R1 and R2 reads separately. Only now do we merge them! Note the big difference from the vsearch pipeline, where we merged the read pairs almost immediately. Add to the dada2.R script:

#####################
### Merging reads
###
merged.obj <- mergePairs(dada.obj_R1, derep.obj_R1,
                         dada.obj_R2, derep.obj_R2,
                         verbose = T)

Let us now stop and inspect this, since this is a step which is somewhat difficult to follow.

The code below in this section should not be added to your dada2.R script; put it somewhere else (another R script), since it is something we do only here and now.

Let us focus on the very first sample only, with sample_id equal to sample_1A. If we look into the dada.obj_R1 and dada.obj_R2 objects from above, we find there is one object for each sample inside them. For each sample, among other things, something called denoised is listed. This contains the number of reads in each denoised cluster. For the first sample it is:

library(tidyverse)
library(dada2)

load("/mnt/SCRATCH/larssn/dada2/dada.objects.RData")      # edit path
denoised_R1_readcounts <- dada.obj_R1$sample_1A$denoised
denoised_R2_readcounts <- dada.obj_R2$sample_1A$denoised

We notice that for the R1 reads there are 32 clusters with varying read counts, while for R2 there are only 21 clusters. Since R2 reads typically have more errors, this is perhaps not surprising. The name of each element in these vectors is the actual sequence, which makes them horrible to display. Let us remove these names and then display them:

names(denoised_R1_readcounts) <- NULL
print(denoised_R1_readcounts)
names(denoised_R2_readcounts) <- NULL
print(denoised_R2_readcounts)
##  [1] 4675 2024 1741 1511 1444 1365 1338 1064  846  867  684  361  341  295  299
## [16]  266  263  236  190  136  149  105   57   77   56   46   41   21   72  150
## [31]   70   38
##  [1] 2215 4954 1665 1612 1392 1701 1517  902  793 1052  219  275  251   74  908
## [16]  175  273  361  122  244   64

Notice that for this sample the first R1 cluster contains 4675 reads, while the first R2 cluster contains only 2215. And again, there are 32 clusters of denoised R1 reads, but only 21 clusters of denoised R2 reads.

Let us now run the merging only, and we load() the previous results from the tmp_folder to use as input:

load("/mnt/SCRATCH/larssn/dada2/dada.objects.RData")   # edit path
load("/mnt/SCRATCH/larssn/dada2/derep.objects.RData")  # edit path

merged.obj <- mergePairs(dada.obj_R1, derep.obj_R1,
                         dada.obj_R2, derep.obj_R2,
                         verbose = T)

Now, if we inspect the merged.obj we find one component for each sample, similar to what we saw for the dada objects above. The vector

merged_readcounts <- merged.obj$sample_1A$abundance
##  [1] 4632 1986 1640 1470 1391 1317 1200 1029  821  816  684  296  272  239  236
## [16]  172  131   72   68   44   22    7    5    4

contains the read counts for each merged cluster. Note that we have 24 clusters after merging. By merging the 32 R1 clusters with the 21 R2 clusters we end up with 24 merged clusters. How is this possible?

The vectors forward and reverse indicate which clusters from R1 and R2 have been merged to form the resulting merged clusters:

R1.idx <- merged.obj$sample_1A$forward
print(R1.idx[1:5])                      # inspect the merging of first 5 clusters
R2.idx <- merged.obj$sample_1A$reverse
print(R2.idx[1:5])                      # inspect the merging of first 5 clusters
## [1] 1 2 3 4 5
## [1] 2 1 3 6 5

We see that cluster 1 from R1 has been merged with cluster 2 from R2. We saw from denoised_R1_readcounts above that R1 cluster number 1 (denoised_R1_readcounts[1]) has 4675 reads, while the corresponding R2 cluster (denoised_R2_readcounts[2]) has 4954. After merging, the cluster has 4632 reads. What happened to the leftover reads from the two clusters? This seems like a strange way of merging.

If we look through these index vectors, we notice some clusters are merged more than once, e.g. cluster 3 from R2 is merged with three different R1 clusters! See the small inspection code below.
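To see the whole mapping at once, we can (still in the separate exploratory script) put the index vectors and the merged read counts side by side in a small table; repeated indices then reveal clusters that take part in more than one merged cluster. This uses only the objects defined above:

# Inspecting the full R1/R2 cluster mapping for sample_1A (uses objects from above)
mapping.tbl <- tibble(merged_cluster = seq_along(R1.idx),
                      R1_cluster = R1.idx,
                      R2_cluster = R2.idx,
                      reads = merged_readcounts)
print(mapping.tbl)
table(R2.idx)    # how many merged clusters each R2 cluster contributes to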

The argument for merging after denoising is that the merging gets safer once both the R1 and R2 reads have been denoised, i.e. are free from sequencing errors. But is this really the case? The denoising is not perfect. We must expect to lose some reads in this step, but that was also the case when merging was done early, as in the vsearch pipeline. The big issue with the dada2 merging, and what seems strange, is that we seem to lose all information about read pairs. When we start out, we know that the R1 and R2 reads are arranged in the fastq files in matching order. There is no such information left in the objects we use for merging here. It feels like we are not using an obvious piece of information, but whether this has a substantial impact would take some effort to investigate.



6.4 Part 4 - the final tables

Let us for now accept that the merging in dada2 is a little strange, and return to the script for processing the reads.

The read count table is made from the merged object using makeSequenceTable(), and there is also a function for filtering chimeras right away. Add this to the dada2.R script from above:

#############################################
### Readcount matrix and chimera filtering
###
readcount.mat <- makeSequenceTable(merged.obj)
readcount.mat <- removeBimeraDenovo(readcount.mat,
                                    method = "consensus",
                                    multithread = threads,
                                    verbose = T)

The reason we name the resulting object readcount.mat, and not readcount.tbl, is that this is a matrix, not a table! It is also flipped/transposed compared to the table we made in our vsearch pipeline, i.e. samples are in the rows and OTUs, or ASVs as they are denoted in dada2 language, are in the columns. Also, instead of giving a short text identifier to each ASV, the sequence itself is the ‘name’ of it! This makes it horrible to inspect!

Let us immediately change this to a better format, and instead put the sequences into a sequence.tbl like we did for vsearch. Add this to the dada2.R script:

##########################
### The sequence table
###
sequence.tbl <- tibble(Header = str_c("ASV_", 1:ncol(readcount.mat)),
                       Sequence = colnames(readcount.mat))
writeFasta(sequence.tbl, out.file = file.path(out_folder, "dada2_centroids.fasta"))

########################
### The readcount table
###
colnames(readcount.mat) <- sequence.tbl$Header
readcount.tbl <- t(readcount.mat) %>% 
  as_tibble(rownames = "ASV")
write_delim(readcount.tbl, delim = "\t", file = file.path(out_folder, "dada2_readcounts.txt"))

where we also transposed the matrix to have the ASVs in the rows, just to have it in the same orientation as we did for vsearch.

To summarise, dada2 also results in a sequence.tbl and a readcount.tbl with the same structure as for vsearch, but both the sequences and the read counts will differ (slightly).

Having the complete script, run dada2 again (using the run_dada2.sh script) and verify that you get a folder named dada2/ in your module10 folder, with a centroid file and a read count table file, just as for vsearch.

6.4.1 Exercise - updating the sample table

After we used vsearch we added a column to the sample table, containing the number of read pairs in each sample after processing (vsearch_readpairs, see above).

Read the dada2 readcount table into R, compute the corresponding total read counts for each sample, and add this to the sample table as a column dada2_readpairs. Then, save this extended version of the sample table using the same file name as before.

Compare the numbers from vsearch and dada2. Is there a difference in how many reads ‘survive’ the two pipelines?

6.4.2 Exercise solution

library(tidyverse)

sample.tbl <- read_delim("~/module10/sample_table.txt", delim = "\t")               # edit path
readcount.tbl <- read_delim("~/module10/dada2/dada2_readcounts.txt", delim = "\t")  # edit path
sum.tbl <- readcount.tbl %>%
  select(-ASV) %>%
  summarise(across(everything(), sum)) %>%
  pivot_longer(cols = everything(), names_to = "sample_id", values_to = "dada2_readpairs")
sample.tbl <- left_join(sample.tbl, sum.tbl, by = "sample_id")
write_delim(sample.tbl, delim = "\t", file = "sample_table.txt")

6.4.3 Exercise - comparing the centroids to the actual amplicons

The data set we have been using here is special in the sense that we happen to know exactly what the actual amplicons look like. This is of course rarely the case, but here it allows us to compare the centroid sequences found by vsearch and dada2 to the ‘gold standard’ sequences and see how close to the ‘truth’ they came.

In the file $COURSES/BIN310/module10/amplicons.fasta you find the actual amplicons. Read this file into R and inspect it. You find there are 27 different species, but some species have more than one variant of the 16S amplicon. This is not uncommon. When we say that a genome has several ‘copies’ of the 16S, they are not always identical copies, but may have some small variation. Here we have in total 38 variants from the 27 species.

Read also the centroid (fasta) files we got from vsearch and dada2 into R. Compute how many of their sequences have an exact match to some of the amplicons. Hint: Use the match() function in R.

  • How many of the OTUs and ASVs have an exact match against some amplicon?
  • Are there cases where two different OTUs/ASVs have perfect matches in two amplicon variants from the same species?

6.4.4 Exercise solution

library(tidyverse)
library(microseq)

### Read the sequences
amplicon.tbl <- readFasta("/mnt/courses/BIN310/module10/amplicons.fasta")
vsearch.tbl <- readFasta("vsearch/vsearch_centroids.fasta")
dada2.tbl <- readFasta("dada2/dada2_centroids.fasta")

## The vsearch comparison
vsearch.hit <- match(vsearch.tbl$Sequence, amplicon.tbl$Sequence)
cat("vsearch has", sum(!is.na(vsearch.hit)), "perfect matches to amplicons\n")
cat("The hits are:\n")
print(sort(amplicon.tbl$Header[vsearch.hit[!is.na(vsearch.hit)]]))

## The dada2 comparison
dada2.hit <- match(dada2.tbl$Sequence, amplicon.tbl$Sequence)
cat("dada2 has", sum(!is.na(dada2.hit)), "perfect matches to amplicons\n")
cat("The hits are:\n")
print(sort(amplicon.tbl$Header[dada2.hit[!is.na(dada2.hit)]]))

### Amplicons without perfect matches
with.hits <- sort(unique(c(vsearch.hit, dada2.hit)))
without.hits <- which(!(1:nrow(amplicon.tbl) %in% with.hits))
cat("Amplicons without hits:\n")
print(amplicon.tbl$Header[without.hits])

Remember that here we only consider perfect matches. It is likely that those OTUs/ASVs without perfect matches have a high similarity to some of the amplicons listed at the end above. To reveal this, we would have to do some BLASTing, as in assignment 2, or something along those lines.





7 Taxonomic classifications

Whether we use vsearch or dada2 or any other procedure, the 16S amplicon data processing results in a sequence.tbl with the representative sequence for each OTU/ASV, and a corresponding readcount.tbl. Even if it is not strictly necessary, it is customary to assign some taxonomic labels to each of these OTUs or ASVs, based on their sequences in the sequence.tbl.



7.1 Databases for 16S

We know from module 9 that in order to assign taxonomy to some sequence, we need a database of previously known sequences and their corresponding taxonomy. There are some huge repositories for 16S data, e.g. SILVA, RDP or Greengenes. Note that while both SILVA and RDP contain full-length 16S sequences, Greengenes2 has only amplicons. Thus, to use the latter you need to have amplified more or less the same region of the 16S as the sequences in the database.

Again we face the potential problems due to the lack of an official taxonomy for prokaryotes. You will find that these databases may contain the same sequences, but assign them different taxon names. Here is an article discussing this issue. Also, the naming changes over time, which makes the entire taxonomy problematic. Remember this when you work with the taxonomy of prokaryotes.

Since our metabarcoding data cover only a small part of the 16S gene, it is not possible to discriminate between very closely related organisms. The 16S gene is highly conserved, and even if two genomes differ a lot in their full genome content, they may still be extremely similar in their 16S genes. For this reason, it is difficult to classify such sequences reliably below the genus rank. We may be able to separate species in some cases, but this also depends on how long our amplicons are, and which part of the 16S gene we have amplified.

Below we will use tools for classification of 16S sequences built into vsearch and dada2. Both need a fasta file with known sequences of known taxonomy, where the taxonomy is given in the Header line of each sequence. They format this slightly differently, but both follow a simple pattern that you easily see if you inspect these files. For this reason it is not very difficult to make your own database fasta file and use it as input to both tools below; a small sketch is shown below.
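As a hedged sketch of the two Header patterns, the accession, taxonomy and sequence below are made up for illustration only; inspect the actual database files mentioned below for the exact details. Tiny database fasta files in the two formats could be written from R like this:

# Sketch: writing tiny database fasta files in the two formats (made-up entries)
library(tidyverse)
library(microseq)
# sintax format (vsearch): taxonomy in the Header as tax=d:...,p:...,c:...,o:...,f:...,g:...
sintax.tbl <- tibble(
  Header   = "AB123456;tax=d:Bacteria,p:Firmicutes,c:Bacilli,o:Lactobacillales,f:Lactobacillaceae,g:Lactobacillus",
  Sequence = "ACGT...")   # the full 16S sequence goes here
writeFasta(sintax.tbl, out.file = "my_sintax_db.fasta")
# dada2 assignTaxonomy format: ranks separated by semicolons in the Header
dada2.tbl <- tibble(
  Header   = "Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;",
  Sequence = "ACGT...")
writeFasta(dada2.tbl, out.file = "my_dada2_db.fasta")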



7.2 Classification using vsearch

As part of the vsearch tool we may classify 16S sequences using the sintax algorithm, see https://www.drive5.com/usearch/manual/cmd_sintax.html for more details on this. This requires a 16S database formatted in a special way, see the website https://www.drive5.com/usearch/manual/tax_annot.html. We have downloaded a fasta file from this website with full length 16S sequences in the proper format for sintax, and you find it in $COURSES/BIN310/module10/rdp_16s_v16.fa. These are sequences and taxonomy from the RDP repository mentioned above.

Here is a small script for doing the classification of the centroid sequences we found by using vsearch above:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=sintax
#SBATCH --output=sintax_%j.log

##############
### Settings
###
vsearch_exe="apptainer exec $HOME/module10/vsearch_latest.sif vsearch"
database_file=$COURSES/BIN310/module10/rdp_16s_v16.fa
centroids_file=$HOME/module10/vsearch/vsearch_centroids.fasta
sintax_cutoff=0.0
out_file=vsearch/vsearch_taxonomy.txt

#####################
### Running sintax
###
$vsearch_exe \
 --sintax $centroids_file \
 --db $database_file \
 --sintax_cutoff $sintax_cutoff \
 --tabbedout $out_file

Note how we only reserve --ntasks=1; this is a small and simple job when we have so few sequences to classify. The sintax_cutoff is a threshold for how ‘confident’ the classifier should be in order to assign a taxon. I like to set it at its lowest (0.0) here. Then (almost) all sequences are given some classification, and since the confidence is also part of the output, we can read it into R and decide there what the threshold should be, without the need to re-run sintax.

Make this script in your module10 folder and sbatch to get some result file.

7.2.1 Exercise - Read results

The resulting file is a simple tab-delimited text file, without column names (inspect it). Make some R code for reading such files. Here is a list of what it should contain:

  • Read into R using read_delim() and with a tab delimiter, and name the object taxonomy.tbl. You should get 3 columns. Name them Header, taxonomy and OK, use the col_names option.
  • Select only the first two columns (discard the OK column).
  • The assigned taxonomy for each sequence is a long text. Split this into one column for each rank, i.e. the new table should have columns domain, phylum, class, order, family and genus instead of the original taxonomy. Use the function separate_wider_delim() for achieving this.
  • Get rid of the "d:" in front of all domain names, and similar for all the other ranks. Use mutate() and inside this str_remove(). Hint: Learn how to use the across() inside mutate(). This allows you to perform the same operation on several columns, and saves a lot of coding.
  • Behind each genus name (and all other ranks) there is a parenthesis with a number inside. This is the confidence score for the assignment. Mutate a new column named genus_confidence and put this value there, as a number. We could do the same for all ranks, but for now we are happy with genus only. Hint: Start with str_extract(genus, "\\(.+") inside a mutate() statement, then remove the parentheses and convert to numeric. Note that in a regular expression you need the double backslash "\\(" to specify a literal left-parenthesis symbol, since parentheses have a special meaning in regular expressions.
  • Finally, clean all the rank names by removing these parentheses (including the confidence inside).

Next, read in the centroids file (with readFasta()) and store it in sequence.tbl as we did above; it has the columns Header and Sequence. Use full_join() to join it with the taxonomy.tbl.

Finally write this table to the same file you read in at the beginning here, i.e. you replace the original version of the file with this extended one.

It would perhaps be natural to have this R code called from the shell script we made above, e.g. by putting it into its own R script and adding an Rscript line. In this way the whole process is done by one shell script each time.

7.2.2 Exercise solution

library(tidyverse)
library(microseq)

taxonomy_file <- 'vsearch/vsearch_taxonomy.txt'
centroids_file <- 'vsearch/vsearch_centroids.fasta'

taxonomy.tbl <- read_delim(taxonomy_file, delim = '\t',
                         col_names = c('Header', 'taxonomy', 'OK')) %>%
  select(1:2) %>%
  separate_wider_delim(cols = taxonomy, delim = ',',
                       names = c('domain', 'phylum', 'class', 'order', 'family', 'genus')) %>%
  mutate(across(-Header, ~str_remove(., '^[dpcofg]:'))) %>%
  mutate(genus_confidence = str_extract(genus, '\\(.+')) %>%
  mutate(genus_confidence = str_remove_all(genus_confidence, '\\(|\\)')) %>%
  mutate(genus_confidence = as.numeric(genus_confidence)) %>%
  mutate(across(-Header, ~str_remove(., '\\(.+\\)')))

taxonomy.tbl <- readFasta(centroids_file) %>%
  full_join(taxonomy.tbl, by = 'Header')

write_delim(taxonomy.tbl, delim = "\t", file = taxonomy_file)
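As a follow-up to the sintax_cutoff discussion above: once the genus_confidence column exists, we can apply whatever confidence threshold we like directly in R, without re-running sintax. A small sketch, where the 0.8 threshold is just an arbitrary example value:

# Sketch: applying a confidence threshold in R instead of re-running sintax
taxonomy.tbl <- taxonomy.tbl %>%
  mutate(genus = if_else(genus_confidence < 0.8, NA_character_, genus))   # 0.8 is an arbitrary example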



7.3 Classification using dada2

The dada2 package also has a function for doing taxonomic classification, named assignTaxonomy().

As a database fasta file, use /mnt/courses/BIN310/module10/RefSeq-RDP16S_v2_May2018.fa this time. This was downloaded from https://zenodo.org/record/2541239 and has the proper format required by the dada2 function assignTaxonomy().

Here is the R code needed for classifying the dada2 centroids we got earlier:

library(tidyverse)
library(microseq)
library(dada2)

############
### Settings
###
threads <- 10
database_file <- "/mnt/courses/BIN310/module10/RefSeq-RDP16S_v2_May2018.fa"
centroids_file <- "dada2/dada2_centroids.fasta"
out_file <- "dada2/dada2_taxonomy.txt"

########################
### Assigning taxonomy
###
sequence.tbl <- readFasta(centroids_file)
taxonomy.tbl <- assignTaxonomy(sequence.tbl$Sequence,
                               refFasta = database_file,
                               multithread = threads) %>% 
  as_tibble(rownames = "Sequence") %>% 
  full_join(sequence.tbl, by = "Sequence") %>% 
  select(Header, everything())
write_delim(taxonomy.tbl, delim = "\t", file = out_file)

Save this in a new R script (e.g. dada2_taxonomy.R).

The output from assignTaxonomy() is a matrix with one column for each taxonomic rank. The sequences are used as row names (horrible)! For this reason, we immediately convert it to a table (with as_tibble()) and specify that the row names go into a column named Sequence. Then we can directly join this with the sequence.tbl, which also has a Sequence column. In this way we get the OTU/ASV information into the table as well. Finally, we write it to a tab-separated text file.

To run this, make use of the small shell script we used when running dada2 itself. Just comment out the call to the dada2.R script, and call this script instead (Rscript dada2_taxonomy.R).

If you inspect the column names, note that the first rank here is called Kingdom. This is not strictly correct; it should be either Superkingdom or Domain. We also notice that this classifies down to the species rank. This depends on the database file used; both methods we have seen here will try to classify to species if such information is available in the database file you supply.
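If we want the column names to be consistent with the vsearch taxonomy table above (which uses lower-case domain, …, genus), a small optional tweak could be added before writing the file. This is a matter of taste, not something the script above requires:

# Optional: rename the first rank column for consistency with the vsearch table
taxonomy.tbl <- taxonomy.tbl %>%
  rename(domain = Kingdom)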

7.3.1 Exercise - compare to gold standard

In this case we know exactly which organisms were in the community. Let us compare the taxonomic classifications to this.

First, read the file /mnt/courses/BIN310/module10/gold_standard.txt into a table. From the organism_name column, collect the genus name only and put this in a new column named genus. Select this column only, keep only the distinct rows, and arrange them alphabetically.

Then, read in the taxonomy tables we produced for both vsearch and dada2 above, select the genus columns only, and compare to the gold standard. How many are correct? Are there any false positive and false negative genera?

7.3.2 Exercise solution

library(tidyverse)

gold.tbl <- read_delim("/mnt/courses/BIN310/module10/gold_standard.txt", delim = "\t") %>%
  mutate(genus = word(organism_name, 1)) %>%
  distinct(genus) %>%
  arrange(genus) %>%
  mutate(gold = genus)

vsearch_tax.tbl <- read_delim("vsearch/vsearch_taxonomy.txt", delim = "\t") %>%
  distinct(genus) %>%
  arrange(genus) %>%
  mutate(vsearch = genus)

dada2_tax.tbl <- read_delim("dada2/dada2_taxonomy.txt", delim = "\t") %>%
  distinct(Genus) %>%
  arrange(Genus) %>%
  mutate(dada2 = Genus)

all.tbl <- tibble(genus = unique(c(gold.tbl$genus, vsearch_tax.tbl$vsearch, dada2_tax.tbl$dada2))) %>%
  full_join(gold.tbl, by = "genus") %>%
  full_join(vsearch_tax.tbl, by = "genus") %>%
  full_join(dada2_tax.tbl, by = c("genus" = "Genus"))
knitr::kable(all.tbl)
genus                          gold               vsearch                        dada2
-----------------------------  ----------------   -----------------------------  -----------------------------
Bacteroides                    Bacteroides        Bacteroides                    Bacteroides
Bacteroidia                    Bacteroidia        NA                             NA
Blautia                        Blautia            Blautia                        Blautia
Clostridiaceae                 Clostridiaceae     NA                             NA
Coprococcus                    Coprococcus        Coprococcus                    Coprococcus
Dorea                          Dorea              Dorea                          Dorea
Enterococcus                   Enterococcus       Enterococcus                   Enterococcus
Escherichia                    Escherichia        NA                             NA
Faecalibacterium               Faecalibacterium   Faecalibacterium               Faecalibacterium
Flavonifractor                 Flavonifractor     Flavonifractor                 Flavonifractor
Fusicatenibacter               Fusicatenibacter   Fusicatenibacter               Fusicatenibacter
Lachnospiraceae                Lachnospiraceae    NA                             NA
Parabacteroides                Parabacteroides    Parabacteroides                Parabacteroides
Roseburia                      Roseburia          Roseburia                      Roseburia
Ruminococcus                   Ruminococcus       Ruminococcus                   Ruminococcus
[Eubacterium]                  [Eubacterium]      NA                             NA
Clostridium_XlVa               NA                 Clostridium_XlVa               NA
Clostridium_sensu_stricto      NA                 Clostridium_sensu_stricto      NA
Lachnospiracea_incertae_sedis  NA                 Lachnospiracea_incertae_sedis  NA
Lactonifactor                  NA                 Lactonifactor                  NA
Pantoea                        NA                 Pantoea                        NA
Ruminococcus2                  NA                 Ruminococcus2                  Ruminococcus2
Streptococcus                  NA                 Streptococcus                  Streptococcus
Clostridium XlVa               NA                 NA                             Clostridium XlVa
Clostridium sensu stricto      NA                 NA                             Clostridium sensu stricto
Escherichia/Shigella           NA                 NA                             Escherichia/Shigella
Lachnospiracea incertae sedis  NA                 NA                             Lachnospiracea incertae sedis

The gold standard has 16 distinct genera, and it looks like vsearch matches 10 and dada2 11 of them. But here we also see the hassle we meet when comparing names! Some names seem to be the same, but are written slightly differently (e.g. space instead of underscore). Always keep in mind that taxonomy has an element of randomness in it. In fact, I think it would have been better to only use the NCBI taxonomy numbers (tax_id) that we saw briefly in module 9. These numbers do not change as often as the actual names, and there is only one way to write them…