In this module we will use the vsearch software and the dada2 R package.
We are again faced with data from sequencing of a microbial community, but in this setting we no longer sequence whole genomes. Instead, we design primers and use PCR (Polymerase Chain Reaction) to amplify particular regions of the genomes, typically a marker gene. The most used marker is the 16S rRNA gene, which is found, and fairly conserved, in all prokaryotes, but other markers may also be used. Which region we amplify is determined entirely by which primers we use.
These amplified pieces of the genomes are called amplicons. We then sequence these amplicons directly, since they are already quite short fragments. Thus, the fragments we sequence are not from random positions on the genomes, but from specific regions (genes) that we target with our primers. This leads to data of a different nature, and most of the processing and subsequent analysis differs slightly from what we have seen before.
Why do metabarcoding? The first answer is to estimate community
composition. We did this also in Module 9, using tools like
kraken2
and bracken
on shotgun data. However,
shotgun data is not always the best approach to estimate composition.
Shotgun sequencing requires more resources, in terms of sequencing and
data processing. It is in many cases overkill to do shotgun sequencing
if we are only interested in community composition. We would rather
spend our resources sequencing many samples instead of shotgun
sequencing a few. Second, from shotgun data we only recognize taxa we
have seen before, sometimes referred to as closed reference
assignments. If we investigate new environments, and expect to detect
organisms we have not seen before, the metabarcoding approach is usually
better, since this allows open reference assignments
(clustering). We can actually discover that we have found something
new!
If we have a marker gene that we know is found in all organisms, we try to sequence only this gene. If this gene is slightly different in different organisms, we may try to group or cluster them. The number of groups should reflect the number of organisms. The abundance of each organism should correspond to the number of reads we have from each group. And this does not rely on any recognition of previously seen taxa. For this reason it is typical that we first use metabarcoding when digging into ‘unknown terrain’, i.e. microbial communities where we expect to find a number of new and unknown organisms. After we have mapped who is there, we may do shotgun sequencing of the full genomes, to learn more about the genes of the most abundant organisms.
Let us consider amplicon sequencing of the 16S rRNA gene, which is by far the most common marker gene. Ideally, it should work like this:
We have some microbial community with \(M\) taxa, and composition \(C_1, C_2,...,C_M\). Each taxon has a distinct 16S sequence. Still, all 16S genes in all taxa have some shorter regions in common, where they are identical. We design primers that match these identical regions, and amplify the same pieces of the 16S gene from all taxa. Then we sequence without any sequencing errors. The reads we get should then be of \(M\) distinct variants, and the read count for variant \(i\) should be exactly the fraction \(C_i\) of the total read counts.
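To make this ideal model concrete, here is a toy calculation with made-up numbers (not part of any script we use later):
C <- c(0.5, 0.3, 0.2)   # composition of a community with M = 3 taxa
n_reads <- 10000        # total number of reads sequenced
n_reads * C             # ideal read counts per taxon: 5000 3000 2000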
Well, the world is not perfect. In reality we have to deal with errors at several steps. There are more or less random errors, like sequencing errors, but also biases, i.e. systematic deviations. Common biases are:
It should be noted that biases like this may not always be a problem,
even if they are present. The point is that if you want to compare the
change in results due to some treatment etc, and the biases are
the same in all cases, you may still make valid conclusions about
changes taking place or not. We should, however, realize we have biased
data, and be very careful about how broadly we draw conclusions.
Let us have a short look at some PCR primers used for 16S amplification. A primer pair is two short DNA sequences that will bind to denatured DNA, so that PCR copies the region between where they bind. Thus, primer pairs always consist of a forward and a reverse primer. For a primer pair to amplify something from a certain genome, the forward primer must match exactly at some part of the genome, the reverse primer must match exactly at another part rather close by, and the region between (and including) these two matching regions is amplified. The question is whether both primers actually match something.
Here are two primer pairs used in a recent study of metabarcoding arctic environments: Comparison of Two 16S rRNA Primers (V3–V4 and V4–V5) for Studies of Arctic Microbial Communities. These are supposed to amplify two slightly different regions of the 16S gene, as suggested by the names I gave them here:
V3V4 <- c("F" = "CCTACGGGNGGCWGCAG", "R" = "GACTACHVGGGTATCTAATCC")
V4V5 <- c("F" = "GTGYCAGCMGCCGCGGTAA", "R" = "CCGYCAATTYMTTTRAGTTT")
The F
is the forward and R
is the reverse
primer in each pair.
Notice these primers contain not only A, C, G and T, but also what we call IUPAC ambiguity symbols, see https://www.bioinformatics.org/sms/iupac.html. For example, the symbol W means A or T. Thus, the primer sequences above actually represent several distinct sequences, and this makes them more universal. In the microseq package there are functions to convert from IUPAC code to a regular expression, and back again. Use iupac2regex() on the primers to get the regular expression text you may search with later:
library(microseq)
V3V4_F.regex <- iupac2regex(V3V4[1])
print(V3V4_F.regex)
## [1] "CCTACGGG[ACGT]GGC[AT]GCAG"
How many distinct sequences does this actually represent? Well, we can see one position with 4 variants and one with 2, i.e. 4*2=8 different combinations.
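If you want R to do this counting for you, here is a small helper function. It is just a sketch, not part of microseq: it splits such a regular expression into one token per primer position, and multiplies the number of alternatives at each position:
library(stringr)
count_variants <- function(regex) {
  tokens <- str_extract_all(regex, "\\[[ACGT]+\\]|[ACGT]")[[1]]   # one token per primer position
  prod(str_length(str_remove_all(tokens, "\\[|\\]")))             # multiply the choices per position
}
count_variants(V3V4_F.regex)   # should give 8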
The regular expression text above we can now use to search for a match in a 16S DNA sequence. To search with the reverse primer as well, we must first reverse-complement it and then use iupac2regex(). The reverse primer is designed to match the negative DNA strand, and as long as we only have the positive-strand 16S sequence, we must reverse-complement the reverse primer in order to see if it matches:
V3V4_R.regex <- reverseComplement(V3V4[2]) %>%
iupac2regex()
print(V3V4_R.regex)
## [1] "GGATTAGATACCC[CGT][AGT]GTAGTC"
This one obviously has 9 combinations, and together the two primers have 8*9=72 combinations.
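Using the count_variants() helper sketched above, we can verify this:
count_variants(V3V4_F.regex) * count_variants(V3V4_R.regex)   # 8 * 9 = 72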
To see if the primer pair actually has a match inside some 16S gene, we now form a regular expression of these two:
V3V4.regex <- str_c(V3V4_F.regex, ".+?", V3V4_R.regex)
print(V3V4.regex)
## [1] "CCTACGGG[ACGT]GGC[AT]GCAG.+?GGATTAGATACCC[CGT][AGT]GTAGTC"
The ".+?"
in the middle means ‘any symbols matched
number of times until first occurrence of’. Thus, by using this text as
a pattern in a text search we now look for region where we
Here we allow any number of matches between the forward and reverse primer. In reality the PCR reaction will only work for rather short fragment, and we could have set an upper limit here. We are not interested in matches thousands of bases long. On the other end, we could also see matches where the reverse primer starts to match before the forward primer has ended its match. Such cases would amplify, but is not picked up by our search above. The primers above are designed to match inside the 16S gene, and both the cases of super-long or super-short amplicons are unlikely here, and we stick to our regular expression above.
A drawback with such a search is that it requires exact matches. We know that even if the 16S sequence has a mismatch or two to the primer, it will still amplify, but poorer. Thus, an exact search is an OK estimate of the efficiency of the primers.
Let us search with the two primer pairs from above in some 16S sequences to see
We need some 16S sequence to search in. The SILVA database
is a public repository for Small Sub Unit (SSU) sequences that we may
download. The “small sub unit” means the 16S gene in prokaryotes and 18S
in eukaryotes. The full collection consists of around 2.2 million
sequences. Working with this big file is slow. In the file
$COURSES/BIN310/module10/SILVA_138.1_SSURef_tax_silva_100K.fasta
you find a random subset of 100 000 sequences that we will use here.
Make an R script where you read in the SILVA sequences. The sequences
are stored as RNA instead of DNA! This means a U in all places where
DNA has a T. First, mutate()
the Sequence
column to replace "U"
with "T"
using
str_replace_all()
. Next, extract the superkingdom
information from the Header
column and put it into a new
column. Next, filter out all eukaryotes from the data, i.e. remove
entries where the superkingdom is "Eukaryota"
. Then compute
the length of all sequences and make a histogram of the sequence
lengths. How long are 16S sequences typically?
To search with both the primer-pairs from above among these
sequences, use first the code above to create the regular expressions,
and then use str_extract()
to search in the SILVA sequences
for matches, using one primer pair at a time. This will extract the
amplicon regions, or be NA
if no matches were found. Store
these extracted sequences (amplicons) in two new columns (a
V3V4
and a V4V5
column) in the table you got
from reading the fasta file above.
How large a fraction of the total number of sequences did the primers match?
Split the matches between the superkingdoms Bacteria and Archaea. How well do the primers match the Archaea?
Finally, make histograms of the length of the amplicons you got from the two primer pairs.
library(tidyverse)
library(microseq)
# Code copied from above
V3V4 <- c("F" = "CCTACGGGNGGCWGCAG", "R" = "GACTACHVGGGTATCTAATCC")
V4V5 <- c("F" = "GTGYCAGCMGCCGCGGTAA", "R" = "CCGYCAATTYMTTTRAGTTT")
V3V4_F.regex <- iupac2regex(V3V4[1])
V3V4_R.regex <- reverseComplement(V3V4[2]) %>%
iupac2regex()
V3V4.regex <- str_c(V3V4_F.regex, ".+?", V3V4_R.regex)
# Reading silva sequences, and plotting length histogram
silva.tbl <- readFasta("SILVA_138.1_SSURef_tax_silva_100K.fasta") %>%
mutate(Sequence = str_replace_all(Sequence, "U", "T")) %>%
mutate(superkingdom = word(Header, 2)) %>%
mutate(superkingdom = word(superkingdom, 1, sep = ";")) %>%
filter(superkingdom != "Eukaryota") %>%
mutate(length = str_length(Sequence))
ggplot(silva.tbl) +
geom_histogram(aes(x = length), binwidth = 10)
# The other primer pair regular expression
V4V5.regex <- str_c(iupac2regex(V4V5[1]), ".+?", iupac2regex(reverseComplement(V4V5[2])))
# Searching, and putting the resulting amplicons into new column
silva.tbl <- silva.tbl %>%
mutate(V3V4 = str_extract(Sequence, V3V4.regex)) %>%
mutate(V4V5 = str_extract(Sequence, V4V5.regex))
cat("In total the V3V4 primers give", sum(!is.na(silva.tbl$V3V4)), "matches out of", nrow(silva.tbl), "\n")
## In total the V3V4 primers give 78054 matches out of 92281
cat("In total the V4V5 primers give", sum(!is.na(silva.tbl$V4V5)), "matches out of", nrow(silva.tbl), "\n")
## In total the V4V5 primers give 82387 matches out of 92281
# Count the number of matches for each superkingdom
sum.tbl <- silva.tbl %>%
group_by(superkingdom) %>%
summarise(n = n(), V3V4_matches = sum(!is.na(V3V4)), V4V5_matches = sum(!is.na(V4V5)))
for(i in 1:nrow(sum.tbl)){
cat(sum.tbl$superkingdom[i], ": V3V4 has", sum.tbl$V3V4_matches[i], "and V4V5 has",
sum.tbl$V4V5_matches[i], "out of", sum.tbl$n[i], "possible matches\n")
}
## Archaea : V3V4 has 16 and V4V5 has 2791 out of 3167 possible matches
## Bacteria : V3V4 has 78038 and V4V5 has 79596 out of 89114 possible matches
# Computing the amplicon lengths, and plotting their histograms
silva.tbl %>%
mutate(V3V4_length = str_length(V3V4)) %>%
mutate(V4V5_length = str_length(V4V5)) %>%
pivot_longer(cols = c(V3V4_length, V4V5_length), names_to = "primer", values_to = "amplicon_length") %>%
ggplot() +
geom_histogram(aes(x = amplicon_length, fill = primer), binwidth = 10) +
xlim(300, 500)
## Warning: Removed 24160 rows containing non-finite values (`stat_bin()`).
## Warning: Removed 4 rows containing missing values (`geom_bar()`).
We know that different organisms have a different number of 16S sequences in their genomes. We may sometimes want to correct for this, since an organism with many 16S copies will have more reads in the metabarcoding data even if the organism is not more abundant in terms of number of genomes. Let us see how we get hold of this information.
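As a minimal sketch of such a correction, with made-up numbers: two taxa that are equally abundant in terms of genomes, where taxon A has 9 times as many 16S copies as taxon B:
library(tidyverse)
toy.tbl <- tibble(taxon = c("A", "B"),
                  readcount = c(900, 100),   # raw metabarcoding read counts
                  n_16S = c(9, 1))           # 16S copies per genome
toy.tbl %>%
  mutate(corrected = readcount / n_16S) %>%            # reads per genome
  mutate(rel_abundance = corrected / sum(corrected))   # both taxa end up at 0.5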
The rrnDB is a public database where the copy numbers of
16S genes are stored. In
$COURSES/BIN310/module10/rrnDB-5.8.tsv
you find the
download file from the latest version of this database. It is, as the
name suggests, a simple tab-delimited text file.
Read this into R and select the columns containing the NCBI tax id,
the NCBI scientific name and the 16S copy numbers. Rename columns to
tax_id
, organism_name
and n_16S
,
respectively. Keep only rows where all three columns have values
(drop_na()
).
In module 9 we explored the NCBI Taxonomy database a little. Try to
add to the table above new columns listing the tax_id
values for the ranks: species, genus, family, order, class, phylum and
superkingdom for each entry. See in Module 9 how we used
read_nodes_dmp()
from the microclass
R package
to read the NCBI Taxonomy. Also, use the functions
branch_retrieve()
and branch_list2table()
from
the same package.
Compute the mean 16S copy number for all genera (hint:
group_by(genus)
and summarise()
). What are the
minimum, maximum and median of these mean values over all genera?
Here is a code skeleton to start out with:
library(tidyverse)
library(microclass)
# Reading the database file, selecting and discarding incomplete entries
rrndb.tbl <- read_delim(___, delim = "\t") %>%
select(___ = `NCBI tax id`, organism_name = ___, n_16S = ___) %>%
drop_na()
# The taxonomy of all unique tax_id's in the database
tax.tbl <- tibble(tax_id = unique(___))
nodes.tbl <- ___("/mnt/courses/BIN310/module9/NCBI/nodes.dmp")
tax.tbl <- branch_retrieve(___, ___) %>%
branch_list2table() %>%
bind_cols(tax.tbl)
# Now we join in all the taxonomy results
rrndb.tbl <- rrndb.tbl %>%
left_join(___, by = "tax_id")
# Compute some statistics for genera
genera.tbl <- ___ %>%
group_by(___) %>%
summarise(___ = mean(___))
library(tidyverse)
library(microclass)
# Reading the database file, selecting and discarding incomplete entries
rrndb.tbl <- read_delim("/mnt/courses/BIN310/module10/rrnDB-5.8.tsv", delim = "\t") %>%
select(tax_id = `NCBI tax id`, organism_name = `NCBI scientific name`, n_16S = `16S gene count`) %>%
drop_na()
# The taxonomy of all unique tax_id's in the database
tax.tbl <- tibble(tax_id = unique(rrndb.tbl$tax_id))
nodes.tbl <- read_nodes_dmp("/mnt/courses/BIN310/module9/NCBI/nodes.dmp")
tax.tbl <- branch_retrieve(tax.tbl$tax_id, nodes.tbl) %>%
branch_list2table() %>%
bind_cols(tax.tbl)
# Now we join in all the taxonomy results
rrndb.tbl <- rrndb.tbl %>%
left_join(tax.tbl, by = "tax_id")
# Compute some statistics for genera
genera.tbl <- rrndb.tbl %>%
group_by(genus) %>%
summarise(n_16S = mean(n_16S))
cat("Smallest mean copy number across genera =", min(genera.tbl$n_16S), "\n")
cat("Median mean copy number across genera =", median(genera.tbl$n_16S), "\n")
cat("Maximum copy number across genera =", max(genera.tbl$n_16S), "\n")
Below we will have a look at two procedures for processing raw metabarcoding data. In both cases we start out with a set of fastq files with metabarcoding reads. From these we try to find the representative sequences, i.e. those that represent the actual organisms in our microbial community. Since reads contain errors, this is not straightforward, and different approaches are used. Each representative sequence forms what we denote an Operational Taxonomic Unit (OTU). Next, the remaining reads become members of some OTU, and the read count for each OTU indicates its abundance in the microbial community. This grouping usually involves several steps of raw read processing, and the grouping itself is done in different ways, as we will see.
Once we have the groups or OTUs, we would also like to do a taxonomic classification, to have some taxa names attached to the various groups, if possible. However, the grouping is the main thing; taxonomy assignment is not really required for many downstream analyses.
Why do we group the reads, and then assign to taxa? Why not assign
reads to taxa directly? We have seen tools like kraken2
that would do this in no time. The short answer is, we can! If we study
a community where all involved taxa are to be found in our database,
there is really no good argument for doing the detour of a grouping. In
such cases many will argue it is better to use a (shallow) shotgun
sequencing, to avoid the biases involved in amplicon sequencing. The
point is that in many (most?) cases our database does not contain all
taxa. Amplicon studies are best suited for investigating ‘new’
communities. Then, each new OTU is potentially a new organism, even if
it does not match anything in our taxonomy database.
We will look at a couple of tools for amplicon data processing. The
first, vsearch
, represents the ‘classical’ OTU-clustering
and there are several tools out there doing more or less the same thing.
The second, dada2, represents a denoising idea, and
groups all reads whose differences are due to sequencing errors only.
There are several other tools in use, but we do not have time to explore
them all.
To save money, it is common to sequence several samples in one sequencing run. This is called multiplexing. A sequencing run may produce a huge number of reads, and it is overkill to sequence this much from every sample. Prior to sequencing, a shortish DNA sequence, denoted a barcode, is added to all fragments (amplicons). This barcode is identical for all fragments from the same sample, but distinctly different from the barcodes in other samples.
The raw reads from the sequencing machine are all mixed in one huge pool, and in order to sort out which reads are from which sample, we first group them by their barcode. These barcode sequences are usually at one or both ends of the reads, e.g. the first or last (or both) 8 bases. Sorting reads into groups by their barcode sequence is called de-multiplexing.
We will not do this step here. Often this is done automatically in
the sequencing process, and the fastq-files we get are already
de-multiplexed. If not, it is simply a matter of having a list of all
barcodes, searching for them among all the reads, and writing the reads
matching the same barcode to the same fastq-file. There are tools for
this purpose, but it is actually pretty straightforward to write your
own (R-)code for doing this. Since the barcodes are usually rather short
we must require exact matching, and searching for exact matches is
simple in all programming languages.
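Here is a minimal sketch of such a de-multiplexing in R, assuming (hypothetically) that the barcode is the first 8 bases of each read; the barcodes and file names are made up:
library(tidyverse)
library(microseq)
barcodes <- c(sample_A = "ACGTACGT", sample_B = "TTGGCCAA")
reads.tbl <- readFastq("all_reads.fastq.gz")              # the pooled reads
for (s in names(barcodes)) {
  reads.tbl %>%
    filter(str_sub(Sequence, 1, 8) == barcodes[s]) %>%    # exact barcode match
    writeFastq(str_c(s, ".fastq.gz"))                     # one file per sample
}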
Before we start computing, let us say a few words about data organization.
Metabarcoding data typically involve data from many samples. We will assume there is one pair of fastq-files for each sample (after de-multiplexing). Along with these files there should always be a file with a table, where the samples are listed in the rows. This table should typically contain a column with a short text identifying each sample, columns with the file names for each sample, plus additional columns with extra information about each sample. If samples are from various locations, time points, treatments etc, this is listed in these additional columns. We often refer to the latter as sample data or meta data.
Here is a very short example, with 2 samples only:
sample_id | R1_file | R2_file | Gut_part | Hive | Month
---|---|---|---|---|---
Mar.K2.Ile2 | Mar_K2_Ile2_R1.fastq.gz | Mar_K2_Ile2_R2.fastq.gz | Ileum | 2 | March
Jun.K5.Rec5 | Jun_K5_Rec5_R1.fastq.gz | Jun_K5_Rec5_R2.fastq.gz | Rectum | 5 | June
The first three columns are typically something we would always need,
i.e. some sample_id
with a text that uniquely identifies
each sample, and then the names of the two fastq-files (assuming we have
paired-end reads) associated with each sample. The remaining columns we
may make use of later when we come to the analysis part. In this course
we will refer to such a table as the sample.tbl
. This table
will accompany the processing and subsequent analysis of the
metabarcoding data. You may prepare this file in a spreadsheet like
Excel, store it as an .xlsx
file, and read into R using the
read_excel()
function from the readxl
package
(see exercise below).
As we go along with the read processing we will compute two additional tables. The most central one is a table of read count values for each OTU in each sample. Finally, additional information about each OTU is stored in a third table. This could be the representative sequence behind each cluster (Header and Sequence), their taxonomic assignments etc. We may summarise this as three tables: the sample table, the read count table and the sequence (OTU) table.
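Here is a toy illustration, with made-up values, of the structure of these three tables (the real tables are built later in this module):
library(tidyverse)
# The sample table: one row per sample, with metadata
sample.tbl <- tibble(sample_id = c("s1", "s2"), Month = c("March", "June"))
# The read count table: one row per OTU, one column of read counts per sample
readcount.tbl <- tibble(OTU = c("OTU_1", "OTU_2"), s1 = c(120, 3), s2 = c(5, 80))
# The sequence table: one row per OTU, with its representative sequence etc.
sequence.tbl <- tibble(OTU = c("OTU_1", "OTU_2"), Sequence = c("ACGT", "TTGA"))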
In /mnt/courses/BIN310/module10/ you find the Excel-file bin310_samples.xlsx. Make an R script in your module10 folder, where you read this file with the read_excel() function from the readxl R package, and store it in an object named sample.tbl. Then write this table to a text file in your module10 folder. Use write_delim() with tab ("\t") as the delimiter between columns. Name the file sample_table.txt. We will make use of this file below.
library(tidyverse)
library(readxl)
sample.tbl <- read_excel("/mnt/courses/BIN310/module10/bin310_samples.xlsx")
View(sample.tbl)
write_delim(sample.tbl, delim = "\t", file = "sample_table.txt")
docker containers
We have so far copied all our container software from the website https://depot.galaxyproject.org/singularity/.
What if we look for some software that is not found there? We have
previously seen how we may use conda
as an alternative. Let
us now consider another alternative.
We want a container for the software vsearch
. If you
inspect the galaxy website, you will find such containers, but what
could we do if this was not the case?
In addition to apptainer
you also find another commonly
used approach for building containers, namely docker
. The
reason we do not use docker
in BIN310 is that it requires
administrator permissions, and this we do not have as ordinary users on
an HPC. However, containers built for/by docker
also work with apptainer!
Here is a short video showing the steps described below.
To find containers for docker
look up the dockerhub website.
Here you search for the software you are looking for. Often you get more than
one hit, since different developers may have supplied containers for the
same (or similar) software. You can sort the list by most recent updates
to get the newest versions. You still may have to decide which docker
container to go for. In the case of vsearch
we select the
one supplied by the user torognes (which we happen to know is
the main developer behind vsearch
, see the GitHub site for
this software). Next, click the Tags tab, and you should get
a list of versions. The first one is the latest, and we copy the Docker Pull Command from this (the box on the right).
Then we open a Terminal and paste in the copied command. It should look something like this:
docker pull torognes/vsearch
We must now edit this command to use apptainer instead. This means we replace docker by apptainer, and add docker:// in front of the container address. It should look like
apptainer pull docker://torognes/vsearch
Just hit return, and watch the container being pulled, i.e. downloaded and built inside the folder you are in. After it has finished, you should have a .sif file there with the container! This should work in the same way as the containers we have downloaded from galaxy.
Beware that containers at dockerhub may be old or not contain exactly what the name suggests. I would in most cases prefer to copy from galaxy, but at least we now know of an alternative route to obtaining the software we need.
vsearch read processing
The vsearch software was originally an open-source version of the software usearch, but now has several additional facilities. It has been implemented to run extremely fast, making use of parallel processing in a smart way. It offers a long range of options, where a typical command line starts with vsearch followed by a range of sub-commands and options. As usual, you may execute it in the Terminal to read its help text. Also, see the GitHub site linked to at the top of this document.
The processing of the raw data involves a number of steps, and we choose to split this into two separate scripts here. The first script takes us from the raw fastq files to a set of fasta files, one for each sample. The steps we take for each sample are: merging of the read pairs, quality filtering of the merged reads, and de-replication of the reads.
Since we perform the exact same steps for each sample, we should now
repeat the code we are using, either by a for-loop or by using array
jobs. This time we choose the former. Why? In the processing steps, we
create some temporary files. If we use array jobs, then these temporary
files must have unique names, or be stored in unique folders, in order
to prevent one array job to overwrite the temporary files of another
while both are working. If we process everything sequentially (looping)
we process one sample at the time, and overwriting the files from the
previous sample is not a problem as long as the final results are in
unique files. Thus, with array jobs we could do the processing faster,
but get more temporary files to handle. Since the processing of a single
sample is quite fast anyway with vsearch
, we settle for the
slower solution using a loop and avoid the hassle with unique temporary
files. In cases where you have super-many samples, you should perhaps
consider using array jobs after all.
We also need to read the text file containing the sample table. We
produced such a text file in an exercise above, and you should have the
file sample_table.txt
inside your module10
folder. In module 7 we processed many genomes instead of many samples,
but then we also had a file where each row contained the information we
used. We make a similar solution here, using awk
to read in
such files.
You should have inspected the sample table, and found that the R1 and R2
file names do not have a path. Where are they located? You find them all
in $COURSES/BIN310/module10/fastq/
.
Here is a shell script where we only process the first sample:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=30G
#SBATCH --time=01:00:00
#SBATCH --job-name=vsearch1
#SBATCH --output=vsearch1_%j.log
#################
### Settings
###
sample_table_file=sample_table.txt
data_folder=$COURSES/BIN310/module10/fastq
vsearch_exe="apptainer exec $HOME/module10/vsearch_latest.sif vsearch"
tmp_folder=$SCRATCH/vsearch
max_error_probability=0.01
threads=10
if [ ! -d $tmp_folder ]
then
mkdir $tmp_folder
fi
#############################################
### Using awk to read the sample_table.txt
###
sample_id_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["sample_id"])}' $sample_table_file)
R1_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["R1_file"])}' $sample_table_file)
R2_file_column=$(awk -F"\t" 'NR==1{for(i=1;i<=NF;i++){f[$i] = i}}{print $(f["R2_file"])}' $sample_table_file)
n_rows=$(wc $sample_table_file | awk -vj=1 '{print $j}')
#################################
### Processing reads
###
for i in $(seq 2 2)
do
sample_id=$(echo $sample_id_column | awk -vj=$i '{print $j}')
R1_file=$(echo $R1_file_column | awk -vj=$i '{print $j}')
R2_file=$(echo $R2_file_column | awk -vj=$i '{print $j}')
echo "*************************************************************************"
echo "VSEARCH merging R1 and R2 reads..."
$vsearch_exe \
--fastq_mergepairs $data_folder/$R1_file \
--reverse $data_folder/$R2_file \
--fastqout $tmp_folder/merged.fq
echo "*************************************************************************"
echo "VSEARCH quality filtering..."
$vsearch_exe \
--fastq_filter $tmp_folder/merged.fq \
--fastq_maxee_rate $max_error_probability \
--fasta_width 0 \
--fastaout $tmp_folder/filtered.fa
rm $tmp_folder/merged.fq
echo "*************************************************************************"
echo "VSEARCH dereplicating..."
$vsearch_exe \
--derep_fulllength $tmp_folder/filtered.fa \
--minuniquesize 1 \
--strand plus \
--sizeout \
--relabel $sample_id:unique_read \
--fasta_width 0 \
--output $tmp_folder/$sample_id.fa
rm $tmp_folder/filtered.fa
done
Under Settings we set the name of the file containing the sample table,
the folder where the read files are located, and also how to execute
vsearch
using the container we pulled from dockerhub. Since
we start vsearch
many times in our code, it is convenient
to replace the long command line with a short
$vsearch_exe
.
We also specify a folder for temporary output. All output from this
script is actually temporary in the sense it is only used as input in
part 2 below. The max_error_probability
, given the value
0.01, we will come back to below.
We then read the file $sample_table_file
three times,
and collect the columns we need. Inside the loop, we collect the exact
piece of information we need from these columns. Note the loop starts at
2, not 1. This is because when reading with awk
the column
headers are in the first row, and what we usually think of as row i is
actually row i+1 in this case.
The first step where we use vsearch
is to merge read
pairs. The two reads in a pair come from each end of the genomic
fragment we sequence, as always. These fragments are now amplicons that
we have copied out of the 16S gene. By designing the primers to match at
various locations, we get amplicons of certain lengths. We saw this
above in an exercise where we did an in silico amplification.
These amplicons are typically designed to be less than two times the
read lengths, i.e. when we sequence from each end there is a region in
the middle where the two reads overlap. Due to this, we can align the
two reads, and merge them into one longer read spanning the entire
amplicon. Notice that the output from this step is a fastq-file, i.e. we
still have quality scores for the merged sequence.
In the next step we filter the reads based on the quality scores. The
option --fastq_maxee_rate
makes use of the
$max_error_probability
threshold. We know from before how
to convert quality scores to error probabilities (module 2). Having this
probability for each base, we can compute the average of these, and if
this is above the $max_error_probability
we discard the
read. Note that an error probability of 0.01 corresponds to a quality
score of 20. You may want to increase this if too many reads are
discarded. Note that after filtering we store the reads in fasta format,
since we no longer need the quality scores.
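As a quick side note (a module 2 recap), this is how quality scores convert to error probabilities, and how the averaging works:
Q <- c(30, 25, 20, 15)   # quality scores for four bases in a read
p <- 10^(-Q / 10)        # error probability per base
mean(p)                  # the average, compared against max_error_probability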
Finally, reads are de-replicated. This simply means that instead of
storing many copies of the exact same read, we just store one copy, and
then add the size
information to its Header line (fasta
format). We also use --relabel
to add the
$sample_id
text to the Header line of each read. This is
very important! Later we mix all reads from all samples, and then we
need this information to follow every read in order to keep track of
which read came from which sample.
Earlier in BIN310 we saw data from shotgun sequencing. There, it almost never happens that two reads are identical, and a de-replication would seem like a waste of time. But, for metabarcoding data, all reads come from the same region of the same gene, and for this reason reads are very often identical.
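In R terms, de-replication is essentially a counting of identical sequences. Here is a sketch of the idea (vsearch does this for us; the file name is just for illustration):
library(tidyverse)
library(microseq)
readFasta("filtered.fa") %>%            # quality-filtered reads for one sample
  count(Sequence, name = "size") %>%    # one row per unique read, with its copy number
  arrange(desc(size))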
We know that the R1- and R2-reads are listed as pairs in the fastq files, i.e. read number 1 in both files belong together, etc. When we design the primers we use to amplify our amplicons, we must test how long the amplicons will typically be. In the case of the data we use here, the amplicons are mostly in the range 410-420 bases. Then we sequence with reads of length 250 from each side (Illumina MiSeq). This means we should have an overlap region of at least 80 bases (250 + 250 - 420). Thus, the last 80 bases of the R1-read should be the same sequence as the reverse-complement of the last 80 bases of the R2-read.
We also know there are sequencing errors, and we cannot expect these
overlapping sequences to be identical each time. We also know that
sequence quality is poorer towards the end of both reads, exactly where
we try to match them! Thus, in some cases the merging will be done in a
wrong way, and in some cases not at all. These reads are then lost from
the downstream analysis. There are many options you may give to the
fastq_mergepairs
subcommand, in order to tune the merging.
When you run the script above, and inspect the log-file, you will find
vsearch
will give us some output on how this turned out.
Here is a small example:
Merging reads 100%
28331 Pairs
25734 Merged (90.8%)
2597 Not merged (9.2%)
Pairs that failed merging due to various reasons:
27 multiple potential alignments
2543 too many differences
27 alignment score too low, or score drop to high
Statistics of all reads:
250.00 Mean read length
Statistics of merged reads:
410.83 Mean fragment length
1.67 Standard deviation of fragment length
0.99 Mean expected error in forward sequences
3.84 Mean expected error in reverse sequences
1.88 Mean expected error in merged sequences
0.57 Mean observed errors in merged region of forward sequences
2.17 Mean observed errors in merged region of reverse sequences
2.74 Mean observed errors in merged region
We see that around 90% of the read pairs merged here, and that lack of merging was mainly due to too many differences in the overlap region. One obvious way to improve merging may be to trim the reads at the 3’ end, i.e. simply trim the reads at their ‘bad’ end. Since we have plenty of overlapping bases here, we can afford to do this.
Important! Before you trim, make certain you know approximately the length of the overlap region. If you trim too hard, this becomes too short, and nothing merges!
When we have merged the R1 and R2 reads, what does the quality sequence look like? In the overlapping region there must be some mix of the quality sequences from both reads. We will not dig into this here, but I recommend reading this article if you want to learn more about this.
When we look at the dada2
software below, we will return
to this merging, since this is perhaps where vsearch
and
dada2
differs the most.
Make the shell script using the code above, and sbatch it.
You should get a fasta file with the processed reads for the first
sample in sample_table.txt
.
Make an R script where you read in this fasta file, and find answers to the following questions: How many unique reads are there in this sample? What is the largest copy number (size) of any read? What is the total number of reads? Plot a histogram of the read lengths, and compute the average read length.
library(tidyverse)
library(microseq)
first.tbl <- readFasta("/mnt/SCRATCH/larssn/vsearch/sample_1A.fa") %>% # edit path
mutate(length = str_length(Sequence)) %>%
mutate(size = as.numeric(word(Header, -1, sep = "="))) %>%
arrange(desc(size))
cat("There are", nrow(first.tbl), "unique reads in this sample\n")
cat("The largest copy number is", first.tbl$size[1], "\n")
cat("The total number of reads is", sum(first.tbl$size), "\n")
fig <- ggplot(first.tbl) +
geom_histogram(aes(x = length), binwidth = 1) +
labs(x = "Amplicon length", y = "Number of unique reads")
print(fig)
cat("The average read length is", sum(first.tbl$length * first.tbl$size) / sum(first.tbl$size), "\n")
Note that the ‘typical’ amplicon length is the same as the ‘typical’ read length after merging of the read pairs. Also, note that to get a correct average read length you cannot in general just take the mean length of the unique reads! You have to weight this with how often each read occurs, i.e. the size. Imagine you have 12 children and measure their length. One is 100, one is 110 and ten of them are 120. There are 3 unique lengths (100, 110, 120), but the mean of all 12 children is not 110! The length 120 has ‘size’ 10 while the other two have ‘size’ 1…
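In R, the weighted mean for this children example is:
weighted.mean(c(100, 110, 120), w = c(1, 1, 10))   # 117.5, not 110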
Add code to the shell script from above, to trim both R1 and R2 reads
by 10 bases at their 3’ ends. See the vsearch
help on
Trimming and filtering for the correct subcommand and options to do
this. NB! You need to trim each fastq file separately. Note that this
means you need to store the output as two new fastq files, typically in the tmp_folder folder.
Why would we trim the 3’ end of the reads like this? We will discuss
this below, after we have used the dada2
to also process
the reads.
Here we only show the code that is added/changed, the rest is above:
#################
### Settings
###
trim_right_R1=10
trim_right_R2=10
#################################
### Processing reads
###
for i in $(seq 2 2)
do
sample_id=$(echo $sample_id_column | awk -vj=$i '{print $j}')
R1_file=$(echo $R1_file_column | awk -vj=$i '{print $j}')
R2_file=$(echo $R2_file_column | awk -vj=$i '{print $j}')
# The new code for trimming at 3' end (right end)...
echo "*************************************************************************"
echo "VSEARCH trimming R1 reads before merging..."
$vsearch_exe \
--fastq_filter $data_folder/$R1_file \
--fastq_stripright $trim_right_R1 \
--fastq_minlen 0 \
--fastqout $tmp_folder/trimmed_R1.fq
echo "*************************************************************************"
echo "VSEARCH trimming R2 reads before merging..."
$vsearch_exe \
--fastq_filter $data_folder/$R2_file \
--fastq_stripright $trim_right_R2 \
--fastq_minlen 0 \
--fastqout $tmp_folder/trimmed_R2.fq
# And we need to change the merging slightly, now using the trimmed_R1.fa and trimmed_R2.fq as input
echo "*************************************************************************"
echo "VSEARCH merging R1 and R2 reads..."
$vsearch_exe \
--fastq_mergepairs $tmp_folder/trimmed_R1.fq \
--reverse $tmp_folder/trimmed_R2.fq \
--fastqout $tmp_folder/merged.fq
rm $tmp_folder/trimmed_R1.fq
rm $tmp_folder/trimmed_R2.fq
#...rest as before
done
Finally, extend the last script by processing all samples instead of only the first. You should then have a fasta file with processed reads for each sample listed in the sample table.
Well, this is just a matter of extending the for-loop…
for i in $(seq 2 $n_rows)
Notice that we already have the shell variable n_rows
from earlier in the script. This contains the number of rows (lines) in
the sample table file (note the shell code we used to read this).
The final part 1 script is the one from the exercises, i.e. where we include the trimming of the 3’ end, and loop over all samples. You need to run this in order to have the files we will work more on in the next section.
We now continue with the fasta files we produced for each sample above.
The following code you may either add to the previous one, to get it all in one single (big) shell script, or you may split them and make two separate scripts that you run in sequence. Here we use the latter solution, having a separate shell script for part 2. Here is a template for this second script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=30G
#SBATCH --time=01:00:00
#SBATCH --job-name=vsearch2
#SBATCH --output=vsearch2_%j.log
##############
### Settings
###
min_size=2 # Minimum copy number of centroid sequence
OTU_identity=0.97 # identity threshold
tmp_folder=$SCRATCH/vsearch # assume fasta files from part 1 is in here
vsearch_exe="apptainer exec $HOME/module10/vsearch_latest.sif vsearch"
out_folder=$HOME/module10/vsearch # final results end here
threads=10
if [ ! -d $out_folder ]
then
mkdir $out_folder
fi
echo "*************************************************************************"
echo "Reads from all samples into one file..."
if [ -f $tmp_folder/all.fasta ]
then
rm $tmp_folder/all.fasta
fi
cat $tmp_folder/*.fa >> $tmp_folder/all.fasta
echo "*************************************************************************"
echo "VSEARCH dereplicating all sequences..."
$vsearch_exe \
--derep_fulllength $tmp_folder/all.fasta \
--minuniquesize $min_size \
--strand plus \
--sizein \
--sizeout \
--relabel_sha1 \
--fasta_width 0 \
--output $tmp_folder/all_derep_minsize.fasta
echo "*************************************************************************"
echo "VSEARCH OTU clustering..."
$vsearch_exe \
--cluster_size $tmp_folder/all_derep_minsize.fasta \
--id $OTU_identity \
--strand plus \
--sizein \
--sizeout \
--relabel OTU \
--threads $threads \
--centroids $out_folder/vsearch_centroids.fasta
echo "*************************************************************************"
echo "VSEARCH chimera filtering..."
$vsearch_exe \
--uchime_denovo $out_folder/vsearch_centroids.fasta \
--sizein \
--nonchimeras $tmp_folder/nonchimeras.fasta
mv $tmp_folder/nonchimeras.fasta $out_folder/vsearch_centroids.fasta
echo "*************************************************************************"
echo "VSEARCH assigning reads to OTU's..."
$vsearch_exe \
--usearch_global $tmp_folder/all.fasta \
--db $out_folder/vsearch_centroids.fasta \
--id $OTU_identity \
--strand plus \
--sizein \
--threads $threads \
--otutabout $out_folder/vsearch_readcounts.txt
After the settings we start by writing all reads from all samples to
one large fasta file. Reads from all samples should be considered in the
subsequent clustering. The same organisms may be seen in several
samples, and using all reads gives us higher read counts for each OTU,
and this makes the clustering easier. Needless to say, this file may
become huge if we have many samples! Luckily vsearch
is
fast, but this may be a memory limitation if you run it on your local
computer.
Next step is de-replication of all reads. Here we set the minimum
copy number to 2 (the min_size
variable). The idea is that
reads we only see once in this huge set of reads, most likely contain
sequencing errors. We do not discard them, but these reads should never
be allowed to form a separate cluster. They may later be assigned to
some cluster, though. You may increase min_size
to even
larger values.
The next step is the clustering of all de-replicated sequences. This
is the crucial step. Here we decide which reads belong together and
forms the abundance of each cluster. We typically refer to such clusters
as Operational Taxonomic Units (OTU). In principle we think of them all
as taxa, but we may not have a name for them all, and OTU indicates
this. The identity
sets the size of all clusters. The
centroid sequence of a cluster is the sequence with largest
copy number in the cluster. This is the representative sequence for that
cluster. Then, all other members must have identity above
identity
to this centroid sequence. The output from the
clustering is a fasta file with these centroid sequences (here named
vsearch_centroids.fasta
).
Then we do a chimera filtering. What is a chimera? This is
an artefact of the PCR amplification process we do before sequencing.
During this process you may get amplicons that are mixtures of two
original amplicons, i.e. the first part is from one organism and the
second part is from another. Such reads should be discarded. Here we
output the non-chimera sequences to a temporary file, and then copy this
to overwrite the file vsearch_centroids.fasta
. The latter
should now be without chimeras.
Finally, we assign all reads to the OTUs we found above. We
use the fasta file with all reads (all.fasta) that we
created above, i.e. all reads are now being considered, also those
occurring only once, or those that are chimeras. The point is that each
read is compared to the centroids, and if it is more than
identity
similar to some centroid, it becomes a member of
that OTU. In this step we now rely on each read having the
$sample_id
text in its Header
line. Thus, the
line --relabel $sample_id:unique_read
in the last part of
the first script above is crucial! Without this, it would now be
impossible to know from which sample the various reads come from. For
this reason, you cannot use the de-replicated reads in this assignment,
since then this information is lost.
The output from the assignment is a table of read counts. The first
column lists all OTUs, i.e. the text you find in the Header
lines of the centroids, and then there is one column for each sample.
The numbers are read counts, i.e. how many reads each OTU has in each
sample. Each row is an OTU, each column a sample. This is saved in the
file vsearch_readcounts.txt
as a tab-delimited text
file.
Read the file vsearch_centroids.fasta
into R and name the object
sequence.tbl
. Add a new column clustered_size
and fill it with the size
numbers you find inside the
Header
texts (size=xxx
). Then, remove this
size-information from the Header
s, leaving only the
OTU_xxx
. Sort the table by clustered_size
,
largest values first. For this particular data set we know
there are exactly 27 species in the microbial community. How does this
compare to the numbers of clusters/OTUs/taxa we got here? What might be
the reasons for any discrepancy?
library(tidyverse)
library(microseq)
sequence.tbl <- readFasta("~/module10/vsearch/vsearch_centroids.fasta") %>% # edit path"
mutate(clustered_size = str_remove(Header, "OTU[0-9]+;size=")) %>%
mutate(clustered_size = as.numeric(clustered_size)) %>%
mutate(Header = str_remove(Header, ";size=[0-9]++")) %>%
arrange(desc(clustered_size))
View(sequence.tbl)
We find there are 25 OTU’s. The reason we miss two (should be 27) could be that two of these taxa have too low abundance to be detected. Another possible explanation is that some are too similar, and fall inside the same OTU, given the 97% identity ‘radius’.
Read the file $tmp_folder/all_derep_minsize.fasta
into R
and store as demulti.tbl
. Mutate in a new column called
centroid_size
and fill this with the size
information in the Header
texts, just as you did above.
We want to add this centroid_size
information to the
sequence.tbl
from the previous exercise. Try to do this.
Hints: Select away the Header
column from
demulti.tbl
, then join the two tables by
Sequence
, and make certain you use left_join()
or right_join()
such that you only end up with the
sequences in the original sequence.tbl
from the previous
exercise (the demulti.tbl
will typically have many more
rows).
Then, make a plot of centroid_size
versus
clustered_size
and use the OTU-names instead of markers
(use geom_text()
). The plot should look something like
this:
Why are clustered_size
’s always larger than
centroid_size
’s?
Assume you run the code in the previous exercise first.
library(microseq)
demulti.tbl <- readFasta("/mnt/SCRATCH/larssn/vsearch/all_derep_minsize.fasta") %>% # edit path
mutate(centroid_size = str_remove(Header, ".+;size=")) %>%
mutate(centroid_size = as.numeric(centroid_size)) %>%
arrange(desc(centroid_size))
sequence.tbl <- demulti.tbl %>%
select(-Header) %>%
right_join(sequence.tbl, by = "Sequence")
ggplot(sequence.tbl) +
geom_text(aes(x = centroid_size, y = clustered_size, label = Header)) +
labs(x = "Centroid sizes", y = "Clustered sizes")
The centroid_size
is the number of copies of the
centroid sequence. The clustered_size
is the sum of such
sizes for all sequences clustered together in the same OTU, and this is
why it must be at least as large as the centroid_size
. Note
that some OTU’s have a rather large clustered size, but still a very
small centroid size. This means many different sequences have
been clustered.
Read the file vsearch_readcounts.txt
into R, and name it
readcount.tbl
. Rename the first column to OTU
only. Next, compute the total read count for each OTU, and store this as
a new column named total_size
. Arrange the
readcount.tbl
by this column, largest values first. Add the
total_size
data to the sequence.tbl
from
above, again by joining the readcount.tbl
with the
sequence.tbl
. What is the difference between the
centroid_size
, the clustered_size
and the
total_size
for an OTU?
Hint: The function rowSums()
will also work on a table,
as long as all columns you sum across are numeric.
Assume you run the code in the previous exercise first.
readcount.tbl <- read_delim("~/module10/vsearch/vsearch_readcounts.txt", delim = "\t") %>% # edit path
rename(OTU = `#OTU ID`) %>%
mutate(total_size = rowSums(.[,-1])) %>%
arrange(desc(total_size))
sequence.tbl <- readcount.tbl %>%
select(OTU, total_size) %>%
right_join(sequence.tbl, by = c("OTU" = "Header"))
The total_size
is even bigger than
clustered_size
because we have now assigned all reads to the
OTU’s. During the clustering we only worked with the de-replicated reads that
have a minimum size (the min_size
).
In the original sample table, we would like to add a new column named
vsearch_readpairs
, containing the total read count for each
sample (not each OTU).
Read the table into R, and from the readcount.tbl
compute the total read counts for each sample, and add this to the
sample table. Then, save this extended version of the sample table to
the file using the same file name.
Hint: The statement summarise(across(everything(), sum))
may be useful here. The summarise()
allows you to use a
function across an entire column (here we use the function
sum()
). If you want to use this function across several
columns, use it inside across()
. Inside this you need to
specify which columns, and everything()
indicates all
columns. You may in general select columns in the same way as in
select()
.
library(tidyverse)
sample.tbl <- read_delim("~/module10/sample_table.txt") # edit path
readcount.tbl <- read_delim("~/module10/vsearch/vsearch_readcounts.txt", delim = "\t") %>% # edit path
rename(OTU = `#OTU ID`)
sum.tbl <- readcount.tbl %>%
select(-OTU) %>%
summarise(across(everything(), sum)) %>%
pivot_longer(cols = everything(), names_to = "sample_id", values_to = "vsearch_readpairs")
sample.tbl <- left_join(sample.tbl, sum.tbl, by = "sample_id")
write_delim(sample.tbl, delim = "\t", file = "sample_table.txt")
dada2 pipeline
This is an alternative to the vsearch processing we did above.
The dada2
R package is one of the most difficult to
install, and for this reason the IT people have already installed it for
us. But, as we will see below, we need to load a certain module when
running this from a shell script. Installing dada2
on your
local computer is also possible, but in my experience this is not
straightforward either.
There is a dada2 tutorial
on the web that you may have a look at. In the code below I have sampled
from this, but still made things slightly different in order to be more
similar to what we did with vsearch
.
Below we will now build one single R script with the code for
processing reads with dada2
, but we will build it
step-by-step. We also make some temporary investigations below, with
code you should not have in this dada2
-script, but
in some separate R script. Pay attention to this as you go on
reading.
The first part of our dada2
R script contains some
settings, reading the sample table and then some filtering/trimming of
reads. We may also inspect the quality of the raw data. Note that this
is now an R script, not a shell script, but we try to give it a similar
structure. Put the following code into an R script and save it in your
module10
folder as dada2.R
:
library(tidyverse)
library(dada2)
library(microseq)
#############
### Settings
###
sample_table_file <- "sample_table.txt"
data_folder <- "/mnt/courses/BIN310/module10/fastq"
out_folder <- "dada2"
tmp_folder <- "/mnt/SCRATCH/larssn/dada2" # edit this
threads <- 10
trim_right <- c(10, 10) # trim reads at 3' end as we did with vsearch
max_expected_error <- c(2.5, 2.5) # EE = expected errors
if(!dir.exists(out_folder)) dir.create(out_folder) # creates out_folder if not already existing
################################
### Reading the sample table
###
sample.tbl <- read_delim(sample_table_file, delim = "\t")
############################
### Filter and trim reads
###
filt_R1_file <- file.path(tmp_folder, str_c("filtered_", sample.tbl$R1_file))
filt_R2_file <- file.path(tmp_folder, str_c("filtered_", sample.tbl$R2_file))
names(filt_R1_file) <- sample.tbl$sample_id
names(filt_R2_file) <- sample.tbl$sample_id
flt.tbl <- filterAndTrim(file.path(data_folder, sample.tbl$R1_file), filt_R1_file,
file.path(data_folder, sample.tbl$R2_file), filt_R2_file,
trimRight = trim_right,
maxEE = max_expected_error,
compress = F,
multithread = threads)
Notice that instead of looping over samples, as we did with
vsearch
, the dada2
will handle all samples in
one go. This means we also get a new set of fastq files stored in our
tmp_folder
. The names of these files (in
filt_R1_file
and filt_R2_file
), are given as
input to filterAndTrim
. Note that we added names to these
vectors! The vector filt_R1_file
contains 15 texts (file
names). By using the names()
function, we add another text
as a ‘label’ to each of these vector elements. These labels are the
sample_id
s. This is not mandatory, but by doing this all
samples will be labelled by their sample_id
below. If we
did not do this here, the file names themselves would be used as sample
identifiers. Since we have sample_id
s in our sample table
we should use them.
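Here is a tiny illustration of how names() labels the elements of a vector (the file names are made up):
files <- c("sampleA_R1.fastq.gz", "sampleB_R1.fastq.gz")
names(files) <- c("sample_A", "sample_B")
files["sample_A"]   # elements can now be looked up by their sample_id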
We need to say something about the max_expected_error
and the maxEE
option in dada2
. With
vsearch
we specified the max_error_probability
when filtering. This is the tolerated average error probability
of a read. Thus, we compute, from the quality scores, the error
probabilities, and average these over all bases in the read. Here we
instead specify the expected error. This is the error
probability multiplied by the read length, i.e. instead of averaging the
error probabilities, we sum them up. This must not exceed the value we
set in max_expected_error
. Think about this. A sum gets
larger the more elements you sum, since all probabilities are positive.
Thus, here we ‘punish’ long reads! If reads are trimmed to be super
short, they will all pass this threshold! This would not happen in
vsearch
, where we average, regardless of how long the reads
are. Here we use the max_expected_error
of 2.5, which
corresponds to an average probability of 0.01 if the reads are around
250 bases long. It is a mystery to me why dada2
uses a sum
threshold instead of considering the average.
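As a side calculation (not something to add to dada2.R), consider a read of 250 bases where every base has quality score 20:
p <- rep(10^(-20 / 10), 250)   # error probability 0.01 for each of the 250 bases
sum(p)                         # expected errors = 2.5 (dada2's maxEE)
mean(p)                        # average error probability = 0.01 (vsearch's --fastq_maxee_rate)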
Save the script above in a file named dada2.R
in your
module10
folder.
Since this is an R script, we may in principle run this directly in
RStudio. Don’t do this! The reason is that we do not
have threads or memory available in RStudio, and that dada2
is slooooow! Therefore, we make a shell script where we do nothing else
than start the R script. This allows us to sbatch
to SLURM
and run it on a computing node with the required number of threads and
memory. Here is the simple script:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=50G
#SBATCH --time=05:00:00
#SBATCH --job-name=dada2
#SBATCH --output=dada2_%j.log
##################################
### Starting the dada2 R script
###
module load GCC/11.3.0
Rscript $HOME/module10/dada2.R
Save this as run_dada2.sh
in your module10
folder.
Notice we module load GCC/11.3.0
, and we never load R!
This is a special arrangement we have to make on the BIN310 servers in
order to make the dada2
work. If you add a usual
module load R
to your script, the dada2
will
not be found!. As an ordinary user on orion you would install the
dada2
package yourself, and then load R in the usual way
when you use it. For now we will have to stick to this special solution
in order to make dada2
run on the cn-18
or
cn-19
servers.
Note that inside RStudio you can library(dada2)
as
usual, this extra loading of GCC
is only required in the
shell script.
Start the dada2
script by simply
sbatch run_dada2.sh
The dada2
package has a function for plotting read
quality scores from fastq files. Let us plot this for the first R2 file,
before and after the filtering above.
The function you use is plotQualityProfile()
. You need
to give it as input a vector with the file names of the fastq files you would like to plot. Use the first R2 file (the first entry in the R2_file column of the sample table), and the corresponding
file after filtering. Remember to add the correct paths.
Look up the help file for this function to get some explanations of the colored lines on the plot.
library(tidyverse)
library(dada2)
# The folders specified in the script above
data_folder <- "/mnt/courses/BIN310/module10/fastq"
tmp_folder <- "/mnt/SCRATCH/larssn/dada2" # edit this
# The sample table, needed for the file names below
sample.tbl <- read_delim("sample_table.txt", delim = "\t")
# The names of filtered data, from the script above
filt_R2_file <- file.path(tmp_folder, str_c("filtered_", sample.tbl$R2_file))
# Putting files into a vector, and plotting
fq.files <- c(file.path(data_folder, sample.tbl$R2_file[1]),
filt_R2_file[1])
fig <- plotQualityProfile(fq.files)
print(fig)
Put this into a separate script, since we do not want this plotting
each time we run the main dada2.R
script from earlier.
The main idea behind dada2
is to try to estimate the
level of sequencing error, and then use this for grouping the sequences.
Only sequences whose differences can be explained by sequencing errors should belong to the
same group.
First, we need to actually estimate (=learn) how the quality scores
relate to differences in the sequences, i.e. estimate the rate of
substitutions from one nucleotide to another. This step will in general
take some time, and we therefore store the resulting objects in an
RData
file that we may load later, without re-computing
them. Add the following to the dada2.R
script from
above:
#################################
### Estimating error rate models
###
err.obj_R1 <- learnErrors(filt_R1_file, multithread = threads)
err.obj_R2 <- learnErrors(filt_R2_file, multithread = threads)
save(err.obj_R1, err.obj_R2, file = file.path(tmp_folder, "error.objects.RData"))
Note that we train an error rate model for the R1 and R2 reads separately, resulting in two different error objects.
The next step is to de-replicate, which is also done separately for
R1 and R2 reads. Add to the dada2.R
script:
###########################
### De-replicating reads
###
derep.obj_R1 <- derepFastq(filt_R1_file, verbose = T)
derep.obj_R2 <- derepFastq(filt_R2_file, verbose = T)
save(derep.obj_R1, derep.obj_R2, file = file.path(tmp_folder, "derep.objects.RData"))
Again we save the resulting objects in an RData
file in
case we want to inspect them later.
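If you are curious about what a derep object actually contains, here is a small sketch for inspecting the first sample interactively (this is exploration only, not something to add to dada2.R; according to the dada2 documentation the components are uniques, quals and map):
load("/mnt/SCRATCH/larssn/dada2/derep.objects.RData")   # edit path
derep1 <- derep.obj_R1[[1]]        # the first sample
names(derep1)                      # should list uniques, quals and map
length(derep1$uniques)             # number of unique R1 sequences in this sample
head(unname(derep1$uniques))       # read count per unique sequence (the names are the sequences themselves, so we drop them)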
Finally, the denoising, which is the very essence of the
dada2
pipeline. Add to the dada2.R
script:
################
### Denoising
###
dada.obj_R1 <- dada(derep.obj_R1, err.obj_R1, multithread = threads)
dada.obj_R2 <- dada(derep.obj_R2, err.obj_R2, multithread = threads)
save(dada.obj_R1, dada.obj_R2, file = file.path(tmp_folder, "dada.objects.RData"))
The three steps above are the time-consuming ones when running dada2. With our small example data set it does not take super long, but with hundreds of samples, and many reads, this takes hours (or days). This is why we choose to store all the resulting objects in the tmp_folder.
Add the code above to the dada2.R
script from earlier,
and sbatch
again. This will take some minutes to complete.
Verify you have the three .RData
files in your
tmp_folder
.
The function plotErrors()
may be used to plot the error
structure estimated by dada2
. Load the error objects from
above (use load(<RData-file>)
) and plot each of the
two objects with plotErrors()
. Read the help text to
understand what is displayed.
library(tidyverse)
library(dada2)
# From dada2 script
tmp_folder <- "/mnt/SCRATCH/larssn/dada2"
# loading RData file and plotting
load(file.path(tmp_folder, "error.objects.RData"))
r1.fig <- plotErrors(err.obj_R1)
print(r1.fig)
r2.fig <- plotErrors(err.obj_R2)
print(r2.fig)
Again, put this into a separate script, since we do not want this
plotting each time we run the dada2.R
script.
Up to this point, all processing has been done for R1 and R2 reads separately. Only now will we merge them! Note the huge difference from the
vsearch
pipeline, where we merged the read-pairs almost
immediately. Add to the dada2.R
script:
#####################
### Merging reads
###
merged.obj <- mergePairs(dada.obj_R1, derep.obj_R1,
dada.obj_R2, derep.obj_R2,
verbose = T)
Let us now stop and inspect this, since this is a step that is somewhat difficult to follow.
The code below in this section should not be added to your dada2.R script; put it somewhere else (another R script), since it is something we do only here and now.
Let us focus on the very first sample only, with
sample_id
equal to sample_1A
. If we look into
the dada.obj_R1
and dada.obj_R2
objects from
above, we find there is one object for each sample inside them. For each sample, something called denoised is listed, among other things. This is a vector with the number of reads in each cluster. For the first sample it is:
library(tidyverse)
library(dada2)
load("/mnt/SCRATCH/larssn/dada2/dada.objects.RData") # edit path
denoised_R1_readcounts <- dada.obj_R1$sample_1A$denoised
denoised_R2_readcounts <- dada.obj_R2$sample_1A$denoised
We notice for R1 reads there are 32 clusters with varying read
counts, while for R2 there are only 21 clusters. Since R2 reads
typically have more errors, this is perhaps not surprising. The
names
of each element in these vectors are the actual
sequences, which makes them horrible to display. Let us remove these
names
and then display them:
names(denoised_R1_readcounts) <- NULL
print(denoised_R1_readcounts)
names(denoised_R2_readcounts) <- NULL
print(denoised_R2_readcounts)
## [1] 4675 2024 1741 1511 1444 1365 1338 1064 846 867 684 361 341 295 299
## [16] 266 263 236 190 136 149 105 57 77 56 46 41 21 72 150
## [31] 70 38
## [1] 2215 4954 1665 1612 1392 1701 1517 902 793 1052 219 275 251 74 908
## [16] 175 273 361 122 244 64
Notice that for the first sample the first denoised R1 cluster contains 4675 reads, while the first denoised R2 cluster contains only 2215. There are 32 clusters of denoised R1 reads, but only 21 clusters of denoised R2 reads.
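If you want the totals rather than the per-cluster counts, simply sum the vectors:
sum(denoised_R1_readcounts)   # total number of reads placed in denoised R1 clusters
sum(denoised_R2_readcounts)   # the corresponding total for R2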
Let us now run the merging only, and we load()
the
previous results from the tmp_folder
to use as input:
load("/mnt/SCRATCH/larssn/dada2/dada.objects.RData") # edit path
load("/mnt/SCRATCH/larssn/dada2/derep.objects.RData") # edit path
merged.obj <- mergePairs(dada.obj_R1, derep.obj_R1,
dada.obj_R2, derep.obj_R2,
verbose = T)
Now, if we inspect the merged.obj
we find one component
for each sample, similar to what we saw for the dada objects above. The
vector
merged_readcounts <- merged.obj$sample_1A$abundance
## [1] 4632 1986 1640 1470 1391 1317 1200 1029 821 816 684 296 272 239 236
## [16] 172 131 72 68 44 22 7 5 4
contains the read counts for each merged cluster. Note that we have 24 clusters after merging. By merging the 32 R1 clusters with the 21 R2 clusters we end up with 24 merged clusters. How is this possible?
The vectors forward
and reverse
indicate
which cluster from R1 and R2 have been merged to form the resulting
merged clusters:
R1.idx <- merged.obj$sample_1A$forward
print(R1.idx[1:5]) # inspect the merging of first 5 clusters
R2.idx <- merged.obj$sample_1A$reverse
print(R2.idx[1:5]) # inspect the merging of first 5 clusters
## [1] 1 2 3 4 5
## [1] 2 1 3 6 5
We see that cluster 1 from R1 has been merged with cluster 2 from R2.
We saw from denoised_R1_readcounts
above that R1 cluster
number 1 (denoised_R1_readcounts[1]
) has 4675 reads, while
the corresponding R2 cluster (denoised_R2_readcounts[2]
)
has 4954. After merging it is 4632 reads. What happened to the leftover
reads from both clusters (R1 and R2)? This seems like a strange way of
merging.
If we look through the R2.idx vector, we notice that some clusters are used in more than one merging, e.g. cluster 3 from R2 is merged with three different R1 clusters!
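We can verify this by tabulating how often each cluster index occurs, and by peeking at the full merging table; a small sketch (the overlap statistics columns nmatch and nmismatch are named as in the mergePairs() documentation):
table(R1.idx)    # how many merged clusters each R1 cluster takes part in
table(R2.idx)    # the same for R2, e.g. cluster 3 should show up three times here
# The merging table also reports some overlap statistics for each merged cluster
head(merged.obj$sample_1A[, c("abundance", "forward", "reverse", "nmatch", "nmismatch")])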
The argument for merging after denoising is that we get a safer merging once both the R1 and R2 reads have been denoised, i.e. freed from sequencing errors. But is this really the case? The denoising is not perfect. We must expect to lose some reads in this step, but that was also the case when merging was done early, as in the vsearch pipeline. However, the big issue with the dada2 merging, and what seems strange, is that we seem to lose all information about read pairs. When we start out, we know that the R1 and R2 reads are arranged in matching order in the fastq files. There is no such information left in the objects we use for merging here. It feels like we are not using an obvious piece of information, but whether this has a substantial impact takes some effort to investigate.
Let us for now accept that the merging in dada2
is a
little strange, and return to the script for processing the reads.
The readcount table is made from the merged object using the
makeSequenceTable()
, and there is also a function for
filtering chimeras right away. Add this to the dada2.R
script from above:
#############################################
### Readcount matrix and chimera filtering
###
readcount.mat <- makeSequenceTable(merged.obj)
readcount.mat <- removeBimeraDenovo(readcount.mat,
method = "consensus",
multithread = threads,
verbose = T)
The reason we name the resulting object readcount.mat
,
and not readcount.tbl
, is that this is a matrix, not a
table! It is also flipped/transposed compared to the table we made in
our vsearch
pipeline, i.e. samples are in the rows and
OTUs, or ASVs as they are denoted in dada2
language, are in
the columns. Also, instead of giving a short text identifier to each
ASV, the sequence itself is the ‘name’ of it! This makes it horrible to
inspect!
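If you want to convince yourself about this orientation, you could temporarily add a few lines like these right after the chimera filtering in the script; the output then shows up in the dada2_*.log file from SLURM:
print(dim(readcount.mat))                       # samples in rows, ASVs in columns
print(rownames(readcount.mat))                  # the sample names
print(substr(colnames(readcount.mat), 1, 30))   # the ASV 'names' are the sequences themselves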
Let us immediately change this to a better format, and instead put
the sequences into a sequence.tbl
like we did for
vsearch
. Add this to the dada2.R
script:
##########################
### The sequence table
###
sequence.tbl <- tibble(Header = str_c("ASV_", 1:ncol(readcount.mat)),
Sequence = colnames(readcount.mat))
writeFasta(sequence.tbl, out.file = file.path(out_folder, "dada2_centroids.fasta"))
########################
### The readcount table
###
colnames(readcount.mat) <- sequence.tbl$Header
readcount.tbl <- t(readcount.mat) %>%
as_tibble(rownames = "ASV")
write_delim(readcount.tbl, delim = "\t", file = file.path(out_folder, "dada2_readcounts.txt"))
where we also transposed the matrix to have the ASVs in the rows,
just to have it in the same orientation as we did for
vsearch
.
To summarise, the dada2
also results in a
sequence.tbl
and a readcount.tbl
having the
same structure as with vsearch
, but both sequences and the
read count numbers will differ (slightly).
Having the complete script, run the dada2
again (using
the run_dada2.sh
script) and verify you get a folder named
dada2/
in your module10
folder, with a
centroid file and a read count table file, just as for
vsearch
.
After we used vsearch
we added a column to the sample
table, containing the number of read pairs in each sample after
processing (vsearch_readpairs
, see above).
Read the dada2 readcount table into R, compute the corresponding total read counts for each sample, and add this to the sample table as the column dada2_readpairs. Then, save this extended version of the sample table to the file, using the same file name.
Compare the numbers from vsearch and dada2. Is there a difference in how many reads ‘survive’ the two pipelines?
library(tidyverse)
sample.tbl <- read_delim("~/module10/sample_table.txt") # edit path
readcount.tbl <- read_delim("~/module10/dada2/dada2_readcounts.txt", delim = "\t") # edit path
sum.tbl <- readcount.tbl %>%
select(-ASV) %>%
summarise(across(everything(), sum)) %>%
pivot_longer(cols = everything(), names_to = "sample_id", values_to = "dada2_readpairs")
sample.tbl <- left_join(sample.tbl, sum.tbl, by = "sample_id")
write_delim(sample.tbl, delim = "\t", file = "~/module10/sample_table.txt") # edit path, same file as we read above
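To put the two pipelines side by side (assuming the vsearch_readpairs column was added to the sample table earlier, as described above), a small sketch:
sample.tbl %>%
  select(sample_id, vsearch_readpairs, dada2_readpairs) %>%
  mutate(difference = vsearch_readpairs - dada2_readpairs) %>%
  print(n = Inf)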
The data set we have been using here is special in the sense that we happen to know exactly what the actual amplicons look like. This is of course rarely the case, but here it allows us to compare the centroid sequences found by vsearch and dada2 to the ‘gold standard’ sequences and see how close to the ‘truth’ they came.
In the file $COURSES/BIN310/module10/amplicons.fasta
you
find the actual amplicons. Read this file into R and inspect it. You
find there are 27 different species, but some species have more than one
variant of the 16S amplicon. This is not uncommon. When we say that a
genome has several ‘copies’ of the 16S, they are not always identical
copies, but may have some small variation. Here we have in total 38
variants from the 27 species.
Read also the centroid (fasta) files we got from vsearch
and dada2
into R. Compute how many of their sequences have
an exact match to some of the amplicons. Hint: Use the
match()
function in R.
library(tidyverse)
library(microseq)
### Read the sequences
amplicon.tbl <- readFasta("/mnt/courses/BIN310/module10/amplicons.fasta")
vsearch.tbl <- readFasta("vsearch/vsearch_centroids.fasta")
dada2.tbl <- readFasta("dada2/dada2_centroids.fasta")
## The vsearch comparison
vsearch.hit <- match(vsearch.tbl$Sequence, amplicon.tbl$Sequence)
cat("vsearch has", sum(!is.na(vsearch.hit)), "perfect matches to amplicons\n")
cat("The hits are:\n")
print(sort(amplicon.tbl$Header[vsearch.hit[!is.na(vsearch.hit)]]))
## The dada2 comparison
dada2.hit <- match(dada2.tbl$Sequence, amplicon.tbl$Sequence)
cat("dada2 has", sum(!is.na(dada2.hit)), "perfect matches to amplicons\n")
cat("The hits are:\n")
print(sort(amplicon.tbl$Header[dada2.hit[!is.na(dada2.hit)]]))
### Amplicons without perfect matches
with.hits <- sort(unique(c(vsearch.hit, dada2.hit)))
without.hits <- which(!(1:nrow(amplicon.tbl) %in% with.hits))
cat("Amplicons without hits:\n")
print(amplicon.tbl$Header[without.hits])
Remember that here we only consider perfect matches. It is likely that those OTUs/ASVs without perfect matches have a high similarity to some of the amplicons listed at the end above. To reveal this, we would have to do some BLASTing, like in assignment 2, or something along those lines.
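As a quick-and-dirty screening, and not a replacement for a proper tool like BLAST, here is a sketch using base R's adist() (edit distance) to find the closest amplicon for each vsearch centroid that lacks a perfect match; the same can be done for the dada2 centroids:
# Closest amplicon, by edit distance, for centroids without a perfect match
no.hit <- which(is.na(vsearch.hit))
if (length(no.hit) > 0) {
  d.mat <- adist(vsearch.tbl$Sequence[no.hit], amplicon.tbl$Sequence)
  tibble(Centroid = vsearch.tbl$Header[no.hit],
         Closest_amplicon = amplicon.tbl$Header[apply(d.mat, 1, which.min)],
         Edit_distance = apply(d.mat, 1, min)) %>%
    print(n = Inf)
}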
Whether we use vsearch
or dada2
or any
other procedure, the 16S amplicon data processing results in some
sequence.tbl
with all the representative sequences for each
OTU/ASV, and a corresponding readcount.tbl
. Even if it is
not strictly necessary, it is customary to assign some taxonomic labels
to each of these OTUs or ASVs, based on their sequences in the
sequence.tbl
.
We know from module 9 that in order to assign taxonomy to some sequence, we need some database of previously known sequences and their corresponding taxonomy. There are some huge repositories for 16S data, e.g. SILVA, RDP or Greengenes. Note that while both SILVA and RDP contain full-length 16S sequences, Greengenes2 has only amplicons. Thus, to use the latter you need to have amplified more or less the same region of the 16S as those in the database.
Again we face the potential problems due to a lack of an official taxonomy for prokaryotes. You will find that these databases may contain the same sequences, but assign them to different taxon names. Here is an article discussing this issue. Also, the naming changes over time and this makes the entire taxonomy problematic. Remember this when you work with taxonomy of prokaryotes.
Since our metabarcoding data cover only a smallish part of the 16S gene, it is not possible to discriminate between very closely related organisms. The 16S gene is very conserved, and even if two genomes differ a lot in their full genome content, they may still be extremely similar in their 16S genes. For this reason, it is difficult to classify such sequences reliably below the genus rank. We may be able to separate species in some cases, but this also depends on how long our amplicons are, and which part of the 16S gene we have amplified.
Below we will use some tools for classification of 16S sequences built into vsearch and dada2. Both need a fasta file with known sequences of known taxonomy, where the taxonomy is given in the Header line of each sequence. They format this slightly differently, but both follow a simple pattern that you easily see if you inspect these files. For this reason it is not very difficult to make your own database fasta file and use it as input to both tools below.
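Just to illustrate the idea, here are two made-up example Header lines (not copied from the actual files, so inspect the real files for the exact layout): the first follows the sintax pattern with a tax= annotation, the second the semicolon-separated pattern used by the dada2 assignTaxonomy() training files:
>some_id;tax=d:Bacteria,p:Firmicutes,c:Bacilli,o:Lactobacillales,f:Streptococcaceae,g:Streptococcus;
>Bacteria;Firmicutes;Bacilli;Lactobacillales;Streptococcaceae;Streptococcus;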
vsearch
As part of the vsearch
tool we may classify 16S
sequences using the sintax
algorithm, see https://www.drive5.com/usearch/manual/cmd_sintax.html
for more details on this. This requires a 16S database formatted in a
special way, see the website https://www.drive5.com/usearch/manual/tax_annot.html.
We have downloaded a fasta file from this website with full length 16S
sequences in the proper format for sintax
, and you find it
in $COURSES/BIN310/module10/rdp_16s_v16.fa
. These are
sequences and taxonomy from the RDP repository mentioned above.
Here is a small script for doing the classification of the centroid
sequences we found by using vsearch
above:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=1
#SBATCH --mem=10G
#SBATCH --time=01:00:00
#SBATCH --job-name=sintax
#SBATCH --output=sintax_%j.log
##############
### Settings
###
vsearch_exe="apptainer exec $HOME/module10/vsearch_latest.sif vsearch"
database_file=$COURSES/BIN310/module10/rdp_16s_v16.fa
centroids_file=$HOME/module10/vsearch/vsearch_centroids.fasta
sintax_cutoff=0.0
out_file=vsearch/vsearch_taxonomy.txt
#####################
### Running sintax
###
$vsearch_exe \
--sintax $centroids_file \
--db $database_file \
--sintax_cutoff $sintax_cutoff \
--tabbedout $out_file
Note how we only reserve --ntasks=1; this is a small and simple job when we have so few sequences to classify. The sintax_cutoff is a threshold for how ‘confident’ the algorithm should be in order to classify. I like to set this at its lowest (0.0) here. Then (almost) all sequences are given some classification, the confidence is also part of the output, and we can read this into R and decide there what the threshold should be, without the need to re-run sintax.
Make this script in your module10
folder and
sbatch
to get some result file.
The resulting file is a simple tab-delimited text file, without column names (inspect it). Make some R code for reading such files. Here is a list of what it should do:
1. Read the file with read_delim() and a tab delimiter, and name the object taxonomy.tbl. You should get 3 columns. Name them Header, taxonomy and OK, using the col_names option.
2. Keep only the first two columns (we do not need the OK column).
3. Split the taxonomy column into the columns domain, phylum, class, order, family and genus instead of the original taxonomy. Use the function separate_wider_delim() for achieving this.
4. Remove the "d:" in front of all domain names, and similar for all the other ranks. Use mutate() and inside this str_remove(). Hint: Learn how to use across() inside mutate(). This allows you to perform the same operation on several columns, and saves a lot of coding.
5. Each rank name has its confidence value in parentheses behind it. Extract the confidence behind the genus name, make a new column named genus_confidence and put this value there, as a numerical. We could do similar for all ranks, but for now we are happy with genus only. Hint: Start with str_extract(genus, "\\(.+") inside a mutate() statement, then remove the parentheses and convert to numeric. Note that in a regular expression you need to use the double backslash "\\(" to specify a left-parenthesis symbol, since parentheses have a special meaning in regular expressions.
6. Finally, remove the parenthesized confidence values from all the rank columns (again across() is handy).
Next, read in the centroids file (with readFasta()) and store it in a sequence.tbl as we did above; it has the columns Header and Sequence. Use full_join() to join this with the taxonomy.tbl.
Finally write this table to the same file you read in at the beginning here, i.e. you replace the original version of the file with this extended one.
It would perhaps be natural to add this R code to the shell script we made above. In this way the whole process is done inside one shell script each time.
library(tidyverse)
library(microseq)
taxonomy_file <- 'vsearch/vsearch_taxonomy.txt'
centroids_file <- 'vsearch/vsearch_centroids.fasta'
taxonomy.tbl <- read_delim(taxonomy_file, delim = '\t',
col_names = c('Header', 'taxonomy', 'OK')) %>%
select(1:2) %>%
separate_wider_delim(cols = taxonomy, delim = ',',
names = c('domain', 'phylum', 'class', 'order', 'family', 'genus')) %>%
mutate(across(-Header, ~str_remove(., '^[dpcofg]:'))) %>%
mutate(genus_confidence = str_extract(genus, '\\(.+')) %>%
mutate(genus_confidence = str_remove_all(genus_confidence, '\\(|\\)')) %>%
mutate(genus_confidence = as.numeric(genus_confidence)) %>%
mutate(across(-Header, ~str_remove(., '\\(.+\\)')))
taxonomy.tbl <- readFasta(centroids_file) %>%
full_join(taxonomy.tbl, by = 'Header')
write_delim(taxonomy.tbl, delim = "\t", file = taxonomy_file)
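As mentioned above, having the confidence values in the table means we can impose a threshold afterwards, without re-running sintax. A small sketch, where the 0.8 threshold is just an arbitrary example value:
min_confidence <- 0.8   # example threshold, choose your own
taxonomy.tbl %>%
  mutate(genus = if_else(genus_confidence < min_confidence, NA_character_, genus)) %>%
  count(unclassified = is.na(genus))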
dada2
The dada2
package also has a function for doing
taxonomic classification, named assignTaxonomy()
.
As a database fasta file, use
/mnt/courses/BIN310/module10/RefSeq-RDP16S_v2_May2018.fa
this time. This was downloaded from https://zenodo.org/record/2541239 and has the proper
format required by the dada2
function
assignTaxonomy()
.
Here is the R code needed for classifying the dada2
centroids we got earlier:
library(tidyverse)
library(microseq)
library(dada2)
############
### Settings
###
threads <- 10
database_file <- "/mnt/courses/BIN310/module10/RefSeq-RDP16S_v2_May2018.fa"
centroids_file <- "dada2/dada2_centroids.fasta"
out_file <- "dada2/dada2_taxonomy.txt"
########################
### Assigning taxonomy
###
sequence.tbl <- readFasta(centroids_file)
taxonomy.tbl <- assignTaxonomy(sequence.tbl$Sequence,
refFasta = database_file,
multithread = threads) %>%
as_tibble(rownames = "Sequence") %>%
full_join(sequence.tbl, by = "Sequence") %>%
select(Header, everything())
write_delim(taxonomy.tbl, delim = "\t", file = out_file)
Save this in a new R script (e.g. dada2_taxonomy.R).
The output from assignTaxonomy() is a matrix with one column for each taxonomic rank. The sequences are used as row names (horrible)! For this reason, we immediately convert it to a table (with as_tibble()) and specify that the row names go into a column named Sequence. Then we can directly join this with the sequence.tbl, which also has a Sequence column. In this way we get the OTU or ASV information into the table as well. Finally, we write this to a tab-separated text file.
To run this, make use of the small shell script we used when running
dada2
itself. Just comment out the call to the
dada2
R script, and call upon this instead
(Rscript dada2_taxonomy.R
).
If you inspect the column names, note that here the first rank is called Kingdom. This is not strictly correct; it should be either Superkingdom or Domain. We also notice this classifies down to the species rank. This depends on the database file used; both methods we have seen here will try to classify to species if such information is available in the database files you supply.
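If you want the rank names to be consistent with the vsearch taxonomy table, you could simply rename this column, e.g. right before writing the file in the script above; a one-line sketch:
taxonomy.tbl <- rename(taxonomy.tbl, domain = Kingdom)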
In this case we know exactly which organisms were in the community. Let us compare the taxonomic classifications to this.
First, read the file
/mnt/courses/BIN310/module10/gold_standard.txt
into a
table. From the organism_name
column, collect the
genus name only and put this in a new column named
genus
. Select this column only, keep only the distinct rows, and arrange them alphabetically.
Then, read in the taxonomy tables we produced for both
vsearch
and dada2
above, select the
genus
columns only, and compare to the gold standard. How
many are correct? Are there any false positive or false negative genera?
library(tidyverse)
gold.tbl <- read_delim("/mnt/courses/BIN310/module10/gold_standard.txt", delim = "\t") %>%
mutate(genus = word(organism_name, 1)) %>%
distinct(genus) %>%
arrange(genus) %>%
mutate(gold = genus)
vsearch_tax.tbl <- read_delim("vsearch/vsearch_taxonomy.txt", delim = "\t") %>%
distinct(genus) %>%
arrange(genus) %>%
mutate(vsearch = genus)
dada2_tax.tbl <- read_delim("dada2/dada2_taxonomy.txt", delim = "\t") %>%
distinct(Genus) %>%
arrange(Genus) %>%
mutate(dada2 = Genus)
all.tbl <- tibble(genus = unique(c(gold.tbl$genus, vsearch_tax.tbl$vsearch, dada2_tax.tbl$dada2))) %>%
full_join(gold.tbl, by = "genus") %>%
full_join(vsearch_tax.tbl, by = "genus") %>%
full_join(dada2_tax.tbl, by = c("genus" = "Genus"))
knitr::kable(all.tbl)
| genus | gold | vsearch | dada2 |
|---|---|---|---|
| Bacteroides | Bacteroides | Bacteroides | Bacteroides |
| Bacteroidia | Bacteroidia | NA | NA |
| Blautia | Blautia | Blautia | Blautia |
| Clostridiaceae | Clostridiaceae | NA | NA |
| Coprococcus | Coprococcus | Coprococcus | Coprococcus |
| Dorea | Dorea | Dorea | Dorea |
| Enterococcus | Enterococcus | Enterococcus | Enterococcus |
| Escherichia | Escherichia | NA | NA |
| Faecalibacterium | Faecalibacterium | Faecalibacterium | Faecalibacterium |
| Flavonifractor | Flavonifractor | Flavonifractor | Flavonifractor |
| Fusicatenibacter | Fusicatenibacter | Fusicatenibacter | Fusicatenibacter |
| Lachnospiraceae | Lachnospiraceae | NA | NA |
| Parabacteroides | Parabacteroides | Parabacteroides | Parabacteroides |
| Roseburia | Roseburia | Roseburia | Roseburia |
| Ruminococcus | Ruminococcus | Ruminococcus | Ruminococcus |
| [Eubacterium] | [Eubacterium] | NA | NA |
| Clostridium_XlVa | NA | Clostridium_XlVa | NA |
| Clostridium_sensu_stricto | NA | Clostridium_sensu_stricto | NA |
| Lachnospiracea_incertae_sedis | NA | Lachnospiracea_incertae_sedis | NA |
| Lactonifactor | NA | Lactonifactor | NA |
| Pantoea | NA | Pantoea | NA |
| Ruminococcus2 | NA | Ruminococcus2 | Ruminococcus2 |
| Streptococcus | NA | Streptococcus | Streptococcus |
| Clostridium XlVa | NA | NA | Clostridium XlVa |
| Clostridium sensu stricto | NA | NA | Clostridium sensu stricto |
| Escherichia/Shigella | NA | NA | Escherichia/Shigella |
| Lachnospiracea incertae sedis | NA | NA | Lachnospiracea incertae sedis |
The gold standard has 16 distinct genera, and it looks like vsearch and dada2 each match 11 of them. But here we also see the hassle we run into when comparing names! Some names seem to be the same, but have been written slightly differently (e.g. space instead of underscore). Always keep in mind that taxonomy has an element of randomness in it. In fact, I think it would have been better to only use the NCBI taxonomy numbers (tax_id) that we saw briefly in module 9. These numbers do not change as often as the actual names, and there is only one way to write them…