We continue what we started in Module 8, and focus on metagenome data. This means we have sequenced all DNA in a sample from some environment that we are interested in. The reads we have in our fastq files are not from only one organism, but from several. We do not know how many, nor how much each of them contributes to the total pool of DNA. We saw in Module 8 that if we try to re-construct the genomes in such a metagenome, we can typically only re-construct the most abundant ones, and even these may not be very accurately re-constructed. But even if we do not have enough reads to re-construct the genomes, we may still detect what kind of bacteria they are. Which bacteria are in our sample, and how abundant are they? These are the questions we now focus on. In order to answer them we need to do taxonomic classification of our data. We aim to

- understand the ideas behind the kraken2 and bracken tools.
- run the kraken2 and bracken tools.
- try the alternative kaiju tool.
- use gtdbtk to classify entire genomes, instead of reads.
As briefly mentioned in module 8, once we have sequenced a metagenome we are often interested in the composition of the microbial community, i.e. which organisms are there, and in what quantities. In many studies this information about the composition is of primary interest.
In all such studies we need to put some name or label on each of the various organisms. A taxonomy is a way of organizing biology in a hierarchical structure and giving names to the various organisms. There is no official prokaryotic taxonomy! Still, there is a consensus about the major structure: Domain, Phylum, Class, Order, Family, Genus and Species.
Each ‘level’ in the taxonomy is referred to as a taxonomic rank. All named entities at all taxonomic ranks are taxa, i.e. a taxon may be a species but also a phylum.
When we talk about taxonomic classification we mean putting a read (or read-pair, or contig or full genome) into this taxonomy by assigning it to some taxon. We could for example consider an assembled genome, and say “this is a genome from the species Bacillus cereus”, or we could consider a read-pair and state “these reads come from the phylum Firmicutes”.
It should be noted that in biology, the word “classification” usually refers to taxonomic classification, while in statistics/machine learning, classification means assigning any objects to some pre-defined categories, and is often synonymous with pattern recognition. Thus, the data scientist's/statistician's meaning of ‘classification’ is more general, and need not have anything to do with taxonomy. But in both cases we classify by assigning some object to some pre-defined class (=category).
As mentioned above, there is no official taxonomy for prokaryotes. However, most people relate to the NCBI Taxonomy that we find here. From the website you can search and explore the taxonomy, but when we have huge data sets we cannot rely on manual clicking through web-pages.
Let us first spend some time getting to know the NCBI Taxonomy database content, and how it is organized. This is not necessary for using the software tools we will look at below, but it gives us some insight into the taxonomy that may come in handy for processing the results we get from the classifiers, or for modifying their databases.
The essential content of the NCBI Taxonomy database can be downloaded from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz. I have downloaded this with the command (you do not need to do this):
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
The file taxdump.tar.gz that I downloaded is a compressed archive, similar to a .zip file. I put this file in /mnt/courses/BIN310/module9/NCBI. We extract the archive using the tar command in UNIX:
tar xvf taxdump.tar.gz
This will extract a number of files and store them all in the same folder as the archive. Have a look at the content of the folder $COURSES/BIN310/module9/NCBI to see the files we unpacked (use ls on this folder). There is a readme.txt file here, and you may have a look at it with:
cat $COURSES/BIN310/module9/NCBI/readme.txt
This archive contains some files representing the entire database. There are two files we are especially interested in:
names.dmp
nodes.dmp
Together, these two files describe the entire NCBI Taxonomy tree! Let us have a look at their content in a Terminal window (they are too large to open in RStudio):
head $COURSES/BIN310/module9/NCBI/names.dmp # head displays the first 10 lines of text
1 | all | | synonym |
1 | root | | scientific name |
2 | Bacteria | Bacteria <bacteria> | scientific name |
2 | bacteria | | blast name |
2 | eubacteria | | genbank common name |
2 | Monera | Monera <bacteria> | in-part |
2 | Procaryotae | Procaryotae <bacteria> | in-part |
2 | Prokaryotae | Prokaryotae <bacteria> | in-part |
2 | Prokaryota | Prokaryota <bacteria> | in-part |
2 | prokaryote | prokaryote <bacteria> | in-part |
We notice this looks like a table, with columns separated by the symbol |. The other file, nodes.dmp, is similar, but has many more columns:
head $COURSES/BIN310/module9/NCBI/nodes.dmp
1 | 1 | no rank | | 8 | 0 | 1 | 0 | 0|0 | 0 | 0 | |
2 | 131567 | superkingdom | | 0 | 0 | 11 | 0 |0| 0 | 0 | 0 | |
6 | 335928 | genus | | 0 | 1 | 11 | 1 | 0|1 | 0 | 0 | |
7 | 6 | species | AC | 0 | 1 | 11 | 1 | 0|1 | 1 | 0 | |
9 | 32199 | species | BA | 0 | 1 | 11 | 1 | 0|1 | 1 | 0 | |
10 | 1706371 | genus | | 0 | 1 | 11 | 1 | 0|1 | 0 | 0 | |
11 | 1707 | species | CG | 0 | 1 | 11 | 1 | 0|1 | 1 | 0 | effective current name; |
13 | 203488 | genus | | 0 | 1 | 11 | 1 | 0|1 | 0 | 0 | |
14 | 13 | species | DT | 0 | 1 | 11 | 1 | 0|1 | 1 | 0 | |
16 | 32011 | genus | | 0 | 1 | 11 | 1 | 0|1 | 0 | 0 | |
Let us read these files into R, to inspect them more properly. Unfortunately, even if the files should have tab-separated columns, the standard functions in R have problems reading them (they only read parts of the files). To overcome this I have made some functions for reading/writing such files. These are in an R package called microclass, but note that this is not the version you find on CRAN; make certain you have installed the latest course version. In this package you should find the functions read_names_dmp() and read_nodes_dmp().
The names.dmp table
The human gut species Ruminococcus gnavus has some impact on human health. Let us look it up, and get to know the NCBI Taxonomy.

- Make an R script and read the names.dmp file using the read_names_dmp() function from microclass (it takes some seconds to read). Store it in the object names.tbl. There should be 4 columns. The tax_id is a unique integer for each taxon. How many unique tax_id are there? Note that names are not unique to each taxon, the same taxon (=tax_id) may have several names listed!
- Filter the table to keep only the rows with "scientific name" in the name_class column. How many rows are there now?
- Filter to keep the rows where Ruminococcus occurs in the name_txt column. Use str_detect() and store in the table rum.tbl. How many rows are there?
- From this table, find the tax_id of the species we are looking for. Then filter the names.tbl again, but for this tax_id instead.

Add this to your R script:
library(microclass)
names.tbl <- read_names_dmp("/mnt/courses/BIN310/module9/NCBI/names.dmp")
cat("The names.dmp file has", nrow(names.tbl), "rows\n")
cat("There are", length(unique(names.tbl$tax_id)), "unique tax_id's\n")
names.tbl <- names.tbl %>%
filter(name_class == "scientific name")
cat("The table with only scientific names has", nrow(names.tbl), "entries\n")
rum.tbl <- names.tbl %>%
filter(str_detect(name_txt, "Ruminococcus"))
head(rum.tbl)
tax_id_33038.tbl <- names.tbl %>%
filter(tax_id == 33038)
cat("The proper scientific name is", tax_id_33038.tbl$name_txt[1])
Let us see how we can get a table listing the taxonomy of all taxa in the clade descending from a particular taxon.
Extend the script above by also reading the file
nodes.dmp
into R in a similar way as above, store in
nodes.tbl
.
Notice only 3 columns were read, since these hold the essential
information. Inspect the table in R. Note there is one row for each
tax_id
(compare the number of rows to the number of unique
tax_id
s in the previous exercise). For each
tax_id
there is a rank
and also the
parent_tax_id
. This is the tax_id
of the taxon
at the rank above, i.e. the ‘parent’ in the hierarchy.
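As a small illustration (a sketch, assuming the nodes.tbl just read, and that the dump files match those used in this module), we can follow a parent pointer one step up the tree. From the previous exercise we know the tax_id 33038; its parent should then be the genus with tax_id 2316020, which we will meet again below:
nodes.tbl %>%
  filter(tax_id == 33038)     # the species, note its rank and parent_tax_id
nodes.tbl %>%
  filter(tax_id == 2316020)   # its parent, one rank above (a genus)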
The microclass
package also has some functions for
handling the NCBI taxonomy. Let us explore this briefly, by adding more
code to the R script:
Retrieve the branch ending at tax_id
33038 by
branch.lst <- branch_retrieve(33038, nodes.tbl)
Inspect the list. Notice the name above each tax_id
value is the rank. To replace the tax_id
integers with the
corresponding names, run it through branch_taxid2name()
and
add the names.tbl
as the second argument (see
?branch_taxid2name
).
Let us find out how many species there are under the same genus as this organism.
First, find the tax_id
of the genus of this organism
(see the list from above). Next, use subset_clade()
to find
the clade that starts at this tax_id
. You need to supply
the nodes.tbl
, and will get a table (subset of
nodes.tbl
) listing all branches hanging from this
genus.
Filter this table to only list species (i.e. not strains or anything else).
Use branch_retrieve()
again, but use all
tax_id
s from the above table as input, together with
nodes.tbl
. Then run the resulting list through
branch_taxid2name()
to replace tax_id
s with
proper names, and finally use branch_list2table()
to get it
as a table instead of a list. You should end with a table where each row
(38 rows?) lists all species having the same genus as the organism we
started out with.
library(microclass)
names.tbl <- read_names_dmp("/mnt/courses/BIN310/module9/NCBI/names.dmp") # skip if already read
nodes.tbl <- read_nodes_dmp("/mnt/courses/BIN310/module9/NCBI/nodes.dmp")
### Retrieving branch ending at 33038
branch.lst <- branch_retrieve(33038, nodes.tbl)
print(branch.lst)
branch_taxid2name(branch.lst, names.tbl)
n.tbl <- subset_clade(2316020, nodes.tbl) %>% # 2316020 is the tax_id for genus Mediterraneibacter
filter(rank == "species")
species.tbl <- branch_retrieve(n.tbl$tax_id, nodes.tbl) %>%
branch_taxid2name(names.tbl) %>%
branch_list2table()
View(species.tbl)
From this we first learn that Ruminococcus gnavus is actually denoted [Ruminococcus] gnavus in the taxonomy, and listed under genus Mediterraneibacter. Thus, we may perhaps expect it to change name to Mediterraneibacter gnavus in the future. This is what you must learn to live with in prokaryotic taxonomy. Names change, and organisms are placed into new branches as we gradually get a clearer picture of this landscape.
The NCBI taxonomy covers all organisms, not only prokaryotes. Look up
Homo sapiens
and find all the taxa in the branch of our
species! How many species are listed under the same genus as us?
library(microclass)
names.tbl <- read_names_dmp("$COURSES/BIN310/module9/NCBI/names.dmp")
nodes.tbl <- read_nodes_dmp("$COURSES/BIN310/module9/NCBI/nodes.dmp")
homo.sapiens.tbl <- names.tbl %>%
filter(name_txt == "Homo sapiens")
homo.sapiens.lst <- branch_retrieve(homo.sapiens.tbl$tax_id[1], nodes.tbl) %>%
branch_taxid2name(names.tbl)
print(homo.sapiens.lst)
n.tbl <- subset_clade(9605, nodes.tbl) %>% # 9605 is the tax_id for genus Homo
filter(rank == "species")
species.tbl <- branch_retrieve(n.tbl$tax_id, nodes.tbl) %>%
branch_taxid2name(names.tbl) %>%
branch_list2table()
View(species.tbl)
Notice the huge number of ranks in the branch of Homo sapiens. There are 4 species listed under Homo, but should it not be more?
Let us again start by having sequenced a microbial community, such that we have some fastq file(s) of reads. From this, we want to say something about which taxa are present, and in what quantities.
How should we approach this? Should we try to assemble the reads into longer contigs, and then try to find out what taxon each contig is from? This is almost what we did in module 8 (assembly+binning). It might sound like a good idea since it should be easier to ‘see’ what taxon a longer sequence is from, compared to a (short) read. However, we saw that only the more abundant genomes will be (partly) re-constructed this way. Even if the lower-abundant taxa in there do not have enough reads to be (partly) re-constructed, it may still be possible to detect their presence, and estimate their abundance.
When composition is the main focus, it is more common to classify the reads directly, without any attempt to assemble them first. Once we start to assemble, we also run the risk of reads from different taxa ending up in the same contig (misassemblies), obscuring the classification.
A big difference from the binning we saw in module 8 is the existence of a database. Binning is a clustering of contigs; we never assign them to pre-defined categories, but make up the “categories” (=bins) as we go along. Note that when we start binning we never know how many, or which, bins we will find. It is unsupervised learning, for those familiar with machine learning terminology.
Taxonomic classification is supervised learning, which means we have some pre-defined classes (=taxa) that we want to recognize, i.e. we have a database of sequences whose taxa we already know. In order to recognize a read from, say, E. coli we need to have seen some E. coli sequences before, and know what to look for. If our database does not contain any E. coli sequences, we cannot recognize it. Thus, all the metagenome classification methods work along the same lines: build a database of sequences with known taxonomy, compare each read to this database, and assign the read to the taxon it matches best.
It should be apparent that the database of sequences becomes very important for the success of our classification.
We will typically find that a (short) read is not specific enough to match one and only one taxon. A common approach is then to use what is named the Lowest Common Ancestor (LCA) algorithm. This means that if a read matches several taxa equally well, it is assigned to the lowest-ranked taxon that has all the matching taxa in the clade below it.
As an example, a read matches both Escherichia coli and Salmonella enterica, and we cannot assign it to either species. We move up to the genus rank, but the read matches both genus Escherichia and genus Salmonella equally well. We then move up to the family rank, find that both are Enterobacteriaceae, and assign the read to this family. This means some reads are assigned to species, some to genera, some to families, and so on. This LCA approach has become a standard, and most taxonomy classifiers use it.
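To make the LCA idea concrete, here is a minimal sketch in R, using the nodes.tbl from the exercises above. The two helper functions are ad hoc (they are not from microclass), and real classifiers implement this with k-mer matching and optimized data structures. The tax_ids 562, 28901 and 543 are the NCBI ids for Escherichia coli, Salmonella enterica and Enterobacteriaceae:
path_to_root <- function(tax_id, nodes.tbl){
  # walk from a taxon up to the root (tax_id 1) via the parent pointers
  path <- tax_id
  while(tax_id != 1){
    tax_id <- nodes.tbl$parent_tax_id[nodes.tbl$tax_id == tax_id]
    path <- c(path, tax_id)
  }
  return(path)
}
lca <- function(tax_id_1, tax_id_2, nodes.tbl){
  # the first taxon on the path from taxon 1 that also lies on the
  # path from taxon 2 is the lowest common ancestor
  path_1 <- path_to_root(tax_id_1, nodes.tbl)
  path_2 <- path_to_root(tax_id_2, nodes.tbl)
  return(path_1[path_1 %in% path_2][1])
}
lca(562, 28901, nodes.tbl)   # should return 543, the family Enterobacteriaceae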
kraken2 and bracken
One of the popular tools for metagenome classification is
kraken2
. It is also quite common to post-process the
results we get from this tool with the software bracken
to
get the final results.
Let us first have a look at the original kraken paper. The kraken2 tool has been extended slightly, mostly to obtain better speed and lower memory usage, and the original paper describes the idea we are interested in: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/.
Again, we focus on the big picture in the Materials and Methods section, supported by Figure 1.
As mentioned above, it is quite common to post-process the kraken2 results with bracken. The paper describing the bracken software is found here: https://peerj.com/articles/cs-104/. Again, let us focus on the big picture. In this video I have tried to show how these two methods work as one.
Running kraken2 and using enough memory
Let us run the kraken2 software, and try to classify the Illumina reads we worked on in module 8. Here is the shell script we start out with:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=50G
#SBATCH --time=01:00:00
#SBATCH --job-name=kraken2
#SBATCH --output=kraken2_%j.log
##############
### Settings
###
kraken_dbase=/mnt/databases/kraken2/Standard
threads=10
out_dir=$SCRATCH/kraken2
krk2_report=$out_dir/kraken2_report.txt
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
#####################
### Running kraken2
###
apptainer exec $HOME/module9/kraken2:2.1.3--pl5321hdcf5f25_0.sif kraken2 \
--threads $threads --use-names --db $kraken_dbase --report $krk2_report \
--paired $r1 $r2 > $out_dir/delete_me.txt
Briefly on the content of this script: We reserve 50 gigabytes, and hope this is enough. When running kraken2 we specify --use-names in order to have names in addition to the tax_ids in the output. The standard kraken2 database uses the NCBI taxonomy, by the way.
We also specify --report $krk2_report to get a summary file of the results. The last output, that we redirect to $out_dir/delete_me.txt, is not something we will look into. It lists the results for each single read-pair, and we just delete this file (hence the name!) since we already have the summary report file with all the results we need.
Since we have paired-end reads we specify the --paired
option, resulting in each read-pair being classified. We could
also have supplied single read files (e.g. nanopore data), just remove
the --paired
option and specify one file with reads instead
of two.
Make the shell script based on the code above, save it in your
module9
folder and sbatch
.
You should now find that this actually crashes! Inspect the log-file, here is what I find:
Loading database information.../var/tmp/slurmd/job12875115/slurm_script: line 31: 26810 Killed apptainer exec kraken2:2.1.3--pl5321hdcf5f25_0.sif kraken2 --threads $threads --use-names --db $kraken_dbase --report $krk2_report --paired $r1 $r2 > $out_dir/delete_me.txt
slurmstepd: error: Detected 3 oom_kill events in StepId=12875115.batch. Some of the step tasks have been OOM Killed.
Whenever you see an oom_kill event reported as an error in your log-file, it indicates too little memory was reserved (OOM = Out Of Memory). OK, so we need to reserve more memory, but how much more? It is most likely the size of the database that is critical here. Inspect the folder where we have the database (see path above):
ls -alh /mnt/databases/kraken2/Standard
The file hash.k2d is the database that is loaded when we run kraken2. From my listing this is 67GB. We also need some memory for reading the fastq files, so let us try to increase the memory to 75GB, and re-run. This should be enough to give us some results. Always check the size of the database to get an idea of how much memory you need.
We will extend this script shortly, and don’t delete any of the
results in the $SCRATCH
folder yet…
Let us read the report from kraken2
into R, and have a
look (remember to edit the path below to read your file):
library(tidyverse)
krk.tbl <- read_delim("/mnt/SCRATCH/larssn/kraken2/kraken2_report.txt", delim = "\t",
col_names = c("percent", "clade_count", "tax_count", "rank", "tax_id", "name"),
trim_ws = T)
The column names used here are my own ‘invention’, the file has no column names.
The first column is a percent of the total number of read pairs, and
columns clade_count
and tax_count
are the read
counts. The clade_count
is the number of read pairs
assigned to all taxa in the clade starting at the given taxon, while
tax_count
are the reads assigned to this specific taxon.
This means the clade_count
s are always rather large values
when we are high up in the hierarchy, since these are the sum of all
reads assigned to all taxa in the branches below that taxon.
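As a quick sanity check of this relationship (a sketch, assuming the krk.tbl just read), the clade_count at the root should equal tax_count summed over all classified rows, since every classified read pair is counted exactly once at the taxon it was assigned to:
krk.tbl %>%
  filter(name == "root") %>%
  pull(clade_count)                                       # reads in the whole tree
sum(krk.tbl$tax_count[krk.tbl$name != "unclassified"])    # same total, summed per taxon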
The rank is given with a single letter:
Letter | Meaning |
---|---|
U | Unclassified |
R | root |
K | Kingdom |
D | Domain |
P | Phylum |
C | Class |
O | Order |
F | Family |
G | Genus |
S | Species |
Note that ‘root’ is not really a rank, it just symbolizes the start
of the taxonomy tree (“some living organism”). You also find some ranks
with an integer, e.g. S1
. This means a sub-rank, e.g. a
C1
would mean some sub-Class, which is an intermediate rank
between Class and Order. The S1
is a category below species
(e.g. strain or sub-species).
Make an R script with the R code above for reading the
kraken2
report file. Remember to edit the path to read
your report file. Inspect the krk.tbl
in RStudio.
How large a fraction of the read pairs did kraken2 not recognize (unclassified)?
How many different taxa did kraken2
assign some reads
to? This is the same as the number of rows where tax_count
has a value larger than 0.
How many read pairs did kraken2
assign to each rank?
First, to avoid all the sub-ranks, mutate the rank
texts
such that the numbers are removed, e.g. replace all S1
by
just S
and so on. Use str_remove()
. To specify
the pattern to remove, use [0-9]+
, which is a pattern that
matches all positive integers. Next, use group_by(rank)
and
then a summarise(tax_count = sum(tax_count))
to count the
number of read pairs for each rank
.
Make a bar chart of the results, and order the ranks in the order you see in the table above. How many read-pairs are assigned to the various ranks?
library(tidyverse)
krk.tbl <- read_delim("/mnt/SCRATCH/larssn/kraken2/kraken2_report.txt", delim = "\t",
col_names = c("percent", "clade_count", "tax_count", "rank", "tax_id", "name"),
trim_ws = T)
cat("The fraction of unclassified reads is", krk.tbl$percent[1] / 100, "\n")
cat("kraken2 assigns reads to", sum(krk.tbl$tax_count > 0) - 1, "taxa\n")
rank.tbl <- krk.tbl %>%
mutate(rank = str_remove_all(rank, "[0-9]+")) %>%
group_by(rank) %>%
summarise(tax_count = sum(tax_count)) %>%
mutate(rank = factor(rank, levels = c("U", "R", "K", "D", "P", "C", "O", "F", "G", "S")))
fig <- ggplot(rank.tbl) +
geom_col(aes(x = rank, y = tax_count))
print(fig)
Running bracken on the kraken2 output
If you did the last exercise above, you will have seen that not all reads are assigned to the lowest rank, Species. Even if all the genomes in the kraken2 database have some species name, some reads are assigned at higher ranks due to the LCA algorithm (see above). The bracken software has been developed to re-assign reads from these higher ranks down to lower ranks, in order to get a complete result for the composition at one selected rank, e.g. Species.
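The basic idea can be illustrated with a toy example. Note that this is only a sketch of the principle, not the actual bracken algorithm, which uses probabilities derived from the database content. Here, 100 reads left at a genus by the LCA algorithm are pushed down to its two species, proportional to the reads kraken2 assigned directly to each of them:
genus_reads <- 100                                     # reads stuck at the genus rank
species_reads <- c(species_A = 300, species_B = 100)   # reads assigned directly to species
# redistribute the genus reads proportionally and add to the species counts
species_reads + genus_reads * species_reads / sum(species_reads)
# species_A species_B
#       375       125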
To run bracken
we typically just add some code to the
existing kraken2
script:
##############
### Settings
###
brk_report=$out_dir/bracken_report.txt
target_rank=S
#####################
### Running bracken
###
apptainer exec bracken:2.8--py39hc16433a_0.sif bracken \
-d $kraken_dbase -i $krk2_report -o $brk_report -l $target_rank -r 100 -t 1
#########################
### Copying and cleaning
###
mv $brk_report $(basename $brk_report)
rm $out_dir/delete_me.txt
Here we tell bracken to aggregate reads at the Species rank (target_rank=S). On the command line we use the option -l $target_rank to specify this. You may aggregate to any rank, using the one-letter codes for the various ranks we saw in the small table above, but we focus on Species now.
The output file bracken_report.txt is what we are interested in, rather than the kraken2_report.txt we inspected above, and we move it back to the folder we are working in (the module9 folder).
The bracken results
Add the bracken code to the kraken2 script, and re-run the entire script, running first kraken2 and then bracken.
Read the file bracken_report.txt into R, using this code:
library(tidyverse)
brk.tbl <- read_delim("bracken_report.txt", delim = "\t") %>%
rename(tax_id = taxonomy_id, rank = taxonomy_lvl, kraken2_reads = kraken_assigned_reads,
bracken_reads = new_est_reads, bracken_abundance = fraction_total_reads) %>%
arrange(desc(bracken_abundance))
The table brk.tbl contains the final classification results, giving us the composition of Species in this case. Here we can see how many reads have been classified to the various species (the bracken_reads column) and how large a fraction this is of the total number of reads (the bracken_abundance column).
In this metagenome we know for certain there are exactly 30 species.
How many species are detected in total?
How many of these have been assigned more than 0.1% of the reads?
library(tidyverse)
brk.tbl <- read_delim("bracken_report.txt", delim = "\t") %>%
rename(tax_id = taxonomy_id, rank = taxonomy_lvl, kraken2_reads = kraken_assigned_reads,
bracken_reads = new_est_reads, bracken_abundance = fraction_total_reads) %>%
arrange(desc(bracken_reads))
cat("Total number of species:", nrow(brk.tbl), "\n")
cat("Number of species with at least 0.1% abundance:", nrow(brk.tbl %>% filter(bracken_abundance >= 0.001)), "\n")
Notice how we

- first used kraken2 to classify reads to taxa…
- …and then ran bracken on the kraken2 results to aggregate all reads to some specified rank.

The results in the bracken output file are estimates of the relative abundance of each taxon at this rank.
The list of species is very long! This is typical for this, and many
other, taxonomic classification methods. Notice most of the species have
very few reads assigned, i.e. they make up a tiny fraction of the total
(last column). If we count all these species as ‘positives’ (species
found in our metagenome), we probably have a huge number of false
positives. It is very likely that most of the species with very low
read count are just artefacts of the sequencing and the method we used.
Remember we classify every single read pair. Reads with some sequencing
error, coming from species A, may then very well match a closely related
species B even better. Thus, listing all species with some tiny read
count as present in the metagenome will result in many false positives.
Often we simply use a threshold at some low abundance, and report
something like: “We found x species with abundance above y% in the
metagenome”. Species with lower abundances may be in there, but probably
have little impact on the microbial community anyway.
The kraken2 and bracken database
The kraken2 software allows us to make our own database, and use this for classification. This is not something we will dig into here, but it should be mentioned, since it turns out to be quite important for all practical uses. You will never recognize a taxon which is not in the database! Also, if one branch in the taxonomy has a lot of genomes in the database, and another has few, this tends to affect the results we get.
Making custom databases for various types of environments is therefore important, in order to get classifications that are as good as possible. Some of us at KBM published such a resource for studying the human gut environment some time ago, see this scientific publication in Microbiome and also an article about it in Forskning.no. Note how this last article includes pictures of scientists in white lab coats, while in reality this was purely dry-lab work, i.e. 100% bioinformatics!
Once you have a kraken2 database, bracken also has commands for building its own ‘database’ inside the kraken2 database folder. If you inspect the folder at /mnt/databases/kraken2/Standard (that we used above) you can see the files it contains. The raw data (genome fasta files) that were used have been deleted, since these take up a lot of disk space.
How good are the classifications we get, from the methods above or any other method?
To answer this, we first need some gold standard data sets where we know exactly which taxa are present, and in what quantities.
Such data sets exist. We can simulate data sets, by mixing some
genomes and then use sequencing simulation software to produce the
reads. This is exactly what we have done to produce the data set we used
in module 8 (the metagenome30
fastq files). However,
simulated data are often slightly ‘nicer’ than real data, and the
results we get are probably too good to reflect real life results.
Another, and more expensive and cumbersome approach, is to construct mock communities. This means we mix the organisms to known relative quantities in vitro in the lab, and then sequence this using the normal procedures. Such data are very close to real life metagenome data, but the actual compositions are not as precisely determined as for simulated data, since it is difficult to control this perfectly in vitro. Also, such mock communities always have rather few taxa (for practical reasons), and in this way do not resemble real data where we may have hundreds or even thousands of species.
Anyway, results of the type we got from bracken
above
must then be compared to this gold standard in some way, and there are
several statistics we may compute indicating how good/bad the results
turned out.
Let us look at the following paper on how to compare the output from such classification tools to a gold standard: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1646-y. Let us focus on the section Metrics and accompanying visualizations.
Here is a short video explaining the basic statistics we will use.
In the tab-delimited file
$COURSES/BIN310/module9/gold_standard.txt
we have the
actual composition of species in the metagenome30
data set.
Let us read this into R:
gold.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt", delim = "\t")
head(gold.tbl)
The third column is the actual abundance for each taxon, and if you
sum all these values they sum to 1.0
(sum(gold.tbl$true_abundance)
).
To compare the bracken results (or any other result) to this, we need to add such abundances as a new column, and make certain these also sum to 1.0.
Once we have such a table, we can compare the bracken
results to the gold standard. If we have other results, from other
methods, we could add these as well, and compare these to the gold
standard in the exact same way.
Comparing the bracken results to the gold standard
Make an R script where you first read the gold standard as above, but select only the columns tax_id and true_abundance.
Then, read the bracken
results into another table, as we
did above, and again, select only the columns tax_id
and
bracken_abundance
.
Notice both tables have a column tax_id
with the same
type of information. Next, join the two tables by the
tax_id
column as the joining information, use
full_join(___, ___, by = "tax_id")
and store this in
joined.tbl
.
After such a joining, we will typically find some NA
s
(missing values) in the columns listing abundances. This is because the
tax_id
s in the table now include all those from both
tables. If a tax_id
was found in only one of the tables we
joined, the columns coming from the other table will be filled with
NA
in the corresponding row. All these NA
should be replaced by 0
. In this case a missing value is
the same as zero abundance. You must not infer from this that
NA
always means the value 0
! Here
NA
means ‘not counted’ which is the same as ‘counted zero
times’. If you measured something, e.g. lengths, a missing measurement
does not mean you measured the length zero! To replace all
NA
with 0
use
joined.tbl <- joined.tbl %>% replace(is.na(.), 0)
.
Based on the joined.tbl
, compute the \(L_1\) norm between the abundances predicted
by bracken
and the gold standard abundances. See formula in
the OPAL paper and the video above.
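In short, with true abundances \(t_i\) and predicted abundances \(p_i\) over all taxa \(i\) in the joined table, the statistic is \(L_1 = \sum_i |t_i - p_i|\). It is 0 for a perfect prediction, and since both abundance vectors sum to 1.0, it can never exceed 2.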
We would like to compute precision and recall from such results. Here is a function for computing the precision:
precision <- function(true_abd, predicted_abd, threshold = 0){
actual <- which(true_abd > threshold)
predicted <- which(predicted_abd > threshold)
TP <- sum(actual %in% predicted)
FP <- sum(!(predicted %in% actual))
return(TP / (TP + FP))
}
Here true_abd
must be a vector (=column) with the gold
standard abundances, and predicted_abd
another vector with
some abundances from some method, e.g. bracken
. The
threshold
is the abundance limit between Presence and
Absence of a taxon, i.e. an abundance as low as this threshold is
considered the same as Absent from the metagenome. It is by default
0
here (as in the OPAL paper), but the user may choose a
higher threshold. Add this function to your R script, and use it to
compute the precision for thresholds 0
and then for
threshold 0.001
.
Try to make a similar function for computing the recall
,
by copying and modifying the precision
function above. Then
use it to compute the recall for the same bracken
results,
using the same two thresholds as above.
library(tidyverse)
### Solution to first exercise
gold.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt", delim = "\t") %>%
select(tax_id, true_abundance)
brk.tbl <- read_delim("bracken_report.txt", delim = "\t") %>%
rename(tax_id = taxonomy_id, rank = taxonomy_lvl, kraken2_reads = kraken_assigned_reads,
bracken_reads = new_est_reads, bracken_abundance = fraction_total_reads) %>%
select(tax_id, bracken_abundance)
joined.tbl <- full_join(gold.tbl, brk.tbl, by = "tax_id") %>%
replace(is.na(.), 0)
### Solution to second exercise
L1.norm <- sum(abs(joined.tbl$true_abundance - joined.tbl$bracken_abundance))
cat("BRACKEN RESULTS:\n")
cat("The L_1 norm is", L1.norm, "\n")
precision <- function(true_abd, predicted_abd, threshold = 0){
true <- which(true_abd > threshold)
predicted <- which(predicted_abd > threshold)
TP <- sum(true %in% predicted)
FP <- sum(!(predicted %in% true))
return(TP / (TP + FP))
}
recall <- function(true_abd, predicted_abd, threshold = 0){
true <- which(true_abd > threshold)
predicted <- which(predicted_abd > threshold)
TP <- sum(true %in% predicted)
FN <- sum(!(true %in% predicted))
return(TP / (TP + FN))
}
pres.0 <- precision(joined.tbl$true_abundance, joined.tbl$bracken_abundance)
rec.0 <- recall(joined.tbl$true_abundance, joined.tbl$bracken_abundance)
cat("With threshold 0: Precision =", pres.0, " and Recall =", rec.0, "\n")
pres.0.001 <- precision(joined.tbl$true_abundance, joined.tbl$bracken_abundance, threshold = 0.001)
rec.0.001 <- recall(joined.tbl$true_abundance, joined.tbl$bracken_abundance, threshold = 0.001)
cat("With threshold 0.001: Precision =", pres.0.001, " and Recall =", rec.0.001, "\n")
The kaiju software
Let us have a look at an alternative to kraken2/bracken called kaiju.
What is special about kaiju
is that it uses protein
sequences for classification. Thus, reads are translated into amino
acid sequences, in all six reading frames, and then compared to a
database of proteins. This comparison is done by the use of the
Burrows-Wheeler transform, that we first saw in module 5.
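As a toy illustration of what ‘all six reading frames’ means, here is a sketch using the Bioconductor package Biostrings (my choice for illustration; kaiju of course does this internally, in compiled code):
library(Biostrings)
read <- DNAString("ATGGCGTTTAGGCTACCGGTT")
# three forward frames (offsets 0, 1, 2) plus three on the reverse complement
frames <- c(lapply(1:3, function(i) subseq(read, start = i)),
            lapply(1:3, function(i) subseq(reverseComplement(read), start = i)))
# translate each frame to amino acids; warnings about incomplete
# trailing codons are expected and harmless here
suppressWarnings(lapply(frames, translate))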
The major idea behind the use of proteins for taxonomic
classification is to get some results even if our data are from
completely new environments, where we have very few good reference
genomes in our databases. A tool like kraken2
requires that
the database contains, more or less, the same taxa that we have been
sequencing. This was also observed by the people behind
kaiju
when they analyzed data from some hot springs, where
the microbial community contains some taxa that are poorly represented
among sequenced genomes. A tool like kraken2
will then
leave the vast majority of reads as unclassified. However, proteins are
more conserved, and even if the genomes are new to us, the proteins are
often similar. Then, instead of leaving a vast number of reads as
unclassified, we may perhaps still give them some taxonomic assignment
which is perhaps not exactly correct, but might still give us a good
picture of what it resembles among the genomes we do know.
I think we can say that

- kaiju is a good choice if your data are from poorly studied environments, where you expect to find many genomes with poor match to already sequenced genomes (e.g. soils, water, air etc.).
- results from kaiju tend to have a lower resolution, i.e. we cannot expect to separate closely related taxa, since their proteins are even more similar than their genomes.
Running kaiju
Let us run kaiju on our metagenome30 data, just like we did for kraken2:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=130G
#SBATCH --time=01:00:00
#SBATCH --job-name=kaiju
#SBATCH --output=kaiju_%j.log
#############
### Settings
###
threads=10
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
kaiju_dbase=/mnt/databases/kaiju/nr_NCBI/
kaiju_fmi=$kaiju_dbase/kaiju_db_nr_euk.fmi
kaiju_nodes_dmp=$kaiju_dbase/nodes.dmp
kaiju_names_dmp=$kaiju_dbase/names.dmp
out_dir=$SCRATCH/kaiju
kaiju_report=$out_dir/kaiju_report.txt
target_rank=species
if [ ! -d $out_dir ]
then
mkdir $out_dir
fi
##################
### Running kaiju
###
apptainer exec $HOME/module9/kaiju:1.9.2--h5b5514e_1.sif kaiju \
-v -z $threads -t $kaiju_nodes_dmp -f $kaiju_fmi -i $r1 -j $r2 -o $out_dir/delete_me.txt
#####################################
### ...and aggregating to
### target rank
###
apptainer exec $HOME/module9/kaiju:1.9.2--h5b5514e_1.sif kaiju2table \
-t $kaiju_nodes_dmp -n $kaiju_names_dmp -r $target_rank -o $kaiju_report $out_dir/delete_me.txt
###############
### Cleaning
###
cp $kaiju_report $(basename $kaiju_report)
Here we first run kaiju using the fastq files in $r1 and $r2 as input, producing the output file $out_dir/delete_me.txt. Then we use the tool kaiju2table, which is also supplied in the same container, to aggregate the read counts to some target rank. Here we have chosen the rank species, as we did for bracken. This is perhaps a little too low in the taxonomy, since it is difficult to separate between species based on proteins only. It is perhaps more common to aggregate to the rank genus (or even higher) when using this tool. We are only interested in the file $kaiju_report that is produced after this step, and this we copy back to our module9 folder; the $out_dir, with the large delete_me.txt file, may then be deleted.
The database we used above is the nr database from NCBI, and the file kaiju_db_nr_euk.fmi is the Burrows-Wheeler FM index for this database. We notice that kaiju also uses the NCBI Taxonomy database, and has its own version of the files names.dmp and nodes.dmp that we supply as input. Notice we need to reserve much more memory than for kraken2. Inspect the database folder, and see how large the FM index is!
Just like kraken2, kaiju is supplied with some ways of installing a number of databases, and the nr is the most comprehensive one. Unfortunately, it is not as straightforward to customize your own database as it is for kraken2. But then, perhaps the need for this is not as acute?
Since this is a tool we use to cover as widely as possible, the
nr
database seems to be the natural choice.
Comparing the kaiju results to the gold standard
Make a shell script from the code above, and sbatch to get results from kaiju.
Let us extend the R script where we compared the bracken
results to the gold standard. Read the kaiju
report file
(it is also a tab-delimited text file), select and rename two of the
columns by
select(tax_id = taxon_id, kaiju_abundance = percent)
. Note
how we can both select and rename a column in one go.
Then compute relative abundances as
mutate(kaiju_abundance = kaiju_abundance / sum(kaiju_abundance))
and store the results in kaiju.tbl
.
Join this table to the joined.tbl
as we did above
(joining by "tax_id"
). Compare these results to the gold
standard in a similar way to what we did for the bracken
results above (\(L_1\) norm, precision
and recall). Are these results closer to the gold standard?
This is the full script after we add this new code to the script from
above, where we read the gold standard and bracken
results:
library(tidyverse)
### Read gold standard
gold.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt", delim = "\t") %>%
select(tax_id, true_abundance)
## Read bracken results, and join with gold standard
brk.tbl <- read_delim("bracken_report.txt", delim = "\t") %>%
rename(tax_id = taxonomy_id, rank = taxonomy_lvl, kraken2_reads = kraken_assigned_reads,
bracken_reads = new_est_reads, bracken_abundance = fraction_total_reads) %>%
select(tax_id, bracken_abundance)
joined.tbl <- full_join(gold.tbl, brk.tbl, by = "tax_id") %>%
replace(is.na(.), 0)
### Read kaiju results and join with the existing joined.tbl
kaiju.tbl <- read_delim("kaiju_report.txt", delim = "\t") %>%
select(tax_id = taxon_id, kaiju_abundance = percent) %>%
mutate(kaiju_abundance = kaiju_abundance / sum(kaiju_abundance))
joined.tbl <- full_join(joined.tbl, kaiju.tbl, by = "tax_id") %>%
replace(is.na(.), 0)
### Define functions to be used below:
precision <- function(true_abd, predicted_abd, threshold = 0){
true <- which(true_abd > threshold)
predicted <- which(predicted_abd > threshold)
TP <- sum(true %in% predicted)
FP <- sum(!(predicted %in% true))
return(TP / (TP + FP))
}
recall <- function(true_abd, predicted_abd, threshold = 0){
true <- which(true_abd > threshold)
predicted <- which(predicted_abd > threshold)
TP <- sum(true %in% predicted)
FN <- sum(!(true %in% predicted))
return(TP / (TP + FN))
}
### Results for bracken:
L1.norm <- sum(abs(joined.tbl$true_abundance - joined.tbl$bracken_abundance))
cat("BRACKEN RESULTS:\n")
cat("The L_1 norm is", L1.norm, "\n")
pres.0 <- precision(joined.tbl$true_abundance, joined.tbl$bracken_abundance)
rec.0 <- recall(joined.tbl$true_abundance, joined.tbl$bracken_abundance)
cat("With threshold 0: Precision =", pres.0, " and Recall =", rec.0, "\n")
pres.0.001 <- precision(joined.tbl$true_abundance, joined.tbl$bracken_abundance, threshold = 0.001)
rec.0.001 <- recall(joined.tbl$true_abundance, joined.tbl$bracken_abundance, threshold = 0.001)
cat("With threshold 0.001: Precision =", pres.0.001, " and Recall =", rec.0.001, "\n")
### Results for kaiju:
cat("KAIJU RESULTS:\n")
L1.norm <- sum(abs(joined.tbl$true_abundance - joined.tbl$kaiju_abundance))
cat("The L_1 norm is", L1.norm, "\n")
pres.0 <- precision(joined.tbl$true_abundance, joined.tbl$kaiju_abundance)
rec.0 <- recall(joined.tbl$true_abundance, joined.tbl$kaiju_abundance)
cat("With threshold 0: Precision =", pres.0, " and Recall =", rec.0, "\n")
pres.0.001 <- precision(joined.tbl$true_abundance, joined.tbl$kaiju_abundance, threshold = 0.001)
rec.0.001 <- recall(joined.tbl$true_abundance, joined.tbl$kaiju_abundance, threshold = 0.001)
cat("With threshold 0.001: Precision =", pres.0.001, " and Recall =", rec.0.001, "\n")
Once we have a table like joined.tbl it is customary to plot the abundances as stacked bar charts. However, we should not use all taxa, since this makes the plot messy. Add code to the script from above to make the plot, and let us do this very step-by-step. You should also run this code and inspect the results after each new step, to understand how the code works:

- Filter the joined.tbl table to keep only the gold standard taxa, i.e. the rows where the gold standard has positive abundance. Name this new table pruned.tbl.
- We want proper names instead of tax_id to display in plots. The names are in the original gold standard table. Read the gold standard table again, select only the name and tax_id, and then full_join() with your pruned.tbl, again using tax_id as the joining column.
- Convert the name-column to a factor, and use the names as levels in the order they appear. Try also to comment this out, and run the code to plot without this statement, to see the difference. It has to do with the ordering of the colors in the plot.
- Use pivot_longer() to stack the abundances into a single column named Abundances, and with the column names in a new column named Tool. Name this table long.tbl.
- Plot the long.tbl with ggplot(), using geom_col() and mapping the fill-colors to the names.

Here is a skeleton for this code:
pruned.tbl <- joined.tbl %>%
filter(___ > 0)
pruned.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt", delim = "\t") %>%
select(___, ___) %>%
full_join(___, by = "tax_id")
___ <- pruned.tbl %>%
mutate(name = factor(___, levels = name)) %>%
pivot_longer(cols = c(true_abundance, ___, ___),
values_to = "Abundances", names_to = "Tool")
fig <- ggplot(long.tbl) +
geom_col(aes(x = Tool, y = Abundances, fill = name), color = "black") +
coord_flip() +
theme(legend.position = "bottom", legend.key.width = unit(10, "points")) +
guides(fill = guide_legend(nrow = 10))
print(fig)
Fill in the missing parts, and run the code step-by-step. Inspect the results after each step, and try to understand what happens.
The plot should look something like this:
This plot reveals nothing about false positives, but it does illustrate to what extent the two methods give us true positives and false negatives, and especially if the abundances are close to that of the gold standard or not. A ‘perfect’ prediction of taxonomic composition would have all colored sectors of the exact same sizes as the gold standard.
Add to the script from the previous exercise:
pruned.tbl <- joined.tbl %>%
filter(true_abundance > 0)
pruned.tbl <- read_delim("/mnt/users/larssn/BIN310/coursedata/module10/gold_standard.txt", delim = "\t") %>%
select(tax_id, name) %>%
full_join(pruned.tbl, by = "tax_id")
long.tbl <- pruned.tbl %>%
mutate(name = factor(name, levels = name)) %>%
pivot_longer(cols = c(true_abundance, bracken_abundance, kaiju_abundance),
values_to = "Abundances", names_to = "Tool")
fig <- ggplot(long.tbl) +
geom_col(aes(x = Tool, y = Abundances, fill = name), color = "black") +
coord_flip() +
theme(legend.position = "bottom", legend.key.width = unit(10, "points")) +
guides(fill = guide_legend(nrow = 10))
print(fig)
In module 8 we did some assembly of metagenome reads, and then some
binning and quality check of the resulting contigs in order to
re-construct some of the more abundant genomes in the microbial
community. Once we have such MAGs (Metagenome Assembled Genomes), we
would also like to assign these to some taxon, i.e. do a taxonomic
classification of entire genomes.
The gtdbtk tool
In principle, we could use a tool like kraken2, but this is built for classifying a huge number of shorter reads. Having only a few very long sequences (contigs), there are other approaches that are likely better, also taking into account the information about genes in the MAG. One such tool is gtdbtk. This will take a
. This will take a
genome (or a set of binned contigs) as input, and use the GTDB database
(https://gtdb.ecogenomic.org/) to assign this to some
taxon. Note that GTDB is not the same as the NCBI Taxonomy, and
some taxa may have different names and/or be in a different clade
here.
Let us use the software gtdbtk
to classify the MAGs we
re-constructed in module 8. In $COURSES/BIN310/module9/MAG/
you find a copy of the MAGs we found, or you may use your own results.
Download the container as usual.
In order to classify, the gtdbtk
must of course also
have a database. In many software tools there is an option where we
simply give the path to where the database is found on this system. Here
we have a different solution.
First, make the following script, save it in your
module9
folder and sbatch
:
#!/bin/sh
#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=180G
#SBATCH --time=02:00:00
#SBATCH --job-name=gtdbtk
#SBATCH --output=gtdbtk_%j.log
##############
### Settings
###
threads=10
genomes_dir=$COURSES/BIN310/module9/MAG
sketch_file=$COURSES/BIN310/module9/gtdbtk_release207_v2.msh
out_dir=$SCRATCH/gtdbtk
###################
### Running GTDBTk
###
apptainer exec $HOME/module9/gtdbtk:2.3.2--pyhdfd78af_0.sif gtdbtk classify_wf \
--genome_dir $genomes_dir \
--out_dir $out_dir \
--cpus $threads \
--extension fasta \
--mash_db $sketch_file
Notice we need quite a lot of memory since the database is large. Notice also how we supply a mash sketch file (see module 7) here; the mash software is evidently used under the hood.
Notice we never specify any database. If you sbatch
this
you should find it crashes, and inspect the log-file. It should look
something like:
[2023-09-18 14:26:09] INFO: GTDB-Tk v2.3.2
[2023-09-18 14:26:09] INFO: gtdbtk classify_wf --genome_dir /mnt/courses/BIN310/module9/MAG --out_dir /mnt/SCRATCH/larssn/gtdbtk --cpus 10 --extension fasta --mash_db /mnt/courses/BIN310/module9/gtdbtk_release207_v2.msh
================================================================================
ERROR
________________________________________________________________________________
The 'GTDBTK_DATA_PATH' environment variable is not defined.
Please set this variable to your reference data package.
https://ecogenomics.github.io/GTDBTk/installing/index.html
================================================================================
[2023-09-18 14:26:09] ERROR: Controlled exit resulting from early termination.
It turns out we have to create an environment variable named
GTDBTK_DATA_PATH
, and this should contain the correct path
to the GTDB database on our system. The correct path is in our case
/mnt/databases/GTDBTK/release207_v2/
. In order to turn an ordinary shell variable into an environment variable we use the
export
statement. Add the following lines to your script,
before you run gtdbtk
:
##################################
### Setting environment variable
### with path to database
###
export GTDBTK_DATA_PATH=/mnt/databases/GTDBTK/release207_v2/
Save, and sbatch
. Hopefully the computations are running
now!
In the previous module we assembled, binned and de-replicated the bins to produce the 6 MAGs we now classified. The table of results from that exercise is found in the file /mnt/courses/BIN310/module9/MAG.tbl.RData. Let us now add the taxonomy information to this table. Make a script where you

- use load() on the RData file to load the MAG.tbl.
- read the summary file (gtdbtk.bac120.summary.tsv) you find in the output folder from the above script.

In the summary file there are several columns you may have a look at,
but we will focus only on the first two: The user_genome
you may rename to MAG_id
, since this corresponds to that
column in the MAG.tbl
. The second column
(classification
) contains the predicted taxonomy. Now,
select these two columns only and join this to the MAG.tbl
such that the latter table also has some taxonomy for each MAG.
How does the species classification we got here correspond to the truth? Read in the gold standard as we did above (in /mnt/courses/BIN310/module9/gold_standard.txt) and see if the same species are listed among the most abundant taxa in that table. Since the NCBI Taxonomy and the GTDB taxonomy are not identical, the names may differ, but it could also be that gtdbtk mis-classifies some of the genomes.
First, to complete the MAG.tbl
with an additional
taxonomy (classification
) column:
library(tidyverse)
load("/mnt/courses/BIN310/module9/MAG.tbl.RData")
MAG.tbl <- read_delim("/mnt/SCRATCH/larssn/gtdbtk/gtdbtk.bac120.summary.tsv") %>%
select(MAG_id = user_genome, classification) %>%
right_join(MAG.tbl, by = "MAG_id")
View(MAG.tbl)
Next, to compare against the gold standard, add the lines:
gold.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt")
compare.tbl <- MAG.tbl %>%
mutate(name = word(classification, -1, sep = "_")) %>%
full_join(gold.tbl, by = "name")
It looks like 3 out of the 6 predicted species are found among the gold standard species. We also note that the MAG classified as Bifidobacterium vaginale may be one of the other Bifidobacterium species in the gold standard, and the same goes for Stenotrophomonas pavanii (Stenotrophomonas maltophilia in the gold standard). Even if the exact species is not correct, the genus may be. We should keep an open mind with respect to this, since MAGs are in general uncertain in nature, as we have seen.