1 Learning goals

We continue what we started in Module 8, and focus on metagenome data. This means we have sequenced all DNA in a sample from some environment that we are interested in. The reads we have in our fastq files are not from only one organism, but from several. We do not know how many, nor how much each of these contributes to the total pool of DNA. We saw in Module 8 that if we try to re-construct the genomes in such a metagenome, we can typically only re-construct the most abundant ones, and even these may not be very accurately re-constructed. But even if we do not have enough reads to re-construct the genomes, we may still detect what kind of bacteria they are. Which bacteria are in our sample, and how abundant are they? These are the questions we now focus on. In order to do this we need to do taxonomic classification of our data. We aim to

  • Get to know the NCBI Taxonomy database, and see how we can search for taxonomic information with some simple R functions.
  • Understand the ideas behind the kraken2 and bracken tools.
  • Learn how to use the kraken2 and bracken tools.
  • Learn how to use the alternative tool kaiju.
  • Learn how to compare the results we get from the above methods to a gold standard.
  • Learn how we use a tool like gtdbtk to classify entire genomes, instead of reads.



1.1 Software used in this module





2 Taxonomy

As briefly mentioned in module 8, once we have sequenced a metagenome we are often interested in the composition of the microbial community, i.e. which organisms are there, and in what quantities. In many studies this information about the composition is of primary interest. Here are some examples:

  • We want to see how the composition of the human gut changes from ‘normal’ healthy people to people with some specific disease, e.g. Inflammatory Bowel Disease (medicine).
  • We want to see how the composition of the seafloor sediments changes from ‘normal’ sediments to those we find under the fish farms along the Norwegian coast (environmental surveillance).
  • We want to know what a ‘normal’ composition of the air in the Oslo subway system looks like, and how it fluctuates under normal circumstances, in order to detect the presence of any non-normality (anti-bioterrorism).
  • We want to see how the composition of a biological trace material varies between the various body fluids that can be secreted from the human body, in order to understand the activity behind the deposition of some crime scene trace material (forensics).

In all these cases we need to put some name or label on each of the various organisms. A taxonomy is a way of organizing biology in a hierarchical structure and giving names to the various organisms. There is no official prokaryotic taxonomy! Still, there is a consensus about the major structure:

Figure 1. The major ranks in the hierarchical system of taxonomy. The rank Domain is sometimes also referred to as Superkingdom. Prokaryotes (bacteria and archaea) have no Kingdom, we usually go directly from Domain (Superkingdom) to phylum. Note that there may be other ranks in between these, these are the major ranks only.

Each ‘level’ in the taxonomy is referred to as a taxonomic rank. All named entities at all taxonomic ranks are taxa, i.e. a taxon may be a species but also a phylum.

When we talk about taxonomic classification we mean putting a read (or read-pair, or contig or full genome) into this taxonomy by assigning it to some taxon. We could for example consider an assembled genome, and say “this is a genome from the species Bacillus cereus”, or we could consider a read-pair and state “these reads come from the phylum Firmicutes”.

It should be noted that in biology, the word “classification” usually refers to doing a taxonomic classification, while in statistics/machine learning classification means assigning any objects to some pre-defined categories, and is often synonymous with pattern recognition. Thus, the data scientist's/statistician's meaning of ‘classification’ is more general, and doesn't need to have anything to do with taxonomy. But in both cases we classify by assigning some object to some pre-defined class (=category).

2.1 The NCBI Taxonomy database

As mentioned above, there is no official taxonomy for prokaryotes. However, most people relate to the NCBI Taxonomy that we find here. From the web-site you can search and explore the taxonomy, but when we have huge data sets we cannot rely on the manual clicking of web-pages.

Let us first spend some time to get to know the NCBI Taxonomy database content, and how it is organized. This is not necessary for using the software tools we will look at below, but it gives us some insight into the taxonomy that may come in handy for processing the results we get from the classifiers, or for modifying their databases.

The essential content of the NCBI Taxonomy database we can download from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz. I have downloaded this by the command (you do not need to do this)

wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

The file taxdump.tar.gz that I downloaded is a compressed archive, similar to a .zip file. I put this file in $COURSES/BIN310/module9/NCBI.

We extract the archive using the tar command in UNIX:

tar xvf taxdump.tar.gz

This will extract a number of files and store them all in the same folder where the archive is. Have a look at the content of the folder $COURSES/BIN310/module9/NCBI to see the files we unpacked (ls this folder). There is a readme.txt file here, you may have a look at it with:

cat $COURSES/BIN310/module9/NCBI/readme.txt

This archive contains some files representing the entire database. There are 2 files we are especially interested in

  • The text file names.dmp
  • The text file nodes.dmp

Together, these two files describe the entire NCBI Taxonomy tree! Let us have a look at their content in the Terminal window (they are too big for opening in RStudio):

head $COURSES/BIN310/module9/NCBI/names.dmp    # head display the first 10 lines of text
1       |       all     |               |       synonym |
1       |       root    |               |       scientific name |
2       |       Bacteria        |       Bacteria <bacteria>     |       scientific name |
2       |       bacteria        |               |       blast name      |
2       |       eubacteria      |               |       genbank common name     |
2       |       Monera  |       Monera <bacteria>       |       in-part |
2       |       Procaryotae     |       Procaryotae <bacteria>  |       in-part |
2       |       Prokaryotae     |       Prokaryotae <bacteria>  |       in-part |
2       |       Prokaryota      |       Prokaryota <bacteria>   |       in-part |
2       |       prokaryote      |       prokaryote <bacteria>   |       in-part |

We notice this looks like a table. There are some columns, separated by the symbol |. The other file, nodes.dmp is similar, but has many more columns:

head $COURSES/BIN310/module9/NCBI/nodes.dmp
1       |       1       |       no rank |               |       8       |       0       |       1       |       0       |       0       |       0       |       0       |       0       |               |
2       |       131567  |       superkingdom    |               |       0       |       0       |       11      |       0       |       0       |       0       |       0       |       0       |               |
6       |       335928  |       genus   |               |       0       |       1       |       11      |       1       |       0       |       1       |       0       |       0       |               |
7       |       6       |       species |       AC      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
9       |       32199   |       species |       BA      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
10      |       1706371 |       genus   |               |       0       |       1       |       11      |       1       |       0       |       1       |       0       |       0       |               |
11      |       1707    |       species |       CG      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |       effective current name; |
13      |       203488  |       genus   |               |       0       |       1       |       11      |       1       |       0       |       1       |       0       |       0       |               |
14      |       13      |       species |       DT      |       0       |       1       |       11      |       1       |       0       |       1       |       1       |       0       |               |
16      |       32011   |       genus   |               |       0       |       1       |       11      |       1       |       0       |       1       |       0       |       0       |               |

Let us read these files into R, to inspect them more properly. Unfortunately, even though the files have tab-separated columns, the standard functions in R have problems reading them (they only read parts of them). To overcome this I have made some functions for reading/writing such files. These are in an R package called microclass, but this is not the one you find on CRAN. To install the latest version of this, do as follows:

  • Google for “Lars Snipen GitHub” and find my GitHub site.
  • Click Repositories in the upper menu, and find the microclass repository.
  • Click it open, and scroll down for instructions on how to install. Follow the instructions…

In this package you should find the functions read_names_dmp() and read_nodes_dmp().
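To see why these files need special treatment, here is a minimal base-R sketch of how such a .dmp file could be parsed by hand. The fields are separated by "\t|\t" and each line ends with "\t|" (see the readme.txt). The column names below are assumptions chosen to match the exercises; read_names_dmp() presumably does something similar internally:

library(tibble)

lines <- readLines("/mnt/courses/BIN310/module9/NCBI/names.dmp")
lines <- sub("\t\\|$", "", lines)          # strip the terminating tab-pipe
fields <- strsplit(lines, "\t\\|\t")       # split on the tab-pipe-tab separator
names.tbl <- tibble(
  tax_id      = as.integer(sapply(fields, `[`, 1)),
  name_txt    = sapply(fields, `[`, 2),
  unique_name = sapply(fields, `[`, 3),
  name_class  = sapply(fields, `[`, 4)
)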

2.1.1 Exercise - the names.dmp table

The human gut species Ruminococcus gnavus has some impact on human health. Let us look it up, and get to know the NCBI Taxonomy. Make an R script and read the names.dmp file using the read_names_dmp() function from microclass (it takes some seconds to read). Store it in the object names.tbl. There should be 4 columns.

  • How many rows are there in this table?
  • The tax_id is a unique integer for each taxon. How many unique tax_id are there? Note that names are not unique to each taxon, the same taxon (=tax_id) may have several names listed!
  • Filter it to only contain the scientific names (has "scientific name" in the name_class column). How many rows are there now?
  • Filter this again into a new table containing taxa where the name contains (among other texts) the genus name Ruminococcus in the name_txt column. Use str_detect() and store in the table rum.tbl. How many rows are there?
  • Scroll the table to find an entry with the species name gnavus in addition to the genus name. Find its tax_id. Then filter the names.tbl again, but for this tax_id instead.
  • What does it mean when a genus name is in brackets? (see https://support.nlm.nih.gov/knowledgebase/article/KA-03379/en-us)

2.1.2 Exercise solution

Add this to your R script:

library(microclass)
library(tidyverse)    # for filter(), str_detect() and the pipe

names.tbl <- read_names_dmp("/mnt/courses/BIN310/module9/NCBI/names.dmp")
cat("The names.dmp file has", nrow(names.tbl), "rows\n")
cat("There are", length(unique(names.tbl$tax_id)), "unique tax_id's\n")
names.tbl <- names.tbl %>%
  filter(name_class == "scientific name")
cat("The table with only scientific names has", nrow(names.tbl), "entries\n")
rum.tbl <- names.tbl %>%
  filter(str_detect(name_txt, "Ruminococcus"))
head(rum.tbl)
tax_id_33038.tbl <- names.tbl %>%
  filter(tax_id == 33038)
cat("The proper scientific name is", tax_id_33038.tbl$name_txt[1])

2.1.3 Exercise - taxonomy table in R

Let us see how we can get a table listing the taxonomy of all taxa in the clade descending from a particular taxon.

Extend the script above by also reading the file nodes.dmp into R in a similar way as above, store in nodes.tbl.

Notice only 3 columns were read, since these hold the essential information. Inspect the table in R. Note there is one row for each tax_id (compare the number of rows to the number of unique tax_ids in the previous exercise). For each tax_id there is a rank and also the parent_tax_id. This is the tax_id of the taxon at the rank above, i.e. the ‘parent’ in the hierarchy.
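As a tiny illustration of this parent structure, here is how one step up the tree looks in R, using the tax_id 33038 from the previous exercise (assuming nodes.tbl is loaded as above):

library(tidyverse)

# The row for tax_id 33038: its rank, and the tax_id of its parent taxon
nodes.tbl %>%
  filter(tax_id == 33038) %>%
  select(tax_id, rank, parent_tax_id)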

The microclass package also has some functions for handling the NCBI taxonomy. Let us explore this briefly, by adding more code to the R script:

Retrieve the branch ending at tax_id 33038 by

branch.lst <- branch_retrieve(33038, nodes.tbl)

Inspect the list. Notice the name above each tax_id value is the rank. To replace the tax_id integers with the corresponding names, run it through branch_taxid2name() and add the names.tbl as the second argument (see ?branch_taxid2name).

Let us find out how many species there are under the same genus as this organism.

First, find the tax_id of the genus of this organism (see the list from above). Next, use subset_clade() to find the clade that starts at this tax_id. You need to supply the nodes.tbl, and will get a table (subset of nodes.tbl) listing all branches hanging from this genus.

Filter this table to only list species (i.e. not strains or anything else).

Use branch_retrieve() again, but use all tax_ids from the above table as input, together with nodes.tbl. Then run the resulting list through branch_taxid2name() to replace tax_ids with proper names, and finally use branch_list2table() to get it as a table instead of a list. You should end up with a table where each row is one species (38 rows?), listing all species having the same genus as the organism we started out with.

2.1.4 Exercise solution

library(microclass)
library(tidyverse)
names.tbl <- read_names_dmp("/mnt/courses/BIN310/module9/NCBI/names.dmp") # skip if already read
nodes.tbl <- read_nodes_dmp("/mnt/courses/BIN310/module9/NCBI/nodes.dmp")

### Retrieving branch ending at 33038
branch.lst <- branch_retrieve(33038, nodes.tbl)
print(branch.lst)
branch_taxid2name(branch.lst, names.tbl)

n.tbl <- subset_clade(2316020, nodes.tbl) %>%
  filter(rank == "species")
species.tbl <- branch_retrieve(n.tbl$tax_id, nodes.tbl) %>%
  branch_taxid2name(names.tbl) %>%
  branch_list2table()
View(species.tbl)

From this we first learn that Ruminococcus gnavus is actually denoted [Ruminococcus] gnavus in the taxonomy, and listed under the genus Mediterraneibacter. Thus, we may perhaps expect it to change its name to Mediterraneibacter gnavus in the future. This is what you must learn to live with in prokaryotic taxonomy. Names change, and organisms are moved into new branches as we get a gradually clearer picture of this landscape.

2.1.5 Exercise - human taxonomy

The NCBI taxonomy covers all organisms, not only prokaryotes. Look up Homo sapiens and find all the taxa in the branch of our species! How many species are listed under the same genus as us?

2.1.6 Exercise solution

library(microclass)
library(tidyverse)
# Note: shell variables like $COURSES do not work in R, use the full path
names.tbl <- read_names_dmp("/mnt/courses/BIN310/module9/NCBI/names.dmp")
nodes.tbl <- read_nodes_dmp("/mnt/courses/BIN310/module9/NCBI/nodes.dmp")

homo.sapiens.tbl <- names.tbl %>%
  filter(name_txt == "Homo sapiens")
homo.sapiens.lst <- branch_retrieve(homo.sapiens.tbl$tax_id[1], nodes.tbl) %>%
  branch_taxid2name(names.tbl)
print(homo.sapiens.lst)

n.tbl <- subset_clade(9605, nodes.tbl) %>%    # 9605 is the tax_id for genus Homo
  filter(rank == "species")
species.tbl <- branch_retrieve(n.tbl$tax_id, nodes.tbl) %>%
  branch_taxid2name(names.tbl) %>%
  branch_list2table()
View(species.tbl)

Notice the huge number of ranks in the branch of Homo sapiens. There are 4 species listed under Homo, but should it not be more?





2.2 Metagenome classification

Let us again start by having sequenced a microbial community, such that we have some fastq file(s) of reads. From this, we want to say something about

  • which taxa are there?
  • and also, in which relative quantities?

How should we approach this? Should we try to assemble the reads into longer contigs, and then try to find out what taxon each contig is from? This is almost what we did in module 8 (assembly+binning). It might sound like a good idea since it should be easier to ‘see’ what taxon a longer sequence is from, compared to a (short) read. However, we saw that only the more abundant genomes will be (partly) re-constructed this way. Even if the lower-abundant taxa in there do not have enough reads to be (partly) re-constructed, it may still be possible to detect their presence, and estimate their abundance.

When composition is the main focus, it is more common to classify the reads directly, without any attempt to assemble them first. Once we start to assemble, we also run the risk of reads from different taxa ending up in the same contig (misassemblies), obscuring the classification.

A big difference from the binning we saw in module 8 is the existence of a database. Binning is a clustering of contigs; we never assign them to pre-defined categories, but make up the “categories” (=bins) as we go along. Note that when we start binning we never know how many, or which, bins we will find. It is unsupervised learning, for those familiar with machine learning terminology.

Taxonomy classification is supervised learning, which means we have some pre-defined classes (=taxa) that we want to recognize, i.e. we have a database of sequences whose taxa we already know. In order to recognize a read from, say, E. coli we need to have seen some E. coli sequences before, and know what to look for. If our database does not contain any E. coli sequences, we cannot recognize it. Thus, all the metagenome classification methods work along the lines of

  1. Match/map each read, in some way, to a sequence database with known taxa
  2. If it matches ‘well enough’ to some taxon, we assign the read to this taxon

It should be apparent that the database of sequences becomes very important for the success of our classification.

In many cases a (short) read is not specific enough to match one and only one taxon. A common approach is then to use what is named the Lowest Common Ancestor (LCA) algorithm. This means that

  1. If a read matches both taxon A and B, then work up the taxonomy tree to the rank where their branches meet, and assign to the taxon at that rank.

As an example, a read matches both Escherichia coli and Salmonella enterica, and we cannot assign it to either species. We move up to the genus rank, but the read matches the genera Escherichia and Salmonella equally well. We then move up to the family rank, find that both are Enterobacteriaceae, and assign the read to this family. This means some reads are assigned to species, some to genera, some to families, and so on. This LCA approach has become a standard, and most taxonomy classifiers use it.
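The LCA logic is simple enough to sketch in a few lines of R, using the nodes.tbl from section 2.1. This is only an illustration of the idea (the classifiers implement it far more efficiently), and the functions below are hypothetical helpers, not part of any package:

# Walk from a tax_id up to the root, collecting the branch
path_to_root <- function(tax_id, nodes.tbl){
  path <- tax_id
  while(tax_id != 1){   # tax_id 1 is the root of the NCBI taxonomy
    tax_id <- nodes.tbl$parent_tax_id[nodes.tbl$tax_id == tax_id]
    path <- c(path, tax_id)
  }
  return(path)
}

# The LCA is the first taxon shared by both branches on the way up
lca <- function(tax_id_1, tax_id_2, nodes.tbl){
  p1 <- path_to_root(tax_id_1, nodes.tbl)
  p2 <- path_to_root(tax_id_2, nodes.tbl)
  return(p1[which(p1 %in% p2)[1]])
}

lca(562, 28901, nodes.tbl)   # E. coli vs S. enterica, should give 543,
                             # the tax_id of the family Enterobacteriaceae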

2.3 Classification with kraken2 and bracken

One of the popular tools for metagenome classification is kraken2. It is also quite common to post-process the results we get from this tool with the software bracken to get the final results.

Let us first have a look at the original kraken paper. The kraken2 tool has been extended slightly, but mostly to obtain better speed and memory usage, and the original paper describes the idea we are interested in: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4053813/.

Again, we focus on the big picture in the Materials and Methods section, supported by Figure 1.

  • How are K-mers used by kraken?
  • What is meant by a classification path?
  • What determines the taxon to which a read is finally classified?

As mentioned above, it is quite common to post-process the kraken2 results with bracken. The paper describing the bracken software is found here: https://peerj.com/articles/cs-104/. Let us focus on

  • How is Bayes theorem used to re-assign reads?

In this video I have tried to give the big picture of how these two methods work as one.

2.4 Running kraken2 and using enough memory

Let us run the kraken2 software, and try to classify the Illumina reads we worked on in module 8. Here is the shell script we start out with:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=50G
#SBATCH --time=01:00:00
#SBATCH --job-name=kraken2
#SBATCH --output=kraken2_%j.log

##############
### Settings
###
kraken_dbase=/mnt/databases/kraken2/Standard
threads=10
out_dir=$SCRATCH/kraken2
krk2_report=$out_dir/kraken2_report.txt
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
if [ ! -d $out_dir ]
then
  mkdir $out_dir
fi

#####################
### Running kraken2
###
apptainer exec $HOME/module9/kraken2:2.1.3--pl5321hdcf5f25_0.sif kraken2 \
--threads $threads --use-names --db $kraken_dbase --report $krk2_report  \
--paired $r1 $r2 > $out_dir/delete_me.txt

Briefly on the content of this script: We reserve 50 gigabytes, and hope this is enough. When running kraken2 we specify --use-names in order to have names in addition to the tax_ids in the output. The standard kraken2 database uses the NCBI Taxonomy, by the way.

We also specify --report $krk2_report to get a summary file of the results. The last output, that we redirect to $out_dir/delete_me.txt, is not something we will look into. This lists the results for each single read-pair, and we just delete this file (hence the name!) since we already have the summary report file with all the results we need.

Since we have paired-end reads we specify the --paired option, resulting in each read-pair being classified. We could also have supplied single read files (e.g. nanopore data), just remove the --paired option and specify one file with reads instead of two.

Make the shell script based on the code above, save it in your module9 folder and sbatch.

You should now find that this actually crashes! Inspect the log-file; here is what I find:

Loading database information.../var/tmp/slurmd/job12875115/slurm_script: line 31: 26810 Killed                  apptainer exec kraken2:2.1.3--pl5321hdcf5f25_0.sif kraken2 --threads $threads --use-names --db $kraken_dbase --report $krk2_report --paired $r1 $r2 > $out_dir/delete_me.txt
slurmstepd: error: Detected 3 oom_kill events in StepId=12875115.batch. Some of the step tasks have been OOM Killed.

Whenever you see an oom_kill event related to an error in your log-file, it indicates too little memory was reserved (OOM = Out Of Memory). OK, so we need to use more memory, but how much more? It is most likely the size of the database that is critical here. Inspect the folder where we have the database (see path above):

ls -alh /mnt/databases/kraken2/Standard

The file hash.k2d is the database that is loaded when we run kraken2. From my listing this is 67GB. We also need some memory to read the fastq files, let us try to increase the memory to 75GB, and re-run. This should be enough to give us some results. Always check the size of the database to get an idea of how much memory you need.
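If you prefer checking this from R, file.size() gives the same information as ls -alh:

# Size of the kraken2 hash table in gigabytes; reserve this plus some headroom
file.size("/mnt/databases/kraken2/Standard/hash.k2d") / 1e9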

We will extend this script shortly, so don't delete any of the results in the $SCRATCH folder yet…

Let us read the report from kraken2 into R, and have a look (remember to edit the path below to read your file):

library(tidyverse)
krk.tbl <- read_delim("/mnt/SCRATCH/larssn/kraken2/kraken2_report.txt", delim = "\t",
                      col_names = c("percent", "clade_count", "tax_count", "rank", "tax_id", "name"),
                      trim_ws = T)

The column names used here are my own ‘invention’, the file has no column names.

The first column is a percent of the total number of read pairs, and columns clade_count and tax_count are the read counts. The clade_count is the number of read pairs assigned to all taxa in the clade starting at the given taxon, while tax_count are the reads assigned to this specific taxon. This means the clade_counts are always rather large values when we are high up in the hierarchy, since these are the sum of all reads assigned to all taxa in the branches below that taxon.
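As a small sanity check of this relationship, the clade_count on the root row should equal the total number of classified read pairs, i.e. the sum of tax_count over all rows except the unclassified one (a sketch, assuming krk.tbl has been read as above):

# All classified read pairs are counted exactly once in some tax_count...
krk.tbl %>%
  filter(rank != "U") %>%
  summarise(total = sum(tax_count))

# ...and the root row accumulates them all in its clade_count
krk.tbl %>%
  filter(name == "root") %>%
  pull(clade_count)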

The rank is given with a single letter as

Letter Meaning
U Unclassified
R root
K Kingdom
D Domain
P Phylum
C Class
O Order
F Family
G Genus
S Species

Note that ‘root’ is not really a rank, it just symbolizes the start of the taxonomy tree (“some living organism”). You also find some ranks with an integer, e.g. S1. This means a sub-rank, e.g. a C1 would mean some sub-Class, which is an intermediate rank between Class and Order. The S1 is a category below species (e.g. strain or sub-species).

2.4.1 Exercise

Make an R script with the R code above for reading the kraken2 report file. Remember to edit the path to read your report file. Inspect the krk.tbl in RStudio. How large a fraction of the read pairs did kraken2 not recognize (unclassified)?

How many different taxa did kraken2 assign some reads to? This is the same as the number of rows where tax_count is larger than 0 (minus the row of unclassified reads, which is not a taxon).

How many read pairs did kraken2 assign to each rank? First, to avoid all the sub-ranks, mutate the rank texts such that the numbers are removed, e.g. replace all S1 by just S and so on. Use str_remove(). To specify the pattern to remove, use [0-9]+, which is a pattern that matches all positive integers. Next, use group_by(rank) and then a summarise(tax_count = sum(tax_count)) to count the number of read pairs for each rank.

Make a bar chart of the results, and order the ranks in the order you see in the table above. How many read-pairs are assigned to the various ranks?

2.4.2 Exercise solution

library(tidyverse)

krk.tbl <- read_delim("/mnt/SCRATCH/larssn/kraken2/kraken2_report.txt", delim = "\t",
                      col_names = c("percent", "clade_count", "tax_count", "rank", "tax_id", "name"),
                      trim_ws = T)
cat("The fraction of unclassified reads is", krk.tbl$percent[1] / 100, "\n")
cat("kraken2 assigns reads to", sum(krk.tbl$tax_count > 0) - 1, "taxa\n")

rank.tbl <- krk.tbl %>%
  mutate(rank = str_remove_all(rank, "[0-9]+")) %>%
  group_by(rank) %>%
  summarise(tax_count = sum(tax_count)) %>%
  mutate(rank = factor(rank, levels = c("U", "R", "K", "D", "P", "C", "O", "F", "G", "S")))
fig <- ggplot(rank.tbl) +
  geom_col(aes(x = rank, y = tax_count))
print(fig)



2.5 Running bracken on the kraken2 output

If you did the last exercise above, you will see that not all reads are assigned to the lowest rank, Species. Even if all the genomes in the kraken2 database have a species name, some reads are assigned at higher ranks due to the LCA algorithm (see above). The bracken software has been developed to re-assign reads from these higher ranks down to the lower ranks, in order to get a complete result for the composition at one selected rank, e.g. Species.
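To get a feel for what this re-assignment does, here is a deliberately simplified toy example in R. The real bracken uses Bayes' theorem with probabilities derived from the k-mer content of the database genomes; this sketch just redistributes genus-level reads proportionally to the reads already assigned directly to each species:

genus_reads <- 100                              # read pairs left at the genus by LCA
species_reads <- c(A = 600, B = 300, C = 100)   # read pairs assigned directly to species

# Redistribute the genus reads proportionally to the species-level counts
new_species_reads <- species_reads + genus_reads * species_reads / sum(species_reads)
print(new_species_reads)   # A: 660, B: 330, C: 110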

To run bracken we typically just add some code to the existing kraken2 script:

##############
### Settings
###
brk_report=$out_dir/bracken_report.txt
target_rank=S

#####################
### Running bracken
###
apptainer exec bracken:2.8--py39hc16433a_0.sif bracken \
-d $kraken_dbase -i $krk2_report -o $brk_report -l $target_rank -r 100 -t 1

#########################
### Copying and cleaning
###
mv $brk_report $(basename $brk_report)
rm $out_dir/delete_me.txt

Here we tell bracken to aggregate reads at the Species rank (target_rank=S). In the command line we use the option -l $target_rank to specify this. You may aggregate to any rank, using the one-letter codes for the various ranks we saw in the small table above, but we focus on Species now.

The output file bracken_report.txt is what we are interested in, rather than the kraken2_report.txt we inspected above, and we move this back to the folder we are working in (the module9 folder).

2.5.1 Exercise - the bracken results

Add the bracken code to the kraken2 script, and re-run the entire script, using first kraken2 and then bracken.

Read the file brk_report.txt into R, using this code:

library(tidyverse)

brk.tbl <- read_delim("bracken_report.txt", delim = "\t") %>% 
  rename(tax_id = taxonomy_id, rank = taxonomy_lvl, kraken2_reads = kraken_assigned_reads,
         bracken_reads = new_est_reads, bracken_abundance = fraction_total_reads) %>% 
  arrange(desc(bracken_abundance))

The table brk.tbl contains the final classification results, giving us the composition at the Species rank in this case. Here we can see how many reads have been classified to the various species (the bracken_reads column) and how large a fraction this is of the total number of reads (the bracken_abundance column).

In this metagenome we know for certain there are exactly 30 species.

How many species are detected in total?

How many of these have been assigned more than 0.1% of the reads?

2.5.2 Exercise solution

library(tidyverse)

brk.tbl <- read_delim("bracken_report.txt", delim = "\t") %>%
  rename(tax_id = taxonomy_id, rank = taxonomy_lvl, kraken2_reads = kraken_assigned_reads,
         bracken_reads = new_est_reads, bracken_abundance = fraction_total_reads) %>%
  arrange(desc(bracken_reads))
cat("Total number of species:", nrow(brk.tbl), "\n")
cat("Number of species with at least 0.1% abundance:", nrow(brk.tbl %>% filter(bracken_abundance >= 0.001)), "\n")



Notice how we

  • First use kraken2 to classify reads to taxa…
  • …then use bracken on the kraken2 results to aggregate all reads to some specified rank.

The results in the bracken output file are estimates of

  • Which species are in this metagenome
  • And at which abundances

The list of species is very long! This is typical for this, and many other, taxonomic classification methods. Notice most of the species have very few reads assigned, i.e. they make up a tiny fraction of the total (last column). If we count all these species as ‘positives’ (species found in our metagenome), we probably have a huge number of false positives. It is very likely that most of the species with very low read count are just artefacts of the sequencing and the method we used. Remember we classify every single read pair. Reads with some sequencing error, coming from species A, may then very well match a closely related species B even better. Thus, listing all species with some tiny read count as present in the metagenome will result in many false positives. Often we simply use a threshold at some low abundance, and report something like: “We found x species with abundance above y% in the metagenome”. Species with lower abundances may be in there, but probably have little impact on the microbial community anyway.
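In R, such a threshold-based report is just a filter on the brk.tbl from above:

# Report only species with at least 0.1% abundance
detected.tbl <- brk.tbl %>%
  filter(bracken_abundance >= 0.001)
cat("We found", nrow(detected.tbl), "species with abundance of at least 0.1%\n")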

2.6 The kraken2 and bracken database

The kraken2 software allows us to make our own database, and use this for classification. This is not something we will dig into here, but it should be mentioned, since it turns out to be quite important for all practical uses. You will never recognize a taxon which is not in the database! Also, if one branch in the taxonomy has a lot of genomes in the database, and another has few, this tends to affect the results we get.

Making custom databases for various types of environments is therefore important, to get as good classifications as possible. Some of us at KBM published such a resource for studying the human gut environment some time ago, see this scientific publication in Microbiome and also an article about this in Forskning.no. Note how this last article includes pictures of scientists in white lab coats, while in reality this was a pure dry-lab work, i.e. 100% bioinformatics!

Once you have a kraken2 database, bracken also has commands for building its own ‘database’ inside the kraken2 database folder. If you inspect the folder at /mnt/databases/kraken2/Standard (that we used above) you can see the files it contains. The raw data (genome fasta files) that were used have been deleted, since these take up a lot of disk space.





3 Evaluating taxonomic classification

How good are the classifications we get, from the methods above or any other methods?

To answer this, we first need some gold standard data sets where we know exactly

  • Which taxa are in there
  • And in which relative quantities

Such data sets exist. We can simulate data sets by mixing some genomes and then using sequencing simulation software to produce the reads. This is exactly what we have done to produce the data set we used in module 8 (the metagenome30 fastq files). However, simulated data are often slightly ‘nicer’ than real data, and the results we get are probably too good to reflect real life results.

Another, more expensive and cumbersome, approach is to construct mock communities. This means we mix the organisms to known relative quantities in vitro in the lab, and then sequence this using the normal procedures. Such data are very close to real life metagenome data, but the actual compositions are not as precisely determined as for simulated data, since it is difficult to control this perfectly in vitro. Also, such mock communities always have rather few taxa (for practical reasons), and in this way do not resemble real data where we may have hundreds or even thousands of species.

Anyway, results of the type we got from bracken above must then be compared to this gold standard in some way, and there are several statistics we may compute indicating how good/bad the results turned out.

3.1 The OPAL paper

Let us look at the following paper on how to compare the output from such classification tools to a gold standard: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1646-y. Let us focus on the section Metrics and accompanying visualizations.

Here is a short video explaining the basic statistics we will use.

In the tab-delimited file $COURSES/BIN310/module9/gold_standard.txt we have the actual composition of species in the metagenome30 data set. Let us read this into R:

gold.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt", delim = "\t")
head(gold.tbl)

The third column is the actual abundance for each taxon, and if you sum all these values they sum to 1.0 (sum(gold.tbl$true_abundance)).

To compare the bracken results (or any other result) to this, we need to add such abundances as a new column, and make certain

  • The abundances are put into the correct row
  • The abundances are relative values that sum to 1.0

Once we have such a table, we can compare the bracken results to the gold standard. If we have other results, from other methods, we could add these as well, and compare these to the gold standard in the exact same way.

3.1.1 Exercise - join bracken results to the gold standard

Make an R script where you first read the gold standard as above, but select only the columns tax_id and true_abundance.

Then, read the bracken results into another table, as we did above, and again, select only the columns tax_id and bracken_abundance.

Notice both tables have a column tax_id with the same type of information. Next, join the two tables by the tax_id column as the joining information, use full_join(___, ___, by = "tax_id") and store this in joined.tbl.

After such a joining, we will typically find some NAs (missing values) in the columns listing abundances. This is because the tax_ids in the table now include all those from both tables. If a tax_id was found in only one of the tables we joined, the columns coming from the other table will be filled with NA in the corresponding row. All these NAs should be replaced by 0. In this case a missing value is the same as zero abundance. You must not infer from this that NA always means the value 0! Here NA means ‘not counted’ which is the same as ‘counted zero times’. If you measured something, e.g. lengths, a missing measurement does not mean you measured the length zero! To replace all NA with 0 use joined.tbl <- joined.tbl %>% replace(is.na(.), 0).

3.1.2 Exercise - \(L_1\) norm, precision and recall

Based on the joined.tbl, compute the \(L_1\) norm between the abundances predicted by bracken and the gold standard abundances. See the formula in the OPAL paper and the video above.
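In symbols: if \(x_i\) is the gold standard abundance of taxon \(i\) and \(\hat{x}_i\) the abundance predicted by some tool, then \(L_1 = \sum_i |x_i - \hat{x}_i|\). Since both abundance vectors sum to 1.0, this norm ranges from 0 (identical compositions) to 2 (no taxa in common).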

We would like to compute precision and recall from such results. Here is a function for computing the precision:

precision <- function(true_abd, predicted_abd, threshold = 0){
  actual <- which(true_abd > threshold)
  predicted <- which(predicted_abd > threshold)
  TP <- sum(actual %in% predicted)
  FP <- sum(!(predicted %in% actual))
  return(TP / (TP + FP))
}

Here true_abd must be a vector (=column) with the gold standard abundances, and predicted_abd another vector with some abundances from some method, e.g. bracken. The threshold is the abundance limit between Presence and Absence of a taxon, i.e. an abundance as low as this threshold is considered the same as Absent from the metagenome. It is by default 0 here (as in the OPAL paper), but the user may choose a higher threshold. Add this function to your R script, and use it to compute the precision for threshold 0 and then for threshold 0.001.

Try to make a similar function for computing the recall, by copying and modifying the precision function above. Then use it to compute the recall for the same bracken results, using the same two thresholds as above.

3.1.3 Exercise solution

library(tidyverse)

### Solution to first exercise
gold.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt", delim = "\t") %>%
  select(tax_id, true_abundance)
brk.tbl <- read_delim("bracken_report.txt", delim = "\t") %>%
  rename(tax_id = taxonomy_id, rank = taxonomy_lvl, kraken2_reads = kraken_assigned_reads,
         bracken_reads = new_est_reads, bracken_abundance = fraction_total_reads) %>%
  select(tax_id, bracken_abundance)
joined.tbl <- full_join(gold.tbl, brk.tbl, by = "tax_id") %>%
  replace(is.na(.), 0)

### Solution to second exercise
L1.norm <- sum(abs(joined.tbl$true_abundance - joined.tbl$bracken_abundance))
cat("BRACKEN RESULTS:\n")
cat("The L_1 norm is", L1.norm, "\n")

precision <- function(true_abd, predicted_abd, threshold = 0){
  true <- which(true_abd > threshold)
  predicted <- which(predicted_abd > threshold)
  TP <- sum(true %in% predicted)
  FP <- sum(!(predicted %in% true))
  return(TP / (TP + FP))
}
recall <- function(true_abd, predicted_abd, threshold = 0){
  true <- which(true_abd > threshold)
  predicted <- which(predicted_abd > threshold)
  TP <- sum(true %in% predicted)
  FN <- sum(!(true %in% predicted))
  return(TP / (TP + FN))
}
pres.0 <- precision(joined.tbl$true_abundance, joined.tbl$bracken_abundance)
rec.0 <- recall(joined.tbl$true_abundance, joined.tbl$bracken_abundance)
cat("With threshold 0: Precision =", pres.0, " and Recall =", rec.0, "\n")
pres.0.001 <- precision(joined.tbl$true_abundance, joined.tbl$bracken_abundance, threshold = 0.001)
rec.0.001 <- recall(joined.tbl$true_abundance, joined.tbl$bracken_abundance, threshold = 0.001)
cat("With threshold 0.001: Precision =", pres.0.001, " and Recall =", rec.0.001, "\n")





4 The kaiju software

Let us have a look at an alternative to kraken2/bracken called kaiju.

What is special about kaiju is that it uses protein sequences for classification. Thus, reads are translated into amino acid sequences in all six reading frames, and then compared to a database of proteins. This comparison is done using the Burrows-Wheeler transform, which we first saw in module 5.
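Six-frame translation itself is nothing kaiju-specific, and can be sketched with the Bioconductor package Biostrings (assuming it is installed; kaiju of course does this internally in compiled code):

library(Biostrings)

read <- DNAString("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
six_frames <- lapply(0:5, function(f){
  s <- if(f < 3) read else reverseComplement(read)   # frames 4-6 use the reverse strand
  w <- subseq(s, start = (f %% 3) + 1)               # shift the reading frame
  w <- subseq(w, 1, 3 * (length(w) %/% 3))           # trim any incomplete trailing codon
  translate(w)
})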

The major idea behind the use of proteins for taxonomic classification is to get some results even if our data are from completely new environments, where we have very few good reference genomes in our databases. A tool like kraken2 requires that the database contains, more or less, the same taxa that we have been sequencing. This was also observed by the people behind kaiju when they analyzed data from some hot springs, where the microbial community contains taxa that are poorly represented among sequenced genomes. A tool like kraken2 will then leave the vast majority of reads as unclassified. However, proteins are more conserved, and even if the genomes are new to us, the proteins are often similar. Then, instead of leaving a vast number of reads unclassified, we may still give them some taxonomic assignment which is perhaps not exactly correct, but which gives us a good picture of what they resemble among the genomes we do know.

I think we can say that

  • You may use kaiju if your data are from poorly studied environments, where you expect to find many genomes with poor match to already sequenced genomes (e.g. soils, water, air etc).
  • The results you get by kaiju tend to have a lower resolution, i.e. we cannot expect to separate closely related taxa, since their proteins are even more similar than their genomes.



4.1 Using kaiju

Let us run kaiju on our metagenome30 data, just like we did for kraken2:

#!/bin/bash

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=130G
#SBATCH --time=01:00:00
#SBATCH --job-name=kaiju
#SBATCH --output=kaiju_%j.log

#############
### Settings
###
threads=10
r1=$COURSES/BIN310/module8/metagenome30_R1.fastq.gz
r2=$COURSES/BIN310/module8/metagenome30_R2.fastq.gz
kaiju_dbase=/mnt/databases/kaiju/nr_NCBI/
kaiju_fmi=$kaiju_dbase/kaiju_db_nr_euk.fmi
kaiju_nodes_dmp=$kaiju_dbase/nodes.dmp
kaiju_names_dmp=$kaiju_dbase/names.dmp
out_dir=$SCRATCH/kaiju
kaiju_report=$out_dir/kaiju_report.txt
target_rank=species
if [ ! -d $out_dir ]
then
  mkdir $out_dir
fi

##################
### Running kaiju
###
apptainer exec $HOME/module9/kaiju:1.9.2--h5b5514e_1.sif kaiju \
-v -z $threads -t $kaiju_nodes_dmp -f $kaiju_fmi -i $r1 -j $r2 -o $out_dir/delete_me.txt

#####################################
### ...and aggregating to
### target rank
###
apptainer exec $HOME/module9/kaiju:1.9.2--h5b5514e_1.sif kaiju2table \
-t $kaiju_nodes_dmp -n $kaiju_names_dmp -r $target_rank -o $kaiju_report $out_dir/delete_me.txt

###############
### Copying
###
cp $kaiju_report $(basename $kaiju_report)

Here we first run kaiju using the fastq files in $r1 and $r2 as input, and producing the output file $out_dir/delete_me.txt. Then, we use the tool kaiju2table, which is also supplied in the same container, to aggregate the read counts to some target rank. Here we have chosen the rank species as we did for bracken. This is perhaps a little too low in the taxonomy, since it is difficult to separate species based on proteins alone. It is perhaps more common to aggregate to the rank genus (or even higher) when using this tool. We are only interested in the file $kaiju_report that is produced after this step, and this we copy back to our module9 folder; the large $out_dir/delete_me.txt file can then be deleted.

The database we used above is the nr database from NCBI, and the file kaiju_db_nr_euk.fmi is the Burrows-Wheeler FM index for this database. We notice that kaiju also uses the NCBI Taxonomy database, and has its own version of the files names.dmp and nodes.dmp that we supply as input. Notice we need to reserve much more memory than for kraken2. Inspect the database folder, and see how large the FM index is!

Just like kraken2, kaiju is supplied with some ways of installing a number of databases, and nr is the most comprehensive one. Unfortunately, it is not as straightforward to customize your own database as it is for kraken2. But perhaps the need for this is not as acute? Since this is a tool we use to cover as widely as possible, the nr database seems to be the natural choice.

4.1.1 Exercise - compare kaiju results to the gold standard

Make a shell script from the code above, and sbatch to get results from kaiju.

Let us extend the R script where we compared the bracken results to the gold standard. Read the kaiju report file (it is also a tab-delimited text file), select and rename two of the columns by select(tax_id = taxon_id, kaiju_abundance = percent). Note how we can both select and rename a column in one go.

Then compute relative abundances as mutate(kaiju_abundance = kaiju_abundance / sum(kaiju_abundance)) and store the results in kaiju.tbl.

Join this table to the joined.tbl as we did above (joining by "tax_id"). Compare these results to the gold standard in a similar way to what we did for the bracken results above (\(L_1\) norm, precision and recall). Are these results closer to the gold standard?

4.1.2 Exercise solution

This is the full script after we add this new code to the script from above, where we read the gold standard and bracken results:

library(tidyverse)

### Read gold standard
gold.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt", delim = "\t") %>%
  select(tax_id, true_abundance)

## Read bracken results, and join with gold standard
brk.tbl <- read_delim("bracken_report.txt", delim = "\t") %>%
  rename(tax_id = taxonomy_id, rank = taxonomy_lvl, kraken2_reads = kraken_assigned_reads,
         bracken_reads = new_est_reads, bracken_abundance = fraction_total_reads) %>%
  select(tax_id, bracken_abundance)
joined.tbl <- full_join(gold.tbl, brk.tbl, by = "tax_id") %>%
  replace(is.na(.), 0)

### Read kaiju results and join with the existing joined.tbl
kaiju.tbl <- read_delim("kaiju_report.txt", delim = "\t") %>%
  select(tax_id = taxon_id, kaiju_abundance = percent) %>%
  mutate(kaiju_abundance = kaiju_abundance / sum(kaiju_abundance))
joined.tbl <- full_join(joined.tbl, kaiju.tbl, by = "tax_id") %>%
  replace(is.na(.), 0)

### Define functions to be used below:
precision <- function(true_abd, predicted_abd, threshold = 0){
  true <- which(true_abd > threshold)
  predicted <- which(predicted_abd > threshold)
  TP <- sum(true %in% predicted)
  FP <- sum(!(predicted %in% true))
  return(TP / (TP + FP))
}
recall <- function(true_abd, predicted_abd, threshold = 0){
  true <- which(true_abd > threshold)
  predicted <- which(predicted_abd > threshold)
  TP <- sum(true %in% predicted)
  FN <- sum(!(true %in% predicted))
  return(TP / (TP + FN))
}

### Results for bracken:
L1.norm <- sum(abs(joined.tbl$true_abundance - joined.tbl$bracken_abundance))
cat("BRACKEN RESULTS:\n")
cat("The L_1 norm is", L1.norm, "\n")
pres.0 <- precision(joined.tbl$true_abundance, joined.tbl$bracken_abundance)
rec.0 <- recall(joined.tbl$true_abundance, joined.tbl$bracken_abundance)
cat("With threshold 0: Precision =", pres.0, " and Recall =", rec.0, "\n")
pres.0.001 <- precision(joined.tbl$true_abundance, joined.tbl$bracken_abundance, threshold = 0.001)
rec.0.001 <- recall(joined.tbl$true_abundance, joined.tbl$bracken_abundance, threshold = 0.001)
cat("With threshold 0.001: Precision =", pres.0.001, " and Recall =", rec.0.001, "\n")

### Results for kaiju:
cat("KAIJU RESULTS:\n")
L1.norm <- sum(abs(joined.tbl$true_abundance - joined.tbl$kaiju_abundance))
cat("The L_1 norm is", L1.norm, "\n")
pres.0 <- precision(joined.tbl$true_abundance, joined.tbl$kaiju_abundance)
rec.0 <- recall(joined.tbl$true_abundance, joined.tbl$kaiju_abundance)
cat("With threshold 0: Precision =", pres.0, " and Recall =", rec.0, "\n")
pres.0.001 <- precision(joined.tbl$true_abundance, joined.tbl$kaiju_abundance, threshold = 0.001)
rec.0.001 <- recall(joined.tbl$true_abundance, joined.tbl$kaiju_abundance, threshold = 0.001)
cat("With threshold 0.001: Precision =", pres.0.001, " and Recall =", rec.0.001, "\n")

4.1.3 Exercise - stacked bar chart plot

Once we have a table like joined.tbl it is customary to plot the abundances as stacked bar charts. However, we should not use all taxa, since this makes the plot messy. Add code to the script from above to make the plot, and let us do this step by step. You should also run this code and inspect the results after each new step, to understand how the code works:

  • Filter the joined.tbl table to keep only the gold standard taxa, i.e. the rows where the gold standard has positive abundance. Name this new table pruned.tbl.
  • Add the names of these taxa as a new column, since names are better than tax_id to display in plots. The names are in the original gold standard table. Read the gold standard table again, select only the name and tax_id, and then full_join() with your pruned.tbl, again using tax_id as the joining column.
  • To make the plot appear nicer, mutate the name-column to a factor, and use the names as levels in the order they appear. Try also to comment this out, and run the plotting code without this statement, to see the difference. It has to do with the ordering of the colors in the plot.
  • Use pivot_longer() to stack the abundances into a single column named Abundances, and with the column names in a new column named Tool. Name this table long.tbl.
  • Plot the long.tbl, with ggplot() using geom_col() and map the fill-colors to the names.

Here is a skeleton for this code:

pruned.tbl <- joined.tbl %>% 
  filter(___ > 0)

pruned.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt", delim = "\t") %>% 
  select(___, ___) %>% 
  full_join(___, by = "tax_id")

___ <- pruned.tbl %>% 
  mutate(name = factor(___, levels = name)) %>% 
  pivot_longer(cols = c(true_abundance, ___, ___),
               values_to = "Abundances", names_to = "Tool")

fig <- ggplot(long.tbl) +
  geom_col(aes(x = Tool, y = Abundances, fill = name), color = "black") +
  coord_flip() +
  theme(legend.position = "bottom", legend.key.width = unit(10, "points")) +
  guides(fill = guide_legend(nrow = 10))
print(fig)

Fill in the missing parts, and run the code step-by-step. Inspect the results after each step, and try to understand what happens.

The plot should look something like this:

This plot reveals nothing about false positives, but it does illustrate to what extent the two methods give us true positives and false negatives, and especially if the abundances are close to that of the gold standard or not. A ‘perfect’ prediction of taxonomic composition would have all colored sectors of the exact same sizes as the gold standard.

4.1.4 Exercise solution

Add to the script from the previous exercise:

pruned.tbl <- joined.tbl %>%
  filter(true_abundance > 0)

pruned.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt", delim = "\t") %>%
  select(tax_id, name) %>%
  full_join(pruned.tbl, by = "tax_id")

long.tbl <- pruned.tbl %>%
  mutate(name = factor(name, levels = name)) %>%
  pivot_longer(cols = c(true_abundance, bracken_abundance, kaiju_abundance),
               values_to = "Abundances", names_to = "Tool")

fig <- ggplot(long.tbl) +
  geom_col(aes(x = Tool, y = Abundances, fill = name), color = "black") +
  coord_flip() +
  theme(legend.position = "bottom", legend.key.width = unit(10, "points")) +
  guides(fill = guide_legend(nrow = 10))
print(fig)





5 Classifying genomes

In module 8 we did some assembly of metagenome reads, and then some binning and quality check of the resulting contigs in order to re-construct some of the more abundant genomes in the microbial community. Once we have such MAGs (Metagenome Assembled Genomes), we would also like to assign these to some taxon, i.e. do a taxonomic classification of entire genomes.

5.1 The gtdbtk tool

In principle, we could use a tool like kraken2, but this is built for classifying a huge number of shorter reads. Having only a few very long sequences (contigs), there are other approaches that are likely better, also taking into account the information about genes in the MAG. One such tool is gtdbtk. This will take a genome (or a set of binned contigs) as input, and use the GTDB database (https://gtdb.ecogenomic.org/) to assign it to some taxon. Note that GTDB is not the same as the NCBI Taxonomy, and some taxa may have different names and/or be in a different clade here.

Let us use the software gtdbtk to classify the MAGs we re-constructed in module 8. In $COURSES/BIN310/module9/MAG/ you find a copy of the MAGs we found, or you may use your own results. Download the container as usual.

In order to classify, gtdbtk must of course also have a database. In many software tools there is an option where we simply give the path to where the database is found on the system. Here we have a different solution.

First, make the following script, save it in your module9 folder and sbatch:

#!/bin/sh

#SBATCH --nodes=1
#SBATCH --reservation=BIN310
#SBATCH --account=bin310
#SBATCH --ntasks=10
#SBATCH --mem=180G
#SBATCH --time=02:00:00
#SBATCH --job-name=gtdbtk
#SBATCH --output=gtdbtk_%j.log

##############
### Settings
###
threads=10
genomes_dir=$COURSES/BIN310/module9/MAG
sketch_file=$COURSES/BIN310/module9/gtdbtk_release207_v2.msh
out_dir=$SCRATCH/gtdbtk

###################
### Running GTDBTk
###
apptainer exec $HOME/module9/gtdbtk:2.3.2--pyhdfd78af_0.sif gtdbtk classify_wf \
 --genome_dir $genomes_dir \
 --out_dir $out_dir \
 --cpus $threads \
 --extension fasta \
 --mash_db $sketch_file

Notice we need quite a lot of memory since the database is large. Notice also how we supply a mash sketch file (see module 7); the mash software is obviously used internally here.

Notice we never specify any database. If you sbatch this you should find it crashes. Inspect the log-file; it should look something like:

[2023-09-18 14:26:09] INFO: GTDB-Tk v2.3.2
[2023-09-18 14:26:09] INFO: gtdbtk classify_wf --genome_dir /mnt/courses/BIN310/module9/MAG --out_dir /mnt/SCRATCH/larssn/gtdbtk --cpus 10 --extension fasta --mash_db /mnt/courses/BIN310/module9/gtdbtk_release207_v2.msh

================================================================================
                                     ERROR                                      
________________________________________________________________________________

          The 'GTDBTK_DATA_PATH' environment variable is not defined.           

            Please set this variable to your reference data package.            
           https://ecogenomics.github.io/GTDBTk/installing/index.html           
================================================================================
[2023-09-18 14:26:09] ERROR: Controlled exit resulting from early termination.

It turns out we have to create an environment variable named GTDBTK_DATA_PATH, and this should contain the correct path to the GTDB database on our system. The correct path is in our case /mnt/databases/GTDBTK/release207_v2/. In order to turn an ordinary shell variable into an environment variable we use the export statement. Add the following lines to your script, before you run gtdbtk:

##################################
### Setting environment variable
### with path to database
###
export GTDBTK_DATA_PATH=/mnt/databases/GTDBTK/release207_v2/

Save, and sbatch. Hopefully the computations are running now!

5.1.1 Exercise - the results

In the previous module we assembled, binned and de-replicated the bins to produce the 6 MAGs we now classified. The table of results from that exercise is found in the file /mnt/courses/BIN310/module9/MAG.tbl.RData. Let us now add the taxonomy information to this table. Make a script where you

  • Use load() on the RData file to load the MAG.tbl.
  • Read in the summary file (gtdbtk.bac120.summary.tsv) you find in the output folder from the above script.

In the summary file there are several columns you may have a look at, but we will focus only on the first two: The user_genome you may rename to MAG_id, since this corresponds to that column in the MAG.tbl. The second column (classification) contains the predicted taxonomy. Now, select these two columns only and join this to the MAG.tbl such that the latter table also has some taxonomy for each MAG.
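As a side note, the classification strings follow the GTDB convention of seven rank prefixes separated by semicolons (d__, p__, c__, o__, f__, g__, s__). Once you have joined classification into MAG.tbl, a sketch like this should split it into one column per rank:

library(tidyverse)

# Split "d__...;p__...;...;s__..." into one column per rank, then strip the prefixes
MAG.tbl %>%
  separate(classification,
           into = c("domain", "phylum", "class", "order", "family", "genus", "species"),
           sep = ";", fill = "right") %>%
  mutate(across(domain:species, ~ str_remove(.x, "^[a-z]__")))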

How does the species classification we got here correspond to the truth? Read in the gold standard as we did above (in /mnt/courses/BIN310/module9/gold_standard.txt) and see if the same species are listed among the most abundant taxa in that table. Since the NCBI Taxonomy and the GTDB taxonomy are not identical, the names may differ, but it could also be that gtdbtk mis-classifies some of the genomes.

5.1.2 Exercise solution

First, to complete the MAG.tbl with an additional taxonomy (classification) column:

library(tidyverse)

load("/mnt/courses/BIN310/module9/MAG.tbl.RData")
MAG.tbl <- read_delim("/mnt/SCRATCH/larssn/gtdbtk/gtdbtk.bac120.summary.tsv") %>%
  select(MAG_id = user_genome, classification) %>%
  right_join(MAG.tbl, by = "MAG_id")
View(MAG.tbl)

Next, to compare against the gold standard, add the lines:

gold.tbl <- read_delim("/mnt/courses/BIN310/module9/gold_standard.txt")
compare.tbl <- MAG.tbl %>%
  mutate(name = word(classification, -1, sep = "_")) %>%
  full_join(gold.tbl, by = "name")

It looks like 3 out of the 6 predicted species are found among the gold standard species. We also note that the MAG classified as Bifidobacterium vaginale may correspond to one of the other Bifidobacterium species in the gold standard, and the same for Stenotrophomonas pavanii (Stenotrophomonas maltophilia in the gold standard). Even if the exact species is not correct, the genus may be. We should keep an open mind with respect to this, since MAGs are in general uncertain in their nature, as we have seen.