Introduction

In this assignment you will estimate the species composition of a metagenome, using three different taxonomic classifiers, and then compare these results to a gold standard showing to see how close to the ‘true’ composition the different approaches come. This is typically something we do to evaluate the ‘goodness’ of the taxonomic classifiers.

This is based mostly on the topics from Module 9.

As before, you make a report in RMarkdown, and hand in the HTML-file on Canvas. As usual, you include essential code, figures and some short text discussing the results and the questions below.

Add links to GitHub or similar websites for the software tools you use. Think of this as a preliminary report in a larger scientific project.

The data

In $COURSES/BIN310/assignment4/fastq/ you find pairs of fastq files, one pair for each student username. You use the files with your username only. These files contain paired-end reads after Illumina shotgun sequencing of a metagenome.

In $COURSES/BIN310/assignment4/gold_standard/ you find corresponding text files with student user names, and again you only use the one having your username in it. This contains some columns with the gold standard composition of your metagenome, i.e. it lists the species and their relative abundances. The columns are separated by a blank (" ") in these files, not a tab.

Standard `kraken2` and `bracken`

First, make a script where you use kraken2 with default settings and its standard database to classify the reads in your fastq files. The standard database is found in /mnt/databases/kraken2/Standard.

Then use bracken on the kraken2 report to get final estimates of read counts for the taxa. Assign reads to the species rank in bracken.

Customized database

Next, repeat the procedure above, using kraken2 and then bracken, but this time use the custom database in $COURSES/BIN310/assignment4/krk2_db/. This database contains only a smallish set of genomes from which your data set contains a random subset. Thus, this can be thought of as very close to a ‘perfect’ database for classifying the data you have.

The `metaphlan` software

Next, classify your data with the software metaphlan. We have not used this in the modules, but you should be able to do this on your own.

Download a container from galaxy, and make certain you get a container for metaphlan version 4 (NB! not metaphlan2). Make a shell script for running metaphlan on your data, using the command line (you must edit the container path/name as usual):

apptainer exec $HOME/assignment4/metaphlan:4.0.6--pyhca03a8a_0.sif metaphlan \
$r1,$r2 --input_type fastq --bowtie2out $out_dir/bwt2.bz2 --nproc $threads \
--index mpa_vOct22_CHOCOPhlAnSGB_202212 --bowtie2db $metaphlan_db -o $out_file

You need of course to specify all shell variables used in this command line before the command line is executed in your script. Use metaphlan_db=/mnt/databases/metaphlan_vOct22, which is where we have the database for this tool. The output will be a simple text file. The rest you should manage on your own. Use the same number of threads, memory and wall time as you did for kraken2.

Find documentation on metaphlan on the internet, and write a short description (max 10 sentences) of the main idea behind how metaphlan classifies reads. Focus on

clade-specific genes
read-mapping

Compare species classifications

In this part you must compare the results you got above to the gold standard. Everything is done by R coding.

Read your gold standard table into R, and select only the columns tax_id and abundance (you may rename the latter when selecting it). From the exercises above, you should have 3 different results for classifications of the same metagenome:

Standard kraken2 and bracken.
Customized kraken2 and bracken.
The metaphlan results.

Read these results into R, and store them as tables.

The bracken tables give you some estimated relative abundances. Rename this column indicating the method/database behind the abundance estimates (e.g. bracken_standard_abundance or bracken_custom_abundance).

The metaphlan results have a different format. Inspect the result file, and figure out how to read this and have it as a table in R. Here are some hints:

Read the file using read_delim() and with a tab delimiter.
Skip the lines starting with #. The option comment = "#" in read_delim() will enable this.
Give the columns names as you can see from the last # line in the file.

Then, once you have a table in R:

Filter to keep only the rows where a species is listed. Inspect the last rows of the first column. How do you detect if a row specifies a species? Use filter() in combination with str_detect().
Extract the NCBI taxonomy ID for the species only, and store this in a separate column named tax_id. Again, you need to inspect the table, to understand how to do this. Hint: To specify the symbol "|" as a separator or pattern, you need to write fixed("|") in R, since this symbol has a special meaning in regular expressions (thus, never use | as the separator symbol, like metaphlan does!).
Convert the tax_id to numeric.
The metaphlan may list the same species two or more times! Keep only the row with the first occurrence of any species. If you group_by(tax_id) and then slice(1) you will keep only the first occurrence of each tax_id.
The relative abundances are given as percent, turn them into fractions that sum to 1.0, and name this column metaphlan_abundance.
Finally, select only the tax_id and the metaphlan_abundance columns and store in an object.

Then, full_join() each of the tables above to the gold standard table.

Based on the results in the table from above, compute the $L_1$ norm between the gold standard abundance and each of the other three. Compute also the precision and recall for the three methods. Use absence threshold 0. Report these values, and write a short discussion of them.

Make a stacked bar chart plot. The plot should only display the gold standard species, and their abundances in all the four cases.

Rubrics

In order to pass this the RMarkdown report must contain:

The shell code (script) for running kraken2 and bracken using the standard as well as the custom database. You may have this in the same script.
The shell code (script) for running metaphlan.
A short text describing the main idea behind how metaphlan computes the taxonomic composition.
The R code for reading the results, and computing the $L_1$ norm, the precision and recall, and reporting these values.
The R code to make the stacked bar chart, and the figure itself.
A short text discussing the results.

Assignment 4 - Taxonomic classifiers