Introduction

In this assignment you will estimate the species composition of a metagenome, using three different taxonomic classifiers, and then compare these results to a gold standard showing to see how close to the ‘true’ composition the different approaches come. This is typically something we do to evaluate the ‘goodness’ of the taxonomic classifiers.

This is based mostly on the topics from Module 9.

As before, you make a report in RMarkdown, and hand in the HTML-file on Canvas. As usual, you include essential code, figures and some short text discussing the results and the questions below.

Add links to GitHub or similar websites for the software tools you use. Think of this as a preliminary report in a larger scientific project.

The data

In $COURSES/BIN310/assignment4/fastq/ you find pairs of fastq files, one pair for each student username. You use the files with your username only. These files contain paired-end reads after Illumina shotgun sequencing of a metagenome.

In $COURSES/BIN310/assignment4/gold_standard/ you find corresponding text files with student user names, and again you only use the one having your username in it. This contains some columns with the gold standard composition of your metagenome, i.e. it lists the species and their relative abundances. The columns are separated by a blank (" ") in these files, not a tab.

Standard kraken2 and bracken

First, make a script where you use kraken2 with default settings and its standard database to classify the reads in your fastq files. The standard database is found in /mnt/databases/kraken2/Standard.

Then use bracken on the kraken2 report to get final estimates of read counts for the taxa. Assign reads to the species rank in bracken.

Customized database

Next, repeat the procedure above, using kraken2 and then bracken, but this time use the custom database in $COURSES/BIN310/assignment4/krk2_db/. This database contains only a smallish set of genomes from which your data set contains a random subset. Thus, this can be thought of as very close to a ‘perfect’ database for classifying the data you have.

The metaphlan software

Next, classify your data with the software metaphlan. We have not used this in the modules, but you should be able to do this on your own.

Download a container from galaxy, and make certain you get a container for metaphlan version 4 (NB! not metaphlan2). Make a shell script for running metaphlan on your data, using the command line (you must edit the container path/name as usual):

apptainer exec $HOME/assignment4/metaphlan:4.0.6--pyhca03a8a_0.sif metaphlan \
$r1,$r2 --input_type fastq --bowtie2out $out_dir/bwt2.bz2 --nproc $threads \
--index mpa_vOct22_CHOCOPhlAnSGB_202212 --bowtie2db $metaphlan_db -o $out_file

You need of course to specify all shell variables used in this command line before the command line is executed in your script. Use metaphlan_db=/mnt/databases/metaphlan_vOct22, which is where we have the database for this tool. The output will be a simple text file. The rest you should manage on your own. Use the same number of threads, memory and wall time as you did for kraken2.

Find documentation on metaphlan on the internet, and write a short description (max 10 sentences) of the main idea behind how metaphlan classifies reads. Focus on

Compare species classifications

In this part you must compare the results you got above to the gold standard. Everything is done by R coding.

Read your gold standard table into R, and select only the columns tax_id and abundance (you may rename the latter when selecting it). From the exercises above, you should have 3 different results for classifications of the same metagenome:

Read these results into R, and store them as tables.

The bracken tables give you some estimated relative abundances. Rename this column indicating the method/database behind the abundance estimates (e.g. bracken_standard_abundance or bracken_custom_abundance).

The metaphlan results have a different format. Inspect the result file, and figure out how to read this and have it as a table in R. Here are some hints:

Then, once you have a table in R:

Then, full_join() each of the tables above to the gold standard table.

Based on the results in the table from above, compute the \(L_1\) norm between the gold standard abundance and each of the other three. Compute also the precision and recall for the three methods. Use absence threshold 0. Report these values, and write a short discussion of them.

Make a stacked bar chart plot. The plot should only display the gold standard species, and their abundances in all the four cases.

Rubrics

In order to pass this the RMarkdown report must contain: