In this assignment you will estimate the species composition of a metagenome, using three different taxonomic classifiers, and then compare these results to a gold standard showing to see how close to the ‘true’ composition the different approaches come. This is typically something we do to evaluate the ‘goodness’ of the taxonomic classifiers.
This is based mostly on the topics from Module 9.
As before, you make a report in RMarkdown, and hand in the HTML-file on Canvas. As usual, you include essential code, figures and some short text discussing the results and the questions below.
Add links to GitHub or similar websites for the software tools you use. Think of this as a preliminary report in a larger scientific project.
In $COURSES/BIN310/assignment4/fastq/
you find pairs of
fastq files, one pair for each student username. You use the files with
your username only. These files contain paired-end reads after Illumina
shotgun sequencing of a metagenome.
In $COURSES/BIN310/assignment4/gold_standard/
you find
corresponding text files with student user names, and again you only use
the one having your username in it. This contains some columns with the
gold standard composition of your metagenome, i.e. it lists the species
and their relative abundances. The columns are separated by a blank
(" "
) in these files, not a tab.
kraken2
and bracken
First, make a script where you use kraken2
with default
settings and its standard database to classify the reads in
your fastq files. The standard database is found in
/mnt/databases/kraken2/Standard
.
Then use bracken
on the kraken2
report to
get final estimates of read counts for the taxa. Assign reads to the
species rank in bracken
.
Next, repeat the procedure above, using kraken2
and then
bracken
, but this time use the custom database in
$COURSES/BIN310/assignment4/krk2_db/
. This database
contains only a smallish set of genomes from which your data set
contains a random subset. Thus, this can be thought of as very close to
a ‘perfect’ database for classifying the data you have.
metaphlan
softwareNext, classify your data with the software metaphlan
. We
have not used this in the modules, but you should be able to do this on
your own.
Download a container from galaxy, and make certain you get a
container for metaphlan
version 4 (NB! not
metaphlan2
). Make a shell script for running
metaphlan
on your data, using the command line (you must
edit the container path/name as usual):
apptainer exec $HOME/assignment4/metaphlan:4.0.6--pyhca03a8a_0.sif metaphlan \
$r1,$r2 --input_type fastq --bowtie2out $out_dir/bwt2.bz2 --nproc $threads \
--index mpa_vOct22_CHOCOPhlAnSGB_202212 --bowtie2db $metaphlan_db -o $out_file
You need of course to specify all shell variables used in this
command line before the command line is executed in your
script. Use metaphlan_db=/mnt/databases/metaphlan_vOct22
,
which is where we have the database for this tool. The output will be a
simple text file. The rest you should manage on your own. Use the same
number of threads, memory and wall time as you did for
kraken2
.
Find documentation on metaphlan
on the internet, and
write a short description (max 10 sentences) of the main idea
behind how metaphlan
classifies reads. Focus on
In this part you must compare the results you got above to the gold standard. Everything is done by R coding.
Read your gold standard table into R, and select only the columns
tax_id
and abundance
(you may rename the
latter when select
ing it). From the exercises above, you
should have 3 different results for classifications of the same
metagenome:
kraken2
and bracken
.kraken2
and bracken
.metaphlan
results.Read these results into R, and store them as tables.
The bracken
tables give you some estimated relative
abundances. Rename this column indicating the method/database behind the
abundance estimates (e.g. bracken_standard_abundance
or
bracken_custom_abundance
).
The metaphlan
results have a different format. Inspect
the result file, and figure out how to read this and have it as a table
in R. Here are some hints:
read_delim()
and with a tab
delimiter.#
. The option
comment = "#"
in read_delim()
will enable
this.#
line in the file.Then, once you have a table in R:
filter()
in combination with
str_detect()
.tax_id
. Again, you
need to inspect the table, to understand how to do this. Hint: To
specify the symbol "|"
as a separator or pattern, you need
to write fixed("|")
in R, since this symbol has a special
meaning in regular expressions (thus, never use |
as the separator symbol, like metaphlan
does!).tax_id
to numeric.metaphlan
may list the same species two or more
times! Keep only the row with the first occurrence of any species. If
you group_by(tax_id)
and then slice(1)
you
will keep only the first occurrence of each tax_id
.metaphlan_abundance
.tax_id
and the
metaphlan_abundance
columns and store in an object.Then, full_join()
each of the tables above to the gold
standard table.
Based on the results in the table from above, compute the \(L_1\) norm between the gold standard abundance and each of the other three. Compute also the precision and recall for the three methods. Use absence threshold 0. Report these values, and write a short discussion of them.
Make a stacked bar chart plot. The plot should only display the gold standard species, and their abundances in all the four cases.
In order to pass this the RMarkdown report must contain:
kraken2
and
bracken
using the standard as well as the custom database.
You may have this in the same script.metaphlan
.metaphlan
computes the taxonomic composition.