In this project we want to investigate how precisely we can estimate the abundances of the various taxa in a metagenome from Whole Genome sequencing data. We want to compare three slightly different approaches to arrive at such estimates, based on the same data:
In assignment 1 we did something similar, based on metabarcoding data from the amplicon sequencing of the 16S gene. Since this gene is highly conserved, it is difficult to discriminate between species with this marker gene. However, with WGS data we should be able to separate species from each other, and in this assignment a taxon is the same as a species.
You will need to build on your results from assignment 2 and 3 in this last assignment.
This time we only use Illumina data since this is the most
common type of sequencing. You use the raw Illumina data from assignment
2 still found in /mnt/courses/BIN310/assignment2/fastq/ in
the first part below (Classification from reads).
In assignment 2 you made some contigs from these Illumina data, using
metaspades. These contigs you use in the next section below
(Classification from contigs). In this task you will also need the
coverage results for each contig that you computed in assignment 3,
before binning.
In assignment 3 you made some MAGs from the Illumina data (forget the Nanopore MAGs now). These you use in the third approach to finding taxon abundances below.
You use the gold standard for your data set to compare against in all
three cases. You find a file with your data set gold standard
information in
/mnt/courses/BIN310/assignment4/gold_standard/. Note that
this contains some more/different information than before even if it is
for the exact same data.
We have data from group_id A and B, and we use them
both. These two distinct samples typically contain the same taxa, but in
very different abundances, which is why we want to use both in the
comparisons below.
Make R code to read in your gold standard file and make a small table containing:
species with the distinct species names in
each sample (A or B).group_id indicating if it is the A or B
sample.true_abundance with the relative abundance
for each species in each sample.Note that the gold standard table has one row for each genome, and there may be several rows for the same species inside sample A and B. Make certain you summarise the abundances per species for each of the two samples.
Report how many distinct species you actually have in your metagenome.
Save this table (with save()) for later. You will
compare your results below against this.
We want to see to what extent the raw reads can directly give us information about which species we have in the metagenomes.
Use the Illumina raw reads for you data set found in
/mnt/courses/BIN310/assignment2/fastq/.
Use kraken2 and then bracken to classify
all read pairs in the A and B sample, respectively. Use the database
found in
/mnt/courses/BIN310/taxdb/kraken2/GTDB_complete.
Classify to the species rank. You should get one
bracken report file from each of the two data sets. Compute
relative abundances in each sample, add a column with
group_id information, and bind them into one long table.
This table should have columns species,
group_id and read_abundance only.
Compare the results to the gold standard. Join your results
with the gold standard, and replace any NA in any abundance
by zero since a species not found is the same as a species with
estimated abundance zero.
Compute the mean absolute error for each sample (A and B) by comparing the read abundances to the true abundances.
Make scatter-plots (geom_point()) of true abundances
against read abundances for each of the two samples (A and B), and add
the mean absolute error to the title of each plot.
We want to see to what extent all the assembled contigs give us information about which species we have in the metagenomes.
Use the Illumina contigs of minimum length 1000 that you also used in assignment 3. From the each sample you have many contigs who never survived the binning and quality check to become part of MAGs.
Use the metabuli tool to classify all these contigs, for
A and B sample separately. Use the metabuli database in
/mnt/courses/BIN310/taxdb/metabuli/GTDB_complete. Again,
collect the species classifications only, i.e. contigs not classified to
species are just discarded.
From assignment 3 you have the mean read coverage for each of these
contigs. Use this to compute the relative abundance of each contig in
each sample (A and B). Then summarise the abundance for each species and
make a result table with columns species,
group_id and contig_abundance.
Again, join with the gold standard as you did for the read abundances, and compute the same mean absolute errors and make corresponding plots.
We want to see to what extent the MAGs give us information about which species we have in the metagenomes.
Use the MAGs you found in assignment 3 from the Illumina data only (not Nanopore).
Use gtdbtk to classify all MAGs into the GTDB taxonomy,
using the full GTDB database. Collect the species information,
and discard any MAGs who did not get a species classification.
You need to compute the mean read coverage for each MAG in each of
the two samples (A and B). Use bowtie2 and
samtools just like in assignment 3, but map the reads to an
index consisting of all MAGs. Compute the mean read coverage
for each MAG in each sample, and convcert this to relative
abundances.
You should again end with a table with three columns:
species, group_id and
MAG_abundances.
Then, make the comparion to the gold standard in the exact same way as for the previous two approaches.
We want to compare the relative abundances of the top 10 species in each of the metagenomes only. Above we used all species, but often the most abundant ones are the most interesting ones.
Consider only the top 10 most abundant species in each sample (A and B) according to your gold standard data (if there are less than 10 species, use all). Filter to keep only the corresponding results from the three result tables above, and join it all into one big table.
Use this to make two (A and B) stacked bar charts showing one bar for each case (gold standard, reads, contigs, MAGs) and with one colored sector for each species.
You have now tried to estimate the species abundances from the same data, but at three different ‘levels’ (raw reads, assembled contigs, polished MAGs).
What are the pros and cons of the three approaches? Which results are best and poorest?
How good are the results, and you may compare to what you found back in assignment 1. Some would say these results are artificially good, why?
In order to pass this the RMarkdown report must contain: