1 Introduction

In this project we want to investigate how precisely we can estimate the abundances of the various taxa in a metagenome from Whole Genome sequencing data. We want to compare three slightly different approaches to arrive at such estimates, based on the same data:

  • From raw reads directly
  • From assembled contigs
  • From MAGs

In assignment 1 we did something similar, based on metabarcoding data from the amplicon sequencing of the 16S gene. Since this gene is highly conserved, it is difficult to discriminate between species with this marker gene. However, with WGS data we should be able to separate species from each other, and in this assignment a taxon is the same as a species.

You will need to build on your results from assignment 2 and 3 in this last assignment.

2 Data

This time we only use Illumina data since this is the most common type of sequencing. You use the raw Illumina data from assignment 2 still found in /mnt/courses/BIN310/assignment2/fastq/ in the first part below (Classification from reads).

In assignment 2 you made some contigs from these Illumina data, using metaspades. These contigs you use in the next section below (Classification from contigs). In this task you will also need the coverage results for each contig that you computed in assignment 3, before binning.

In assignment 3 you made some MAGs from the Illumina data (forget the Nanopore MAGs now). These you use in the third approach to finding taxon abundances below.

You use the gold standard for your data set to compare against in all three cases. You find a file with your data set gold standard information in /mnt/courses/BIN310/assignment4/gold_standard/. Note that this contains some more/different information than before even if it is for the exact same data.

We have data from group_id A and B, and we use them both. These two distinct samples typically contain the same taxa, but in very different abundances, which is why we want to use both in the comparisons below.

3 The gold standard

Make R code to read in your gold standard file and make a small table containing:

  • The column species with the distinct species names in each sample (A or B).
  • The column group_id indicating if it is the A or B sample.
  • The column true_abundance with the relative abundance for each species in each sample.

Note that the gold standard table has one row for each genome, and there may be several rows for the same species inside sample A and B. Make certain you summarise the abundances per species for each of the two samples.

Report how many distinct species you actually have in your metagenome.

Save this table (with save()) for later. You will compare your results below against this.

4 Classification from reads

We want to see to what extent the raw reads can directly give us information about which species we have in the metagenomes.

Use the Illumina raw reads for you data set found in /mnt/courses/BIN310/assignment2/fastq/.

Use kraken2 and then bracken to classify all read pairs in the A and B sample, respectively. Use the database found in /mnt/courses/BIN310/taxdb/kraken2/GTDB_complete.

Classify to the species rank. You should get one bracken report file from each of the two data sets. Compute relative abundances in each sample, add a column with group_id information, and bind them into one long table. This table should have columns species, group_id and read_abundance only.

Compare the results to the gold standard. Join your results with the gold standard, and replace any NA in any abundance by zero since a species not found is the same as a species with estimated abundance zero.

Compute the mean absolute error for each sample (A and B) by comparing the read abundances to the true abundances.

Make scatter-plots (geom_point()) of true abundances against read abundances for each of the two samples (A and B), and add the mean absolute error to the title of each plot.

5 Classification from contigs

We want to see to what extent all the assembled contigs give us information about which species we have in the metagenomes.

Use the Illumina contigs of minimum length 1000 that you also used in assignment 3. From the each sample you have many contigs who never survived the binning and quality check to become part of MAGs.

Use the metabuli tool to classify all these contigs, for A and B sample separately. Use the metabuli database in /mnt/courses/BIN310/taxdb/metabuli/GTDB_complete. Again, collect the species classifications only, i.e. contigs not classified to species are just discarded.

From assignment 3 you have the mean read coverage for each of these contigs. Use this to compute the relative abundance of each contig in each sample (A and B). Then summarise the abundance for each species and make a result table with columns species, group_id and contig_abundance.

Again, join with the gold standard as you did for the read abundances, and compute the same mean absolute errors and make corresponding plots.

6 Classification from MAGs

We want to see to what extent the MAGs give us information about which species we have in the metagenomes.

Use the MAGs you found in assignment 3 from the Illumina data only (not Nanopore).

Use gtdbtk to classify all MAGs into the GTDB taxonomy, using the full GTDB database. Collect the species information, and discard any MAGs who did not get a species classification.

You need to compute the mean read coverage for each MAG in each of the two samples (A and B). Use bowtie2 and samtools just like in assignment 3, but map the reads to an index consisting of all MAGs. Compute the mean read coverage for each MAG in each sample, and convcert this to relative abundances.

You should again end with a table with three columns: species, group_id and MAG_abundances.

Then, make the comparion to the gold standard in the exact same way as for the previous two approaches.

7 Top 10 bar chart

We want to compare the relative abundances of the top 10 species in each of the metagenomes only. Above we used all species, but often the most abundant ones are the most interesting ones.

Consider only the top 10 most abundant species in each sample (A and B) according to your gold standard data (if there are less than 10 species, use all). Filter to keep only the corresponding results from the three result tables above, and join it all into one big table.

Use this to make two (A and B) stacked bar charts showing one bar for each case (gold standard, reads, contigs, MAGs) and with one colored sector for each species.

8 Discussion

You have now tried to estimate the species abundances from the same data, but at three different ‘levels’ (raw reads, assembled contigs, polished MAGs).

What are the pros and cons of the three approaches? Which results are best and poorest?

How good are the results, and you may compare to what you found back in assignment 1. Some would say these results are artificially good, why?

9 Rubrics

In order to pass this the RMarkdown report must contain:

  • The R code to read and prepare the gold standard data.
  • The shell script and subsequent R code required to do the classifications from reads, compute the errors and plot the results.
  • The shell script and subsequent R code required to do the classifications from contigs, compute the errors and plot the results.
  • The shell scripts and subsequent R code required to do the classifications from MAGs, compute the errors and plot the results.
  • The numerical results and figures specified in the various sections.
  • Text explaining briefly what you do and why before/after code chunks. If you code deviates a lot from what is seen in the Module documents, this is especially important. Explain why you choose (entirely) different approaches.
  • Text discussing the results, as specified in the text above.
  • How you used AI and where.