Introduction

We want to explore how two different ways of processing metabarcoding data may affect the results we get. The results in question are diversity measures, both alpha and beta diversity.

This is based mostly on the topics from Modules 10 and 11.

As before, make a report in RMarkdown and hand in the HTML file on Canvas. As usual, include essential code, figures and some short text discussing the results and the questions below.

Add links to GitHub or similar websites for the software tools you use. Think of this as a preliminary report in a larger scientific project.

The data

In $COURSES/BIN310/assignment5/sample_tables/ you find a text file for each student, containing a sample table, i.e. a table where each row describes one metabarcoding sample. Use the table with your username. You may copy the table to your own folder.

Each table has the columns sample_id, R1_file, R2_file and station, and has 30 rows, i.e. there are 30 samples. The samples have been taken from sediments at various distances from a fish farm. The samples with station equal to 1 are from close to the fish farm, where the sediments are most influenced by the farm. The samples from stations 2, 3, ..., 6 are taken progressively farther away from the fish farm, where the sediments are less and less affected by the fish farming.

The files named in the columns R1_file and R2_file are all found in $COURSES/BIN310/assignment5/fastq/. Do not copy these files; just read them from where they are.
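As a rough illustration of reading the fastq files in place, the sketch below builds full paths from a sample table without copying anything. It is written in Python only to show the idea; your report should do the equivalent in shell or R. The two example rows, the file names and the tab separator are all hypothetical, so check them against your own table.

```python
import csv
import io
import os

# Hypothetical two-row excerpt; the real table has 30 rows, and the tab
# separator is an assumption about the file format.
table_text = (
    "sample_id\tR1_file\tR2_file\tstation\n"
    "s01\ts01_R1.fastq.gz\ts01_R2.fastq.gz\t1\n"
    "s02\ts02_R1.fastq.gz\ts02_R2.fastq.gz\t6\n"
)

# Folder named in the assignment text; $COURSES is expanded only if it is set.
fastq_dir = os.path.expandvars("$COURSES/BIN310/assignment5/fastq")

samples = []
for row in csv.DictReader(io.StringIO(table_text), delimiter="\t"):
    samples.append({
        "sample_id": row["sample_id"],
        "station": int(row["station"]),
        # Point at the shared files instead of copying them.
        "R1_path": os.path.join(fastq_dir, row["R1_file"]),
        "R2_path": os.path.join(fastq_dir, row["R2_file"]),
    })

for s in samples:
    print(s["sample_id"], s["station"], s["R1_path"])
```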

The `dada2` denoising

Make the code needed to process your samples using the dada2 approach. You only need the read count table output, no taxonomy this time.

Report how many ASVs dada2 identified.

The `UNOISE` denoising

As an alternative to denoising with dada2 you will use the UNOISE method. This is implemented in the vsearch software. The only difference from what we did in Module 10 is that when you do the OTU-clustering itself (part 2), you replace the code:

 --cluster_size $tmp_folder/all_derep_minsize.fasta \
 --id $OTU_identity \

with

 --cluster_unoise $tmp_folder/all_derep_minsize.fasta \
 --unoise_alpha 2 \

Make the code to run vsearch using the UNOISE algorithm to denoise your data. Again, the only result you need from this is the read count table.

Report how many OTUs UNOISE identified.

Alpha diversity

Make phyloseq objects for the results from dada2 and UNOISE.

Plot the alpha-diversity, using the Shannon measure, for the samples, and color by the station information. Do this for both the dada2 and the UNOISE results. Add some text explaining what the figures show:

Is there an effect of station, i.e. distance away from fish farm?
Is there an effect of the processing method (dada2 versus UNOISE)?

We want to see if the Shannon diversity values are affected by how many reads we have in the various samples. Compute for each sample the Shannon diversity and the library size (total number of reads) for both the dada2 and UNOISE results. Put this into a table with columns shannon, library_size and method, the latter being either dada2 or UNOISE. Then, make a scatter-plot of shannon values versus library_size and color by method.

Is there a visible effect of library size on the Shannon diversity measure?
Is there a visible effect of processing method?
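To make the two quantities concrete, here is a minimal sketch of the Shannon index and library size per sample, collected into rows with the columns named above. It is in Python purely as an illustration of the arithmetic; the report itself should use R. The read counts are hypothetical, and the natural logarithm is an assumption, so check which base your R function uses.

```python
import math

def shannon(counts):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

# Hypothetical read counts: one list per sample, one entry per ASV/OTU.
read_counts = {
    "dada2": {"s01": [50, 30, 20, 0], "s02": [250, 250, 250, 250]},
    "UNOISE": {"s01": [48, 30, 18, 4], "s02": [260, 240, 250, 250]},
}

rows = []
for method, samples in read_counts.items():
    for sample_id, counts in samples.items():
        rows.append({
            "sample_id": sample_id,
            "shannon": shannon(counts),
            "library_size": sum(counts),  # total number of reads in the sample
            "method": method,
        })

for row in rows:
    print(f'{row["sample_id"]}  {row["shannon"]:.3f}  '
          f'{row["library_size"]}  {row["method"]}')
```

Note that the index is computed from proportions, so multiplying every count in a sample by the same factor leaves it unchanged. Any effect of library size therefore has to come through which taxa are detected at all.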

Beta diversities

There are 5 samples from each station. We expect these to be rather similar to each other. We are now interested in whether these samples become more or less similar to each other depending on whether we used dada2 or the UNOISE method for processing the data.

To quantify this, consider first the dada2 results. Compute beta diversities between all samples using the Bray-Curtis dissimilarity. Store these in a 30x30 matrix. Then, collect all Bray-Curtis values between samples within the same station. Since there are 5 samples for each station, there should be 10 distances within each station (the number of ways to pick 2 samples out of 5). Store these in a table with the columns:

bray_curtis. This column contains the Bray-Curtis values, one for each pair of samples.
station. The station number for the various Bray-Curtis values.
method. Here simply "dada2" in all rows.
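The collection of within-station pairs can be sketched as follows. This is a Python illustration of the logic only, using hypothetical counts for 2 stations with 3 samples each; the report should do the same in R on the real 30 samples. The dissimilarity is computed here on the counts as given, and whether to transform the counts first is a choice the sketch leaves open.

```python
from itertools import combinations

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator

# Hypothetical read counts: 2 stations with 3 samples each (the real data
# have 6 stations with 5 samples each, giving 10 pairs per station).
counts = {
    "s1": [10, 0, 5], "s2": [8, 2, 5], "s3": [10, 1, 4],
    "s4": [0, 12, 3], "s5": [1, 10, 4], "s6": [0, 11, 4],
}
station = {"s1": 1, "s2": 1, "s3": 1, "s4": 2, "s5": 2, "s6": 2}

rows = []
for a, b in combinations(sorted(counts), 2):  # each unordered pair once
    if station[a] == station[b]:              # keep within-station pairs only
        rows.append({
            "bray_curtis": bray_curtis(counts[a], counts[b]),
            "station": station[a],
            "method": "dada2",
        })

print(len(rows))  # 3 pairs per station times 2 stations
```

Taking each unordered pair once corresponds to reading only the lower (or upper) triangle of the distance matrix, which avoids the zero diagonal and duplicated values.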

Then, repeat this and compute a similar table from the UNOISE results (with method equal to "UNOISE" in all rows).

Finally, bind the tables together, and make a box-plot of the Bray-Curtis values where you use method on the x-axis and split into one panel for each station (use facet_wrap(~station)).

Add some text explaining what the figures show:

Is there a visible effect of station?
Is there a visible effect of method?

Pairwise data

With the goal of comparing the methods, the figure above is sub-optimal. The point is that we have pairwise (paired) data: the Bray-Curtis values we get from the two methods are for the same pairs of samples. Arrange instead the data in a table with the columns:

station
BC_dada2. The Bray-Curtis values from above, from the dada2 data.
BC_unoise. The Bray-Curtis values from above, from the UNOISE data.

Then, mutate a new column with the differences BC_dada2 - BC_unoise and name this difference. Also, mutate the station column to be a factor.

Finally, fit a linear model (lm()) with the difference as response and station as predictor. Run a summary() on the fitted model object.
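To help read the coefficient table from such a model, the sketch below reproduces the point estimates by hand. It relies on R's default treatment coding for an unordered factor, where the first level is the reference: the intercept is then the mean difference at the first station, and each remaining coefficient is that station's mean difference minus the reference mean. The code is Python for illustration only, the numbers are hypothetical, and it gives no standard errors or p-values.

```python
from statistics import mean

# Hypothetical paired Bray-Curtis values for the same sample pairs.
rows = [
    {"station": 1, "BC_dada2": 0.30, "BC_unoise": 0.26},
    {"station": 1, "BC_dada2": 0.34, "BC_unoise": 0.28},
    {"station": 2, "BC_dada2": 0.40, "BC_unoise": 0.39},
    {"station": 2, "BC_dada2": 0.44, "BC_unoise": 0.41},
]

# The paired difference for each row.
for row in rows:
    row["difference"] = row["BC_dada2"] - row["BC_unoise"]

# Mean difference per station.
stations = sorted({row["station"] for row in rows})
station_mean = {
    s: mean(r["difference"] for r in rows if r["station"] == s)
    for s in stations
}

# With treatment coding and one factor predictor, the least-squares
# estimates are the reference mean and the offsets from it.
intercept = station_mean[stations[0]]
contrasts = {s: station_mean[s] - intercept for s in stations[1:]}

print(f"(Intercept) {intercept:.3f}")
for s, value in contrasts.items():
    print(f"station{s}    {value:.3f}")
```

Read this way, the intercept row addresses whether the two methods differ at station 1, while the other rows address whether that difference changes from station 1 to the station in question, not whether the methods differ at that station.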

Add some text explaining the output and the important results:

Is there a significant difference between the methods with respect to Bray-Curtis dissimilarity between samples?
If you were asked which of these two methods you would prefer for processing such data, what would you answer?

Rubrics

To pass, the RMarkdown report must contain:

The code (shell and R) for running dada2.
The shell code for running vsearch with the UNOISE denoising.
The R code for reading the results, creating the phyloseq objects and computing/plotting alpha diversity results.
The figures of the alpha diversities and the text explaining what they show, answering the questions.
The R code to compute the Bray-Curtis values and make the box-plots. Some text explaining what the figures show, answering the questions.
The R code to compute the differences in the pairwise data, fitting the linear model, and some text answering the questions.

Assignment 5 - Metabarcoding data