We want to explore how two different ways of processing metabarcoding data may affect the results we get. The results we refer to are diversity measures, both alpha and beta diversities.
This is based mostly on the topics from Modules 10 and 11.
As before, you make a report in RMarkdown, and hand in the HTML-file on Canvas. As usual, you include essential code, figures and some short text discussing the results and the questions below.
Add links to GitHub or similar websites for the software tools you use. Think of this as a preliminary report in a larger scientific project.
In $COURSES/BIN310/assignment5/sample_tables/ you find a text file for each student containing a sample table, i.e. a table where each row holds information about one metabarcoding sample. Use the table with your username. You may copy the table to your own folder.
Each table has the columns sample_id, R1_file, R2_file and station, and has 30 rows, i.e. there are 30 samples. The samples have been taken from sediments at various distances from a fish farm. The samples with station equal to 1 are from close to the fish farm, where the sediments are most influenced by the farm. The samples from stations 2, 3, ..., 6 are taken gradually further away from the fish farm and are less and less affected by the fish farming.
The files named in the columns R1_file and R2_file are all found in $COURSES/BIN310/assignment5/fastq/. You must not copy these files, just read them from where they are.
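Reading the fastq files in place comes down to combining the fastq folder with the file names in the sample table. Below is a minimal sketch in R. It assumes the sample table is a tab-separated text file with a header line (the file format is not stated above), and read_sample_table is a hypothetical helper name used for illustration only.

```r
# Hypothetical helper: read a sample table and add full paths to the fastq files.
# Assumption: the table is tab-separated with a header line.
read_sample_table <- function(table_file, fastq_dir) {
  tbl <- read.delim(table_file, stringsAsFactors = FALSE)
  # Point to the files where they are instead of copying them
  tbl$R1_path <- file.path(fastq_dir, tbl$R1_file)
  tbl$R2_path <- file.path(fastq_dir, tbl$R2_file)
  tbl
}

# Example with a made-up two-row table written to a temporary file
demo_file <- tempfile(fileext = ".txt")
writeLines(c("sample_id\tR1_file\tR2_file\tstation",
             "s1\ts1_R1.fastq.gz\ts1_R2.fastq.gz\t1",
             "s2\ts2_R1.fastq.gz\ts2_R2.fastq.gz\t2"), demo_file)
sample_tbl <- read_sample_table(demo_file, fastq_dir = "/path/to/fastq")
print(sample_tbl$R1_path)
```

For the real data, fastq_dir would be the fastq folder named above, for example built with file.path(Sys.getenv("COURSES"), "BIN310/assignment5/fastq"), provided the COURSES environment variable is visible to your R session.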
dada2 denoising

Make the code needed to process your samples using the dada2 approach. You only need the read count table output, no taxonomy this time.

Report how many ASVs dada2 identified.
UNOISE denoising

As an alternative to denoising with dada2 you will use the UNOISE method. This is implemented in the vsearch software. The only difference from what we did in module 10 is that when you do the OTU-clustering itself (part 2), you replace the code:
--cluster_size $tmp_folder/all_derep_minsize.fasta \
--id $OTU_identity \
with
--cluster_unoise $tmp_folder/all_derep_minsize.fasta \
--unoise_alpha 2 \
Make the code to run vsearch using the UNOISE algorithm to denoise your data. Again, the only result you need from this is the read count table.

Report how many OTUs UNOISE identified.
Make phyloseq objects for the results from dada2 and UNOISE.
Plot the alpha diversity, using the Shannon measure, for the samples, and color by the station information. Do this for both the dada2 and the UNOISE results. Add some text explaining what the figures show:
- The effect of station, i.e. distance away from the fish farm?
- The effect of method (dada2 versus UNOISE)?

We want to see if the Shannon diversity values are affected by how many reads we have in the various samples. Compute for each sample the Shannon diversity and the library size (total number of reads) for both the dada2 and UNOISE results. Put this into a table with the columns shannon, library_size and method, the latter being either dada2 or UNOISE. Then, make a scatter plot of shannon versus library_size and color by method.
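The table described above can be sketched as follows. This is an illustration in base R on made-up read counts, not the course solution. The helper names shannon_index and diversity_table are hypothetical, the natural logarithm is assumed for the Shannon measure, and the count tables are assumed to have samples in rows (a table with OTUs in rows would need to be transposed first).

```r
# Shannon diversity H = -sum(p * log(p)) over the taxa present in a sample.
# Assumption: natural logarithm.
shannon_index <- function(x) {
  p <- x[x > 0] / sum(x)
  -sum(p * log(p))
}

# Hypothetical helper: one row per sample with Shannon diversity and library size.
# Assumption: 'counts' has samples in rows and ASVs/OTUs in columns.
diversity_table <- function(counts, method) {
  data.frame(
    sample_id    = rownames(counts),
    shannon      = unname(apply(counts, 1, shannon_index)),
    library_size = unname(rowSums(counts)),
    method       = method,
    row.names    = NULL
  )
}

# Made-up read count tables for three samples (not real data)
counts_dada2 <- matrix(c(10, 0, 5,
                          3, 3, 3,
                          0, 8, 0),
                       nrow = 3, byrow = TRUE,
                       dimnames = list(c("s1", "s2", "s3"), paste0("asv", 1:3)))
counts_unoise <- matrix(c(9, 1, 5, 0,
                          3, 3, 2, 1,
                          0, 8, 0, 0),
                        nrow = 3, byrow = TRUE,
                        dimnames = list(c("s1", "s2", "s3"), paste0("otu", 1:4)))

# Stack the results from the two methods into one table
diversity_tbl <- rbind(diversity_table(counts_dada2, "dada2"),
                       diversity_table(counts_unoise, "UNOISE"))
print(diversity_tbl)
```

With ggplot2 loaded, a table of this shape can be plotted with ggplot(diversity_tbl, aes(x = library_size, y = shannon, color = method)) + geom_point().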
There are 5 samples from each station. We expect these to be rather similar to each other. We now want to know whether these samples become more or less similar to each other depending on whether we used dada2 or UNOISE for processing the data.
To quantify this, consider first the dada2 results. Compute beta diversities between all samples using the Bray-Curtis dissimilarity. Store this in a matrix (30x30). Then, collect all Bray-Curtis values between samples within the same station. Since there are 5 samples for each station, there should be 10 distances within each station. Store this in a table with the columns
- bray_curtis. This column contains the Bray-Curtis values, one for each pair of samples.
- station. The station number for the various Bray-Curtis values.
- method. Here simply "dada2" in all rows.

Then, repeat this and compute a similar table from the UNOISE results (with method equal to "UNOISE" in all rows).
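Collecting the within-station values can be sketched like this. The example uses base R and a made-up count table with four samples so that it runs without extra packages. The helper names bray_curtis and within_station_bc are hypothetical, and it is an assumption that the dissimilarity is computed on the raw read counts with samples in rows.

```r
# Bray-Curtis dissimilarity between two count vectors
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Hypothetical helper: table of Bray-Curtis values for pairs within the same station.
# Assumption: 'counts' has samples in rows, and 'station' follows the same row order.
within_station_bc <- function(counts, station, method) {
  n <- nrow(counts)
  bc <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      bc[i, j] <- bray_curtis(counts[i, ], counts[j, ])
    }
  }
  # Each unordered pair of samples once: the upper triangle of the matrix
  pairs <- which(upper.tri(bc), arr.ind = TRUE)
  same <- station[pairs[, "row"]] == station[pairs[, "col"]]
  pairs <- pairs[same, , drop = FALSE]
  data.frame(
    bray_curtis = bc[pairs],
    station     = station[pairs[, "row"]],
    method      = method,
    row.names   = NULL
  )
}

# Made-up data: two samples from each of two stations (not real data)
toy_counts <- matrix(c(10,  0, 0,
                        0, 10, 0,
                        5,  5, 0,
                        5,  5, 0),
                     nrow = 4, byrow = TRUE,
                     dimnames = list(paste0("s", 1:4), paste0("asv", 1:3)))
toy_station <- c(1, 1, 2, 2)

bc_tbl <- within_station_bc(toy_counts, toy_station, method = "dada2")
print(bc_tbl)
```

With 5 samples per station the upper triangle holds choose(5, 2) = 10 pairs for each station, which matches the 10 distances mentioned above.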
Finally, bind the tables together, and make a box plot of the Bray-Curtis values where you use method on the x-axis and split into one panel for each station (use facet_wrap(~station)).
Add some text explaining what the figures show:
- The effect of station?
- The effect of method?

With the goal of comparing the methods, the figure above is sub-optimal, because we have paired data: the Bray-Curtis values from the two methods are for the same pairs of samples. Arrange instead the data in a table with the columns
- station
- BC_dada2. The Bray-Curtis values from above, from the dada2 data.
- BC_unoise. The Bray-Curtis values from above, from the UNOISE data.

Then, mutate a new column with the differences BC_dada2 - BC_unoise and name this difference. Also, mutate the station column to be a factor.
Finally, fit a linear model (lm()) with difference as response and station as predictor. Run summary() on the fitted model object.
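The paired analysis can be sketched as below. The Bray-Curtis values are random made-up numbers for three stations, used only to show the shape of the table and the model call. Base R assignments are used here in place of mutate() so that the sketch runs without extra packages.

```r
set.seed(1)  # made-up numbers, fixed seed for reproducibility

# Hypothetical paired table: one row per within-station pair of samples
paired_tbl <- data.frame(
  station   = rep(1:3, each = 10),
  BC_dada2  = runif(30, min = 0.3, max = 0.6),
  BC_unoise = runif(30, min = 0.3, max = 0.6)
)

# Paired difference between the methods, and station as a factor
paired_tbl$difference <- paired_tbl$BC_dada2 - paired_tbl$BC_unoise
paired_tbl$station <- factor(paired_tbl$station)

# Linear model with the difference as response and station as predictor
fit <- lm(difference ~ station, data = paired_tbl)
print(summary(fit))
```

With R's default treatment contrasts, the intercept is the mean difference at the first station, and each remaining coefficient is the change in mean difference for that station relative to the first.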
Add some text explaining the output and the important results.

In order to pass, the RMarkdown report must contain:
- The code for processing the samples with dada2.
- The code for running vsearch with the UNOISE denoising.
- The code for making the phyloseq objects and computing/plotting alpha diversity results.