phyloseq: an R package for microbiota data analysis
The phyloseq R package
In the previous module we saw how we can get from a set of reads from many samples of amplicon data to a read count matrix with accompanying data tables. In this module we will have a look at how we can get meaningful and interpretable information out of such read count data.
In this module we will look at some diversity measures for the data, and also how we can visualize these in ways that give us some interesting information about the problem. Throughout the module we will look at the results from both dada2 and vsearch. These were applied to the same samples and should basically reflect the same "world"; if they differ, the difference is due to the method, not the underlying population. Which is more correct? We leave the question open for reflection.
The analysis of read count table data is common, and this has resulted in a set of 'standard' approaches. It has also spawned software solutions (pipelines) where exactly the same set of results/figures is computed each time. There is always a balance here between making science reproducible and making it mundane. We will go through one very commonly used package, but emphasize that we only cover a small part of what can be done with count data, and that it is the aim of the study, not a pipeline, that should determine what to include next time you do this.
We will have a look at the phyloseq
package in R. This
is also a software solution for ‘standardizing’ the analysis of such
data, but since it is within the R language universe it is possible, and
sometimes necessary, to go beyond the tools provided by this package in
order to obtain the desired results. Keep in mind that learning how to
‘speak the R language’ is far more important than simply learning
some standard use of this package (or any other package).
In the next module we will run some analysis on the same data to
substantiate what we found this week. It involves statistics at a fairly
advanced level, and could very well have been a separate course on its
own. We will only focus on the most generic parts of it.
The phyloseq package
First, we need to install the phyloseq
package. We can
find specifications for installation on the Bioconductor
homepage Phyloseq. From this page we see that the following code
will install Phyloseq. Remember,
do this only once! It can take some time so be patient:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("phyloseq")
NOTE: When asked if you want to update packages, please answer NONE (n)! If not, some of the packages will be re-installed to a version that these servers for some reason cannot install, and you get some hassle.
The phyloseq object
From Module 10 we have several files from both dada2 and vsearch. For both we have three tables, called readcounts.txt, sample_table.txt and taxonomy.txt, with either dada2 or vsearch as prefix. We need to convert them to fit into a phyloseq object. The phyloseq package contains a range of useful functions designed to operate specifically on such phyloseq objects.
To begin with we need to get these three tables into specific formats with row and column names:
readcount.dta must be a numeric matrix with one row for each sequence cluster (OTU or ASV) and one column for each sample. Row names and column names are identifiers for clusters/samples, respectively.
sample.dta must be a data frame with one row for each sample. The row names contain the sample identifiers (sample_id) that link it to the column names of readcount.dta.
taxon.dta must be a character matrix with one row for each cluster. The row names must match the OTU names in readcount.dta.
Thus, by converting our txt files to either matrix or data frame, we may use the pre-made functions in the phyloseq package. Let us construct a phyloseq object from the dada2 output in the last module:
library(tidyverse)
library(phyloseq)
### We load results from module 10
# OTU/ASV table
readcount <- read_delim("/mnt/courses/BIN310/module11/dada2_readcounts.txt") # Reading the txt file.
### We need to convert the readcount to a matrix and use the first column (ASV) as row names
readcount.dta <- data.matrix(readcount[,-1])
rownames(readcount.dta) <- readcount$ASV
# Sample table
sampletable <- read_delim("/mnt/courses/BIN310/module11/sample_table.txt")
### We need to use the sample_id column as row names
sample.dta <- data.frame(sampletable[,-1], row.names=sampletable$sample_id)
# Taxonomy table
taxonomy <- read_delim("/mnt/courses/BIN310/module11/dada2_taxonomy.txt")
### We need to convert taxonomy to a matrix containing only the taxonomy, with Header as row names:
taxonomy %>%
select(-Header, -Sequence) %>%
as.matrix() -> taxon.dta
rownames(taxon.dta) <- taxonomy$Header
# Now, lets create the phyloseq object
physeq.obj <- phyloseq(otu_table(readcount.dta, taxa_are_rows = T),
sample_data(sample.dta),
tax_table(taxon.dta))
print(physeq.obj)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 33 taxa and 15 samples ]
## sample_data() Sample Data: [ 15 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 33 taxa by 7 taxonomic ranks ]
If something went wrong, here are some details to check:
In the phyloseq
terminology the read count matrix is
called an otu_table
, and we use the function
otu_table()
to create this. This must be a numeric matrix
(not a table!) with read counts, and the name otu_table
is
at best confusing. Note that we have to indicate if taxa are in
the rows or in the columns, i.e. both configurations are allowed.
Next, the table with sample data, and the function for creating it,
is named sample_data
in phyloseq
. This must be
defined as a data frame, and our sampletable
can be used,
with one modification. The sample identifiers we have in the
sample_id
column must be row names of this table. We
therefore create sample.dta
which has correct row names and
is defined as a data.frame.
Finally the tax_table
is also a matrix(!) and must have
the cluster identifiers in the row names. It should only contain the
taxonomy, not the sequences like we have in our
taxonomy
.
There is one more element we may add to our physeq.obj
.
For some analyses we need a phylogenetic tree based on the
cluster sequences. This must be an object of the type phylo
from the ape
package. We may come back to this in the next
module.
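As a minimal sketch (not something we do in this module), such a tree could be added to the object like this, assuming we had a Newick file asv_tree.nwk (a hypothetical file) with tip labels matching our ASV names:
library(ape)
asv.tree <- read.tree("asv_tree.nwk")                               # hypothetical tree file, tips named like the ASVs
physeq.obj.tree <- merge_phyloseq(physeq.obj, phy_tree(asv.tree))   # add the tree as a new component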
We have now managed to get our data as a phyloseq object and can use
functions within this package. Try to write
help(package = "phyloseq")
in the Console to get an
overview. You are now welcome to play around with these functions. Let us
start by plotting and summarizing data.
Barplots are a nice way to visualize community data, and of course phyloseq has a function for this:
p.bar <- plot_bar(physeq.obj, fill = "Genus")
print(p.bar)
Notice that bars have different heights, since we now base this on read counts, not relative abundances. The returned ggplot object can be manipulated further by adding more layers:
p.bar2 <- p.bar + facet_grid(~Phylum)
print(p.bar2)
There are some phyloseq functions for subsetting and summarizing the data.
Here is how we can sum the total read count for each OTU:
total.abundance <- taxa_sums(physeq.obj)
It is simply rowSums()
of the otu_table
(which is a matrix…).
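As a quick sanity check (a small sketch, assuming taxa are stored in the rows as above), the two should agree:
all.equal(taxa_sums(physeq.obj), rowSums(as(otu_table(physeq.obj), "matrix")))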
We can subset the data into another phyloseq
object.
Here we use prune_taxa()
to select only a subset of the
OTUs. We need the identifiers for the OTUs we want to keep. In this case
we keep the 10 most abundant ones:
most.abundant.10 <- names(sort(total.abundance, decreasing = T))[1:10]
physeq.obj.top10 <- prune_taxa(most.abundant.10, physeq.obj)
print(physeq.obj.top10)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 10 taxa and 15 samples ]
## sample_data() Sample Data: [ 15 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 10 taxa by 7 taxonomic ranks ]
prune_taxa() simply subsets the OTUs given as argument, in our case the most abundant OTUs. There is also a subset_taxa() function in phyloseq, doing exactly what it says: subsetting the data based on taxonomic input:
physeq.obj.bacteroides <- subset_taxa(physeq.obj, Genus == "Bacteroides")
There is also a prune_samples() function for subsetting the samples instead of the OTUs (see the sketch below). Note that all parts of the data structure (physeq.obj) are affected by this subsetting. It should be noted that if you first run psmelt() (see further down in the document) and then use the standard filter() function, you could do similar subsettings. However, the phyloseq data structure is needed for some analyses, and the subsetting of the melted table is only really useful for plotting.
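Here is a minimal sketch of prune_samples(), assuming the conditions are coded "A", "B" and "C" in the condition column of the sample data:
# Keep only the samples from condition A (prune_samples() also accepts a logical vector)
physeq.obj.A <- prune_samples(sample_data(physeq.obj)$condition == "A", physeq.obj)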
The agglomeration of read counts means we merge and sum read
counts from several OTUs. The function tax_glom()
may be
used to merge all taxa in the same clade to a single taxon, e.g. merge
all genera under the same family. Their read counts are then summed:
physeq.obj.family <- tax_glom(physeq.obj, taxrank = "Family")
Inspect the otu_table
and tax_table
for
this new object. You need to name a rank in the taxonomy where you want
the agglomeration.
Agglomerations are useful. Sometimes we want to make our analyses at
a higher taxonomic rank than the lowest (Genus), and use
tax_glom()
. Also, if you merge several data sets (see
merge_phyloseq()
), the OTUs' centroid sequences will in
general differ slightly, even if they actually represent the same taxon.
This has to do with the clustering method used. Then, by merging very
close OTUs, using tip_glom()
and a small value for
h
, you may get a more relevant set of OTUs for the
community.
phyloseq
We only briefly touch upon all possibilities in phyloseq
here. Search the net for more information. Here are a couple of places
to start:
vsearch
Use the code above and the files from vsearch
from last
module to make a new phyloseq object
of the
vsearch
results, call it physeq.objV
. You find
the files in the same folder as the dada2
files we read
above.
Examine the two different phyloseq objects: what is different and what is similar?
### We load results from module 10 but switching dada2 with vsearch
# OTU/ASV table
readcount <- read_delim("/mnt/courses/BIN310/module11/vsearch_readcounts.txt") %>%
rename(OTU = `#OTU ID`)
### We need to convert the readcount to a matrix and the first row to row names instead of a column
readcount.dta <- data.matrix(readcount[,-1])
rownames(readcount.dta) <- readcount$OTU
# Sample table - is the same as for dada2 and we use this directly
# Taxonomy table
taxonomy <- read_delim("/mnt/courses/BIN310/module11/vsearch_taxonomy.txt")
### We need to convert taxonomy to a matrix only contain taxonomy and use Header as row names:
taxonomy %>%
select(-Header, -Sequence) %>%
as.matrix() -> taxon.dta
rownames(taxon.dta) <- readcount$OTU #readcount column one has the right names
# Now, lets create the phyloseq object
physeq.objV <- phyloseq(otu_table(readcount.dta, taxa_are_rows = T),
sample_data(sample.dta),
tax_table(taxon.dta))
print(physeq.objV)
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 25 taxa and 15 samples ]
## sample_data() Sample Data: [ 15 samples by 4 sample variables ]
## tax_table() Taxonomy Table: [ 25 taxa by 7 taxonomic ranks ]
Make barplots for the phyloseq object
from
vsearch
, physeq.objV
, for all genera and then
only for the five most abundant OTUs.
p.barV <- plot_bar(physeq.objV, fill = "genus") #In the tax.file genus is now written with lower case "g"
total.abundance <- taxa_sums(physeq.objV) #calculate abundance
most.abundant.5 <- names(sort(total.abundance, decreasing = T))[1:5] #use the 5 top abundant OTU
physeq.obj.top5 <- prune_taxa(most.abundant.5, physeq.objV) #prune the phyloseq object
p.barV5 <- plot_bar(physeq.obj.top5, fill = "genus") #In the tax.file genus is now written with lower case "g"
print(p.barV5)
You can have multiple figures side by side in RMarkdown by specifying fig.show="hold" along with the out.width option. Set out.width="50%" and plot the two barplots, not pruned and without splitting on phylum, side by side for easier comparison. In plain RStudio you can use the par() function for putting several (base R) plots in the same figure.
Compare and discuss, with a fellow student or by yourself, the differences and why the results are not the same.
p.bar + ggtitle("dada2")
p.barV + ggtitle("vsearch")
We need to keep in mind that the picture we look at here is only a sample of the true microbial community. We are unable, both physically and technically, to obtain all taxa and their composition in an environment. That would only be possible if we grew the bacteria in a closed environment in the laboratory. We sample from a bigger environment and then use these samples to "guess something" about the true environment. Hence, the samples are not of particular interest in themselves, except that they (hopefully) reflect the composition of the environment sampled. This is equivalent to the theory from any statistics course.
We will now look at some measures that can be used for analysis of the data. These measures are usually used to reflect the environment the samples come from and to draw conclusions, but keep in mind that we only have a sample, which depends on the sampling itself, the technology that sequenced it, and all the pre-processing steps we have done to get here.
In this module we will consider two diversity measures: I. Alpha diversity, the diversity within a sample, and II. Beta diversity, the distance between samples.
The alpha diversity of a sample indicates how large the
diversity is within the sample. A large diversity means it contains many
different taxa, or conversely, a small diversity means the sample is
completely dominated by a very few taxa. There is a range of statistics
we may compute for indicating diversity. Here we plot three of them for
both our phyloseq
objects:
p.div <- plot_richness(physeq.obj, color = "condition", measures = c("Observed","Shannon","Simpson"))
## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
##
## We recommended that you find the un-trimmed data and retry.
print(p.div)
Shannon index: reflects how many different taxa there are and how evenly they are distributed within a sample.
Simpson index: measures the relative abundance of species and gives more weight to common or dominant species.
The methods implemented here are from the vegan R package, and give a warning if we have no singletons. A singleton is a case where a taxon is observed once, i.e. 1 read. The vegan package was never designed for read count data, but for ecological studies where the taxa are species, and every 'read' is one observation of that species (walk in the forest and observe species). In such tables the number of observations is in general low, with singletons being common. Have a look at the Help for the function diversity() from the vegan package for more details.
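As a small sketch of what happens under the hood, we could compute these indices directly with vegan, which expects samples in the rows (hence the transpose):
library(vegan)
count.mat <- t(as(otu_table(physeq.obj), "matrix"))   # samples in rows, taxa in columns
diversity(count.mat, index = "shannon")
diversity(count.mat, index = "simpson")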
All the plots generated by the phyloseq
package are
based on ggplot
and return such objects. This means we may
manipulate them further using functions from the ggplot2
package.
More on alpha-diversity here:
In ecology a beta-diversity reflects the difference in composition between communities. We may also think of it as a distance, even if not all beta-diversity statistics are strictly metrics in the mathematical sense. How different is one sample from another? This is the type of question we try to shed some light on by computing beta diversity statistics.
Consider two samples from our read count matrix, \(Y_{*j}\) and \(Y_{*k}\).
The Manhattan distance between two compositions is as follows:
\[Mnt(j,k) = \sum_{i=1}^n |Y_{i,j} - Y_{i,k}|\] but slightly more common is the Euclidean distance
\[Ecd(j,k) = \sqrt{\sum_{i=1}^n (Y_{i,j} - Y_{i,k})^2}\] The only difference is the squaring and then the square root. The Manhattan distance is the sum of absolute differences between the coordinates and has its name from the street grid of Manhattan in New York City, while the Euclidean distance is what we think of as distance in daily life (the length of a straight line between two points).
Both are actually special cases of the more general Minkowski distance formula: \[Mnk(j,k) = (\sum_{i=1}^n |Y_{i,j} - Y_{i,k}|^q)^{1/q}\] By choosing \(q=1\) you get the Manhattan and with \(q=2\) the Euclidean distance. All the Minkowski variants will result in distances that are somewhat similar to each other, and we typically stick to the Euclidean representative, unless we have a good reason not to.
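As a tiny sketch of these formulas, using two made-up read count vectors (not from our data):
y1 <- c(10, 0, 5, 20)
y2 <- c(8, 3, 0, 25)
sum(abs(y1 - y2))                                  # Manhattan
sqrt(sum((y1 - y2)^2))                             # Euclidean
dist(rbind(y1, y2), method = "minkowski", p = 2)   # Minkowski with q = 2 equals the Euclidean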
If you have CLR transformed data, the euclidean distance is a natural
choice of distance between samples.
There is a long list of distance or distance-like statistics that are used for read count data.
One example is the UniFrac distances, but in order to compute these you need a phylogenetic tree describing how all OTUs are related, which we have not included in this module.
The Bray-Curtis is not really a distance, since it does not fulfill all mathematical requirements, but it is still used a lot as a measure of dissimilarity. Here is the formula: \[Bcu(j,k) = \frac{\sum_{i=1}^n |Y_{i,j} - Y_{i,k}|}{\sum_{i=1}^n (Y_{i,j} + Y_{i,k})}\]
Notice that the numerator is the Manhattan distance we saw above. Here we divide it by the sum of all abundances in the two samples. This means the Bray-Curtis gives values between \(0.0\) (the samples are identical) and \(1.0\) (the samples consist of completely different OTUs). Notice that since we sum the differences in abundances, it is a good idea to normalize the data prior to a Bray-Curtis computation, see Normalization below for more on this.
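Again a small sketch on made-up vectors, computing Bray-Curtis both by the formula above and with vegdist() from the vegan package:
y1 <- c(10, 0, 5, 20)
y2 <- c(8, 3, 0, 25)
sum(abs(y1 - y2)) / sum(y1 + y2)                   # Bray-Curtis by the formula
vegan::vegdist(rbind(y1, y2), method = "bray")     # the same value from vegan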
phyloseq
The phyloseq
package has a wrapper function for
computing a range of different distances between all pairs of samples. It
typically works like this:
dist.obj <- distance(<phyloseq object>, method = <"name of method">)
The returned dist.obj
is a standard distance object in R
that stores only the lower triangle of the distance matrix. We may
convert it into a distance matrix by
as.matrix(dist.obj)
.
Compute the euclidean distance and the manhattan distance between
samples both for dada2
and vsearch
, using
method = "euclidean"
and method = "manhattan"
with physeq.obj
and physeq.objV
(non-normalized data), respectively. Store in different dist-objects.
Then make a hierarchical clustering based on the computed distance
object, and plot, like this:
plot(hclust(<dist.obj>), hang = -1)
We expect the samples from the three conditions (A, B and C) to form distinct groups. Do they?
Ecd <- distance(physeq.obj, method = "euclidean")
#plot(hclust(Ecd), hang = -1)
EcdV <- distance(physeq.objV, method = "euclidean")
#plot(hclust(EcdV), hang = -1)
Md <- distance(physeq.obj, method = "manhattan")
#plot(hclust(Md), hang = -1)
MdV <- distance(physeq.objV, method = "manhattan")
#plot(hclust(MdV), hang = -1)
The distances from above are most commonly used when doing some ordination of the data. This means plotting the samples as points in a two-dimensional space (x-y-diagram), where we get some information about how similar/different they are compared to each other. Think of each sample as a vector of coordinates in some high-dimensional space, with one axis (coordinate) for each OTU. By ordination we try to 'compress' this into a low (=2) dimensional space, while still preserving as much of the information as possible. Such plots will in general not show all relations between the samples, but hopefully enough for us to see some groupings or gradients. Ordination is by nature exploratory.
You may be familiar with Principal Component Analysis (PCA) from
elsewhere. This is an example of some ordination. If we want to make a
PCA-plot of our samples, we would take the transposed (samples in rows,
OTUs in columns) otu_table
and feed this to
prcomp()
. Then we would plot the samples as dots along the
two first principal axes. If you have used the CLR normalization from
above, a PCA is a natural way to do ordination of the data. Let us now
look at some alternatives to PCA.
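Here is a small sketch of that idea, using the raw counts in physeq.obj (with CLR normalized data you would use that object instead):
count.mat <- t(as(otu_table(physeq.obj), "matrix"))   # samples in rows, OTUs in columns
pca <- prcomp(count.mat)                              # principal component analysis
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")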
For read count table data, the Principal Coordinate Analysis (PCoA) is popular. PCoA is also the same as multidimensional scaling (MDS). What is this? A major difference from PCA is that MDS starts from a matrix of pairwise distances only, not from the data matrix itself. Consider the following example:
Imagine we have measured the flight distances between the cities Oslo, New York, Durban and Sydney. Here is the distance matrix (numbers in km):
cn <- c("Oslo", "New York", "Durban", "Sydney")
D <- matrix(c(0, 5914, 10153, 15950,
5914, 0, 13308, 15989,
10153, 13308, 0, 10546,
15950, 15989, 10546, 0), nrow = 4, byrow = T)
colnames(D) <- cn
rownames(D) <- cn
print(D)
## Oslo New York Durban Sydney
## Oslo 0 5914 10153 15950
## New York 5914 0 13308 15989
## Durban 10153 13308 0 10546
## Sydney 15950 15989 10546 0
Unless you belong to the flat-earth society, you realize these are distances measured along the outside of a sphere. Still, is it possible to display, in a two-dimensional plane, how these cities are located relative to each other? Thus, we want to place the cities as points in a flat diagram such that the distances between the points match the original distances as well as possible.
We realize it may not be possible to re-construct all distances in this flat diagram, but we should try to get as close as possible. This is exactly what MDS is all about. Here is some R code for doing this in the classical way:
X <- cmdscale(D)
as_tibble(X, rownames = "City") %>%
ggplot() +
geom_text(aes(x = V1, y = V2, label = City))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
You may find the cities have been placed a little strangely compared to how we are used to thinking about these cities' locations. The axes have no meaning in terms of north-south or east-west; it is only the relative location between cities we can re-construct. How well are the distances actually re-constructed here? Here is the answer:
D.hat <- as.matrix(round(dist(X)))
print(D.hat)
## Oslo New York Durban Sydney
## Oslo 0 5284 9849 15916
## New York 5284 0 13307 15907
## Durban 9849 13307 0 10449
## Sydney 15916 15907 10449 0
phyloseq
Let us make a PCoA plot of our raw data, using the Bray-Curtis dissimilarity. First we compute the distances and then the ordination object:
bc <- distance(physeq.obj, method = "bray")
ord.obj <- ordinate(physeq.obj, method = "PCoA", distance = bc)
The first plot we make should always be a scree plot. This indicates how much of the total distance information we may capture along axis 1, 2,… up to the number of samples. We typically only plot the first two axes, but it is very important to get some information about how much we do not see when only looking through this 'window'. Here is the result:
plot_scree(ord.obj)
Most of the information lies in the first two axes, but not everything. The steeper this plot is, the more information we capture by plotting in two dimensions.
Let us plot the ordination and look for groupings:
plot_ordination(physeq.obj, ord.obj, color = "condition")
From this plot we conclude that when measured by Bray-Curtis, the samples largely group into three distinct groups by "condition", with some exceptions for conditions A and B, telling us there is something unexplained that is not condition. This means the three groups of samples have somewhat similar composition within, and rather different composition between, the conditions.
A version of MDS that is popular for read count data is the Non-metric MDS (NMDS). What is the difference to the metric MDS?
Here is a short video on YouTube that explains some major ideas of NMDS
An important concept for any MDS is the stress. When finding the new coordinates of our samples in the x-y-plane, (N)MDS does so by minimizing this stress. We start out with the original \(p\times p\) distance matrix \(D\), where \(p\) is the number of samples. Each sample should become a point in our new x-y-diagram, and we start by placing them randomly. Let \(X\) be the \(p\times 2\) dimensional matrix of these coordinates. Then we can compute the euclidean distance matrix \(\hat{D}\) from these locations of the samples. We now want to shuffle the points around (change values in \(X\)) such that \(\hat{D}\) becomes as similar as possible to \(D\). What is meant by 'as similar as possible'? This is where the stress comes in.
In the metric MDS a typical stress function is \[Stress(X) = \frac{\sum_{i=1}^p \sum_{j=1}^p (d_{ij} - \hat{d}_{ij})^2}{\sum_{i=1}^p \sum_{j=1}^p d_{ij}^2}\]
where \(d_{ij}\) is the cell in row \(i\) and column \(j\) of \(D\), and similarly for \(\hat{D}\). The stress is simply the sum of squared differences in distances, divided by the sum of squared distances.
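As a small illustration, we can plug the city distances from above into this formula, using the original D and the re-constructed D.hat:
sum((D - D.hat)^2) / sum(D^2)   # the (metric) stress of the city configuration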
In the non-metric version this stress function is changed to \[Stress(X) = \frac{\sum_{i=1}^p \sum_{j=1}^p (d_{ij} - f(\hat{d}_{ij}))^2}{\sum_{i=1}^p \sum_{j=1}^p d_{ij}^2}\] where the function \(f(\hat{d}_{ij})\) is a monotonic mapping of the real distances against the observed ones. In practice this means we place more emphasis on the ordering of the distances, i.e. the distances that are the smallest (largest) in the original \(D\) should also be the smallest (largest) in \(\hat{D}\). If this is the case we are happy, regardless of whether some of the (extreme) distances deviate a lot from each other. An NMDS will typically be less sensitive to the extreme distances, in both directions.
Once we have found our best \(\hat{D}\) we compute the final stress. This should ideally be zero, and the larger it gets, the poorer we managed to approximate the original \(D\) with our \(\hat{D}\). A rough rule of thumb says that if this stress is above \(0.20\) we cannot trust the ordination plot to show us the true relations between the samples.
Let us make the NMDS version of the PCoA we did above:
set.seed(3) # set seed for random generator
nmds.obj <- ordinate(physeq.obj, method = "NMDS", distance = bc)
## Run 0 stress 0.05569319
## Run 1 stress 0.05569296
## ... New best solution
## ... Procrustes: rmse 0.00016019 max resid 0.0003688632
## ... Similar to previous best
## Run 2 stress 0.06893738
## Run 3 stress 0.09056592
## Run 4 stress 0.06893739
## Run 5 stress 0.05569303
## ... Procrustes: rmse 0.0007301191 max resid 0.001689554
## ... Similar to previous best
## Run 6 stress 0.09056588
## Run 7 stress 0.09056609
## Run 8 stress 0.1239057
## Run 9 stress 0.1391017
## Run 10 stress 0.06070222
## Run 11 stress 0.05569287
## ... New best solution
## ... Procrustes: rmse 0.0005064185 max resid 0.001171525
## ... Similar to previous best
## Run 12 stress 0.05569327
## ... Procrustes: rmse 0.0003912291 max resid 0.0009035739
## ... Similar to previous best
## Run 13 stress 0.05569326
## ... Procrustes: rmse 0.0003843763 max resid 0.0008882769
## ... Similar to previous best
## Run 14 stress 0.06893739
## Run 15 stress 0.055693
## ... Procrustes: rmse 0.0005406105 max resid 0.001250971
## ... Similar to previous best
## Run 16 stress 0.09056591
## Run 17 stress 0.05569297
## ... Procrustes: rmse 0.000166026 max resid 0.0003817002
## ... Similar to previous best
## Run 18 stress 0.09056597
## Run 19 stress 0.0556929
## ... Procrustes: rmse 0.0004404844 max resid 0.001019052
## ... Similar to previous best
## Run 20 stress 0.05569297
## ... Procrustes: rmse 0.0001658132 max resid 0.0003794088
## ... Similar to previous best
## *** Best solution repeated 7 times
Notice from the output that there is a search for the best positioning of the samples in the new x-y-space. After some searching, the algorithm should hopefully converge on a solution. The NMDS fitting does not always converge, which is often a bad sign. In this case, however, it is because the stress becomes too low! It simply means the data are too 'clean'. When we make the plot:
pp_nmds <- plot_ordination(physeq.obj, nmds.obj, color = "condition",
title = str_c("NMDS, stress=", format(nmds.obj$stress, digits = 3)))
pp_nmds + geom_point(alpha = 0.2, size = 10)
…it paints much the same picture as above, only "flipped". Condition C clearly groups together, with low within-group variation, while conditions A and B both have some observations with somewhat different compositions. Look over the results we have seen earlier in the module: are you able to recognize which samples these might be? The stress becomes low here, which is not surprising given the scree plot we made before. A lack of convergence is something you should always look out for, even if most real-life data sets will not have the problem of too low stress as we saw here (more commonly the opposite).
Remember to always indicate the stress-value in an NMDS plot, since this piece of information is needed to know how reliable the locations are.
Use the steps above, to calculate beta-diversities for the
vsearch
results with Bray-Curtis dissimilarity, for
comparison with the dada2
results.
#Beta-diversities:
bc <- distance(physeq.objV, method = "bray")
#PCoA
ord.obj <- ordinate(physeq.objV, method = "PCoA", distance = bc)
scree <- plot_scree(ord.obj) #Screeplot
print(scree)
pcoa <- plot_ordination(physeq.objV, ord.obj, color = "condition") # note: use physeq.objV here, since ord.obj was computed from the vsearch object
print(pcoa)
#NMDS
set.seed(3) # set seed for random generator
nmds.obj <- ordinate(physeq.objV, method = "NMDS", distance = bc)
## Run 0 stress 0.03824316
## Run 1 stress 0.08732245
## Run 2 stress 0.03824314
## ... New best solution
## ... Procrustes: rmse 4.217962e-05 max resid 0.0001087576
## ... Similar to previous best
## Run 3 stress 0.03824316
## ... Procrustes: rmse 5.503178e-05 max resid 0.0001424725
## ... Similar to previous best
## Run 4 stress 0.03824314
## ... Procrustes: rmse 5.628404e-06 max resid 1.073182e-05
## ... Similar to previous best
## Run 5 stress 0.09977718
## Run 6 stress 0.03824314
## ... Procrustes: rmse 1.886513e-06 max resid 4.007864e-06
## ... Similar to previous best
## Run 7 stress 0.03824314
## ... Procrustes: rmse 4.197161e-06 max resid 1.068885e-05
## ... Similar to previous best
## Run 8 stress 0.03824315
## ... Procrustes: rmse 2.505667e-05 max resid 6.42225e-05
## ... Similar to previous best
## Run 9 stress 0.03824314
## ... Procrustes: rmse 1.859144e-05 max resid 4.763153e-05
## ... Similar to previous best
## Run 10 stress 0.03824314
## ... Procrustes: rmse 7.073933e-06 max resid 1.826221e-05
## ... Similar to previous best
## Run 11 stress 0.03824315
## ... Procrustes: rmse 2.010389e-05 max resid 5.062851e-05
## ... Similar to previous best
## Run 12 stress 0.03824316
## ... Procrustes: rmse 4.663333e-05 max resid 0.0001197091
## ... Similar to previous best
## Run 13 stress 0.03824314
## ... Procrustes: rmse 8.496855e-06 max resid 2.149958e-05
## ... Similar to previous best
## Run 14 stress 0.03824314
## ... Procrustes: rmse 1.701685e-05 max resid 4.378684e-05
## ... Similar to previous best
## Run 15 stress 0.03824314
## ... Procrustes: rmse 2.961512e-06 max resid 6.496092e-06
## ... Similar to previous best
## Run 16 stress 0.03824315
## ... Procrustes: rmse 2.753349e-05 max resid 7.151864e-05
## ... Similar to previous best
## Run 17 stress 0.03824314
## ... Procrustes: rmse 9.23827e-06 max resid 2.359375e-05
## ... Similar to previous best
## Run 18 stress 0.03824314
## ... Procrustes: rmse 9.383536e-06 max resid 2.337688e-05
## ... Similar to previous best
## Run 19 stress 0.08732186
## Run 20 stress 0.1053096
## *** Best solution repeated 16 times
pp_nmdsV <- plot_ordination(physeq.objV, nmds.obj, color = "condition",
title = str_c("NMDS, stress=", format(nmds.obj$stress, digits = 3)))
pp_nmdsV + geom_point(alpha = 0.2, size = 10)
We have data processed in two ways, either with dada2
or
vsearch
. How does this influence the data, e.g. the
dissimilarity between the samples? With ordination plots, we see an
approximation to the distances. If we want to compare the methods, we
should look directly into the distances themselves, not just some
ordination plot.
In our data there are 5 samples from each condition. It would be
natural to expect these 5 samples are quite similar to each other given
that they come from the same condition. But is this affected by whether we
used dada2
or vsearch
for processing the
data?
To quantify this, start with the dada2
data, and compute
distances between all pairs of samples using the Bray-Curtis
dissimilarity. Then we collect only those Bray-Curtis values that are
between samples from the same condition, and store these in a
table where we also have columns indicating the condition and the
method. Here is some code skeleton for doing this:
# Assume you have phyloseq objects from above
sample.dta <- sample_data(___)
cond <- unique(___$condition)
# All Bray-Curtis values, stored as a matrix
bc.mat <- distance(___, method = ___) %>%
as.matrix()
# Extracting those from within conditions
bc_dada2.tbl <- NULL
for(i in 1:___){
idx <- which(sample.dta$___ == cond[i])
bc_dada2.tbl <- tibble(bray_curtis = as.numeric(as.dist(bc.mat[idx,idx])),
condition = ___,
method = "___") %>%
bind_rows(bc_dada2.tbl)
}
Then you do the same for the vsearch data. Finally, put the two tables together (bind_rows()), and make a plot of the Bray-Curtis distances for each condition, colored by method:
ggplot(___) +
geom_jitter(aes(x = ___, y = ___, color = ___),
width = 0.1, size = 4, alpha = 0.5)
Do you find any systematic difference between the methods?
sample.dta <- sample_data(physeq.obj)
cond <- unique(sample.dta$condition)
# The dada2
# All Bray-Curtis values, stored as a matrix
bc.mat <- distance(physeq.obj, method = "bray") %>%
as.matrix()
# Extracting those from within conditions
bc_dada2.tbl <- NULL
for(i in 1:length(cond)){
idx <- which(sample.dta$condition == cond[i])
bc_dada2.tbl <- tibble(bray_curtis = as.numeric(as.dist(bc.mat[idx,idx])),
condition = cond[i],
method = "dada2") %>%
bind_rows(bc_dada2.tbl)
}
# The vsearch
# All Bray-Curtis values, stored as a matrix
bc.mat <- distance(physeq.objV, method = "bray") %>%
as.matrix()
# Extracting those from within conditions
bc_vsearch.tbl <- NULL
for(i in 1:length(cond)){
idx <- which(sample.dta$condition == cond[i])
bc_vsearch.tbl <- tibble(bray_curtis = as.numeric(as.dist(bc.mat[idx,idx])),
condition = cond[i],
method = "vsearch") %>%
bind_rows(bc_vsearch.tbl)
}
# Putting into same table and plotting
bind_rows(bc_dada2.tbl, bc_vsearch.tbl) %>%
ggplot() +
geom_jitter(aes(x = condition, y = bray_curtis, color = method),
width = 0.1, size = 4, alpha = 0.5)
We cannot claim to see a clear difference between the methods here. This is good news, since then the variation between samples from the same condition does not depend on how we processed the data.
Melting the phyloseq object
It is often desirable to take a closer look at certain taxa, for
example by looking at the relative abundance. Even if
phyloseq
has some plotting functions, we may end up wanting
something special, and we would like to exploit our expertise in the
general use of ggplot
. An approach is then to ‘melt’ the
complex physeq.obj
into a big table:
ps.tbl <- psmelt(physeq.obj)
Inspect this table. Notice there is a column Abundance
containing all the read counts from the otu_table
, and the
remaining columns are from the additional tables. We may now use this to
create our own plots, e.g. a bar plot with relative abundances
pp_ra <- ggplot(ps.tbl) +
geom_col(aes(x = Sample, y = Abundance, fill = Genus), position = "fill") +
theme(axis.text.x = element_text(angle = 90)) +
guides(fill = guide_legend(keyheight = 0.1))
print(pp_ra)
We can now divide this by the different conditions:
pp_rac <- ggplot(ps.tbl) +
geom_col(aes(x = Sample, y = Abundance, fill = Genus), position = "fill") +
facet_grid(~condition, scales = "free", space = "free") +
theme(axis.text.x = element_text(angle = 90)) +
guides(fill = guide_legend(keyheight = 0.1))
print(pp_rac)
Can you see something familiar?
Use the vsearch results, and the phyloseq object physeq.objV, to make relative abundance barplots at phylum level, both overall and divided by condition, like this:
ps.tbl <- psmelt(physeq.objV)
pp_raV <- ggplot(ps.tbl) +
geom_col(aes(x = Sample, y = Abundance, fill = phylum), position = "fill") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("The overall relative abundance for Phyla") +
guides(fill = guide_legend(keyheight = 0.1))
pp_raV
pp_racV <- ggplot(ps.tbl) +
geom_col(aes(x = Sample, y = Abundance, fill = phylum), position = "fill") +
facet_grid(~condition, scales = "free", space = "free") +
theme(axis.text.x = element_text(angle = 90)) +
ggtitle("The realtive abundance for Phyla divded by condition") +
guides(fill = guide_legend(keyheight = 0.1))
pp_racV
Here is an important aspect not discussed in detail in this module, but usually important to consider when dealing with complex count data: all read count matrices suffer from the fact that we have different library sizes between our samples. The library size is simply the total number of reads in a sample. This reflects our sequencing effort, and is a purely technical issue having no relation to the composition of the underlying community. The data in our physeq.obj are all such raw read count data, which is why we saw the variable bar heights above.
By normalization we refer to any attempts to remove this library size effect for subsequent analyses. This is a topic where strong opinions have been communicated over recent years!
First some notation. Consider the read count matrix \(X\) (otu_table
in
phyloseq
) with \(p\)
samples in the columns and \(n\) OTUs
in the rows. The read count for OTU \(i\) and sample \(j\) we denote \(X_{i,j}\). The read counts for sample \(j\) is denoted \[X_{*j} = (X_{1,j},
X_{2,j},...,X_{n,j})\]
where we have data for \(n\) OTUs.
Think of this vector as column \(j\) in
the otu_table
. The normalized version of \(X_{*j}\) we denote \(Y_{*j}\). This will contain the same number
of data, but their value is different from those in \(X_{*j}\).
By dividing all read counts in a sample by the total for that sample,
we get relative abundances, i.e. the sum of all abundances is 1.0 for
all samples. This is one way of normalizing the data to get rid of the
library size effect, and is often referred to as the TSS (total
sum scaling) normalization. This means \[Y_{*j} =
\frac{X_{*j}}{\sum_{i=1}^{n}{X_{i,j}}}\] We will, however, lose some information by doing this. A sample that has been sequenced extensively will in general contain more taxa. This is not because the community actually contains more taxa, but because we have explored it more intensely. Thus, by not taking this into consideration we may reach
some bad conclusions. Some alpha diversity statistics are sensitive to
this.
An alternative way of normalizing is rarefying. This basically means randomly re-sampling all samples down to the same library size, typically that of the sample with the lowest read count, i.e. the sum of read counts becomes identical for all samples. Thus, it is very similar to TSS, but results in read counts instead of relative values. All samples sum to the same read count value.
There is no simple formula describing this, but we can easily
implement it in R. Let us assume the smallest library size of our
samples is S
. Then:
for(j in 1:ncol(X)){
Y[,j] <- rmultinom(1, size = S, prob = X[,j])
}
where we assume X
and Y
are existing (read
count) matrices of proper dimensions. This procedure corresponds to
re-sampling with replacement. This means read count for some
taxon may be larger in \(Y_{*j}\) than
it was in \(X_{*j}\) if S
is close to the total sum in \(X_{*j}\). It is also possible to do
re-sampling without replacement, which we will look at in an
exercise.
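For comparison, here is a sketch of re-sampling without replacement for a single read count vector Xj, down to a library size S (this requires S to be no larger than the total read count of Xj):
reads <- rep(seq_along(Xj), times = Xj)             # one element per read, labelled by OTU index
picked <- sample(reads, size = S, replace = FALSE)  # draw S reads without replacement
Yj <- tabulate(picked, nbins = length(Xj))          # count reads per OTU again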
It should be noted that a 'method' called Common Sum Scaling (COM) simply multiplies the TSS values of every sample by the same constant (typically the smallest library size). This produces 'rarefied' data by simply scaling the TSS values. It seems to me pointless, and we could stick to the TSS without multiplying them all with the same value.
Here is an article by the people behind the phyloseq
package that dismisses rarefying altogether, as a rather stupid thing to
do: https://www.ncbi.nlm.nih.gov/pubmed/24699258. The reason
is that we throw away data. In addition, if we do the random sampling,
we also get additional noise.
Even if this approach is debated, it is still used, and at least for
some analyses it is relevant. By throwing away data, you make all
analyses harder for yourself, but this is also considered a ‘safe’ error
to make in science. It lowers the probability of both true positive and
false positive findings, which is considered better than the opposite.
Here is an
article where rarefying is compared to other normalizations. Even
the phyloseq
package has a function
rarefy_even_depth()
!
The Cumulative Sum Scaling is similar to TSS, but instead of dividing by the total sum in each sample, we sum only the smaller values. The reasoning here is that some taxa tend to be over-sampled, i.e. they have more reads than they should, and we should not include the largest read counts when doing the normalization.
For sample \(j\), assume we have sorted the read counts in ascending order. Then the formula is: \[Y_{*j} = \frac{X_{*j}}{\sum_{i=1}^{l}{X_{i,j}}}\] where the crucial point is we sum to \(l<n\) in the denominator, i.e. we do not divide by the total read count, but some smaller value.
Which value do we use for \(l\)? If we used \(l=n\) we are back to TSS. Compute first the median read count over all samples, \(m\). Then, compute the \(l\)-quantile for each sample, \(q_{jl}\). This is the \(l\)th largest value, i.e. if \(l=0.75\) then we pick value \(0.75\cdot n\) among the \(n\) sorted values. If we compare these quantiles to the median \(m\) we typically observe that as long as \(l\) is around \(0.5\) they are all near \(0\). As we increase \(l\) the differences to the median grows, and these differences also become very different from each other (larger variation). Thus, we stop when \(l\) produces differences to the median that reaches a tolerated variation set by the user. Note that if all samples are fairly similar to each other with respect to composition, we may end up using \(l=n\), i.e. a simple TSS normalization.
Note that a CSS normalization will in general not eliminate the
library size effect, only lower it.
The Relative Log Expression is an idea from gene expression
(RNAseq) data analysis. It is described in the articles behind the R
package DESeq2
, which is used for analyzing gene expression
read count data. Our read count data are not from gene expression
studies, but are still read counts, and the idea may be used. We will
revisit the DESeq2
in module 12.
To understand this we need some more notation. The geometric mean (as opposed to the arithmetic mean) of some vector with \(n\) elements is the product of the values, and then the \(n\)th root of this, i.e.
\[g(x) = (x_1\cdot x_2\cdots
x_n)^{1/n}\] Next, let \(X_{i*}\) be the read count values for OTU
number \(i\) across all \(p\) samples in our read count matrix (this
is a row in the otu_table
). For each of these
vectors we compute the geometric mean \[g(X_{i*}) = (X_{i,1}\cdot X_{i,2}\cdots
X_{i,p})^{1/p}\] Note that if you take the logarithm on both
sides, the right hand side is simply the arithmetic mean (average) of
the log-transformed read counts. Think of \(g(X_{i*})\) as some sort of average (log)
read count for OTU number \(i\).
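In R we could compute it via logarithms, as hinted at above (a small sketch, valid for positive values only):
geo.mean <- function(x){
  exp(mean(log(x)))   # equals (x_1 * x_2 * ... * x_n)^(1/n)
}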
Next, we want to compute the size factor for sample \(j\). First we compute \[S_j = median(\frac{X_{1,j}}{g(X_{1*})}, \frac{X_{2,j}}{g(X_{2*})},..., \frac{X_{n,j}}{g(X_{n*})} )\] Note that the elements we take the median of are read counts for each OTU scaled by the ‘average’ (geometric mean) read count for that OTU in the entire data set, Thus, if sample \(j\) has a large library size, the \(S_j\) becomes large, and smaller if the number of reads is low. The size factor is then
\[Sz_j = \frac{S_j}{g(S_1, S_2,...,S_p)}\] where the denominator is the geometric mean of the \(S_j\) values (one from each of the \(p\) samples). Again, the \(Sz_j\) reflects the library size, a larger \(Sz_j\) means sample \(j\) contains many reads compared to the rest (and vice versa). It is also a value around 1, above 1 if the sample is large and below 1 if the sample is small compared to the rest. Then (finally) the normalization is
\[Y_{*j} =
\frac{\sum_{j=1}^{p}Sz_j/p}{Sz_j} X_{*j}\] We observe that the
normalized read counts are the original ones multiplied by a factor. If
the size factor \(Sz_j\) is larger than
average for the samples, the factor we multiply with above is less than
1, i.e. read counts are reduced. Opposite, if the library size is small,
the \(Sz_j\) is less than average, and
we scale the read counts up.
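Here is a rough sketch of these computations for a read count matrix X with OTUs in rows and samples in columns, using the geo.mean() helper from above. Note that any zeros must be dealt with first (see below), since log(0) breaks the geometric means:
g <- exp(rowMeans(log(X)))                     # geometric mean per OTU
S <- apply(X / g, 2, median)                   # the S_j for each sample
Sz <- S / geo.mean(S)                          # size factors
Y <- sweep(X, 2, mean(Sz) / Sz, FUN = "*")     # normalized read counts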
There is one issue we should mention here now. In the last method we computed the geometric mean of some vector: \[g(x) = (x_1\cdot x_2\cdots x_n)^{1/n}\] Notice that if one of the elements is \(0\), then the entire geometric mean is \(0\)! In our read count matrices the value \(0\) occurs quite frequently, i.e. some OTU is not observed in some sample. This seems to be devastating to this entire procedure (some of the \(S_j\)’s become infinite!).
Zeros occur in a read count matrix for at least two potential reasons: either the taxon is truly absent from the sample, or the taxon is present but was not detected because we did not sequence deeply enough. The first is denoted an essential zero while the second is a count zero. In the latter case, we should actually not treat the value as zero, but as some small value indicating presence, but in too small quantities to be detected (yet). If you have several comparable samples, you will typically find that in the samples where you have the most reads, you also tend to have more OTUs. This is an indication of count zeros.
In such cases we impute the zeros, i.e. we replace them by some positive value indicating presence, but at very low abundance. The simplest approach is to add pseudo counts. This means we add a small and fixed value to all cells in the matrix, e.g. we add \(1\) read count extra to all cells (also the nonzero cells).
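As a one-line sketch, a pseudo count of 1 could be added to our dada2 count matrix like this:
readcount.pseudo <- readcount.dta + 1   # adds 1 read to every cell, also the non-zero ones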
A more sophisticated approach is to use the data to impute a more
‘correct’ value in each cell. If we look at all data we may typically
find that some zero OTUs should have a larger imputed value than others.
The R package zCompositions
has some nice methods for doing
this.
NOTE: Imputed data are no longer raw data. Imputing data is typically a good idea prior to plotting and explorative analyses, but for some analyses you should be careful about imputing. Some of the alpha diversity and distance methods need information about presence/absence of the OTUs, and this is obviously distorted by any imputation. On the other hand, if our zeros are mostly count zeros (not essential zeros), any analysis method relying on presence/absence type of information is perhaps not a good idea anyway?
The Centered Log Ratio transform also uses geometric means, but is much simpler than the RLE. The formula is simply \[Y_{*j} = \log( \frac{X_{*j}}{g(X_{*j})} )\] This is often referred to as Aitchison's transform of compositional data in the literature, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5695134/. By compositional data we refer to data where the relative composition (between OTUs) is informative, while the total sum in each sample is not. The CLR method can be seen as an extension of the TSS. It is straightforward to show that if you first do TSS on your data, and then CLR, you get the exact same result as doing CLR right away. We say the CLR is scale-invariant. After CLR the data are no longer 'trapped inside a simplex', but can take all kinds of values, also negative. The data are now log-ratios, and the log of a ratio smaller than 1 is negative.
The fact that read count data are compositional is not always
acknowledged, and here is a paper
where this is pointed out and discussed. There is a range of methods
developed for compositional data, and this is slowly finding its way
into the bioinformatics of read count table analysis.
The essence to remember is that the number of reads we sequence is limited/fixed by our procedure/technology, and not by the composition of the taxa in the community. Thus, when a proportion of taxon A increases from sample 1 to 2, it is not necessarily because there are more of taxon A, but could be because there are less of taxon B. This negative correlation structure imposes some challenges to methods that would work fine in other, non-compositional, cases. The CLR transform has been shown to have some nice properties for compositional data.
In the phyloseq package we find the function transform_sample_counts(). This we may use to normalize the read count values for each sample. It takes two essential arguments: the phyloseq object with the data, and a function that is applied to each sample (column) of the otu_table.
Perform the TSS normalization this way. It means we need a function that takes as input a vector of read counts (the \(X_{*j}\) above) and then computes the normalized vector (\(Y_{*j}\)) by simply dividing by its sum. Here is a small R function doing this
TSS <- function(Xj){
Yj <- Xj / sum(Xj)
return(Yj)
}
Add the code that normalizes physeq.obj
,
physeq.objV
and stores the results in
physeq.tss
and physeq.tssV
, respectively.
Then, use the function otu_table()
to extract the
normalized read count matrix from the latter object, and compute the sum
for each sample vector to verify the values are TSS normalized.
physeq.tss <- transform_sample_counts(physeq.obj, TSS)
colSums(otu_table(physeq.tss))
## sample_1A sample_2A sample_3A sample_4A sample_5A sample_1B sample_2B sample_3B
## 1 1 1 1 1 1 1 1
## sample_4B sample_5B sample_1C sample_2C sample_3C sample_4C sample_5C
## 1 1 1 1 1 1 1
physeq.tssV <- transform_sample_counts(physeq.objV, TSS)
colSums(otu_table(physeq.tssV))
## sample_1A sample_1B sample_1C sample_2A sample_2B sample_2C sample_3A sample_3B
## 1 1 1 1 1 1 1 1
## sample_3C sample_4A sample_4B sample_4C sample_5A sample_5B sample_5C
## 1 1 1 1 1 1 1
Use the rarefy_even_depth()
function to compute rarefied
read counts, where we sample without replacement, and store
this in physeq.rar
and physeq.rarV
. Again,
compute the sum of values in all samples in the normalized
otu_table
.
physeq.rar <- rarefy_even_depth(physeq.obj, replace = F)
colSums(otu_table(physeq.rar))
## sample_1A sample_2A sample_3A sample_4A sample_5A sample_1B sample_2B sample_3B
## 1901 1901 1901 1901 1901 1901 1901 1901
## sample_4B sample_5B sample_1C sample_2C sample_3C sample_4C sample_5C
## 1901 1901 1901 1901 1901 1901 1901
physeq.rarV <- rarefy_even_depth(physeq.objV, replace = F)
colSums(otu_table(physeq.rarV))
## sample_1A sample_1B sample_1C sample_2A sample_2B sample_2C sample_3A sample_3B
## 10412 10412 10412 10412 10412 10412 10412 10412
## sample_3C sample_4A sample_4B sample_4C sample_5A sample_5B sample_5C
## 10412 10412 10412 10412 10412 10412 10412
Make a function that normalizes a sample by the CLR. Make adding
pseudo-counts an option, and use 1 as default value. Then, use the
transform_sample_counts()
from above to normalize, and
store this in physeq.clr
and physeq.clrV
,
respectively. Again, compute the sum for each normalized sample.
Hint: Instead of computing the geometric mean, compute directly its logarithm. Use the rules for logarithms from introductory mathematics.
CLR <- function(Xj, q = 1){
lXj <- log(Xj + q)
lg <- sum(lXj) / length(lXj)
Yj <- lXj - lg
return(Yj)
}
physeq.clr <- transform_sample_counts(physeq.obj, CLR)
colSums(otu_table(physeq.clr))
## sample_1A sample_2A sample_3A sample_4A sample_5A
## 1.332268e-15 5.329071e-15 7.105427e-15 -1.065814e-14 -8.437695e-15
## sample_1B sample_2B sample_3B sample_4B sample_5B
## 1.887379e-15 5.773160e-15 -5.773160e-15 -3.996803e-15 4.440892e-16
## sample_1C sample_2C sample_3C sample_4C sample_5C
## 1.332268e-15 -6.217249e-15 -1.332268e-15 5.995204e-15 1.199041e-14
physeq.clrV <- transform_sample_counts(physeq.objV, CLR)
colSums(otu_table(physeq.clrV))
## sample_1A sample_1B sample_1C sample_2A sample_2B
## -4.440892e-15 -1.110223e-14 1.110223e-14 -4.440892e-15 -4.440892e-15
## sample_2C sample_3A sample_3B sample_3C sample_4A
## 4.884981e-15 3.552714e-15 9.769963e-15 -1.332268e-14 -1.421085e-14
## sample_4B sample_4C sample_5A sample_5B sample_5C
## 8.881784e-16 4.440892e-16 3.108624e-15 -6.661338e-15 1.776357e-15