We end by a module on how we may perform and understand the basics of some hypothesis testing on microbiota read count data.
In this last module we DO NOT USE ORION at all. There are two good reasons for this:
phyloseq
objects, and
these are small enough to be handled by most computers. No need for HPC
anymore.Thus, you should install R and RStudio on your own computer (you probably already have this), and then install the following R packages from Bioconductor:
This may take quite some time, since there will be many dependencies.
As usual, we also use the tidyverse
packages a lot below
(wrangling and plotting).
Let us use a data set from the human gut. Here the OTUs have been
classified to the genus rank, and the genus names are as the ‘OTU
names’. The data are found inside a phyloseq
object that
you may load into R by
library(tidyverse)
library(phyloseq)
load(url("https://arken.nmbu.no/~larssn/teach/bin310/gut.phy.RData"))
Before you start analyzing (any) data, you must have a minimum of insight into the size and structure of the data set. Let us start by repeating some stuff from module 11 and get some insight into the data.
Extract the sample_data
from the phyloseq
object, and store it in sample.tbl
. What is the data type
of sample.tbl
?
The people behind the phyloseq
package has decided to
make ‘special’ data types for data structures where we could easily have
used an ordinary table or matrix. This makes it a hassle to work with in
some respects. Convert the sample.tbl
to an ordinary
data.frame
. Hint: First convert it to a
matrix
, then to data.frame
.
Inspect this table. How many samples are there?
There are 4 columns of data about the samples. What data types
(numerical, text, factor?) are these? Hint: Use the
summary()
function.
We want to have
sex
as a factor.age
as numeric.nationality
as a factor.bmi_category
as a factor with levels in this order:
c("underweight", "lean", "overweight", "obese", "severeobese", "morbidobese")
Make these changes to the sample data table. Then, make an updated
phyloseq
object with this table as the
sample_data
.
Keep the sample.tbl
as a separate object as well, we
will use it again later.
sample.tbl <- sample_data(gut.phy) %>%
as.matrix() %>%
as.data.frame() %>%
mutate(sex = factor(sex)) %>%
mutate(age = as.numeric(age)) %>%
mutate(nationality = factor(nationality)) %>%
mutate(bmi_category = factor(bmi_category, levels = c("underweight", "lean", "overweight", "obese", "severeobese", "morbidobese")))
gut.phy <- phyloseq(sample_data(sample.tbl),
otu_table(otu_table(gut.phy)),
tax_table(tax_table(gut.phy)))
Extract the read count table (OTU table) from the
gut.phy
object, and store in otu.mat
. What
data type is this?
Convert it to a matrix
. How many OTUs are there in this
data set?
Compute the library sizes for all the samples, and make a histogram
of them (use sample_sums()
or colSums()
).
Keep the otu.mat
as a separate object, we will use it
again later.
otu.mat <- otu_table(gut.phy) %>%
as.data.frame() %>%
as.matrix()
#View(otu.mat)
cat("There are", nrow(otu.mat), "rows, i.e. OTUs in this data matrix\n")
## There are 130 rows, i.e. OTUs in this data matrix
library.sizes <- sample_sums(gut.phy) # alternative 1
library.sizes <- colSums(otu.mat) # alternative 2
ggplot() +
geom_histogram(aes(x = library.sizes), bins = 25)
We have heard that the alpha diversity of the gut may be important
for your health, i.e. a too low diversity is bad. Make a plot where you
display the Shannon alpha diversity such that we can see if there is an
association between this an the bmi_category
variable.
A straightforward richness plot may be used, but we also add a layer of boxplot to see where the majority of samples are in each category:
plot_richness(gut.phy, x = "bmi_category", measures = "Shannon") +
geom_boxplot()
There is some tendency that diversity goes down as obesity
increases.
This figure is sub-optimal since it displays both all data as point and then boxplots on top. In a report I would make it like this:
estimate_richness(gut.phy, measures = "Shannon") %>%
bind_cols(sample.tbl) %>%
ggplot() +
geom_boxplot(aes(x = bmi_category, y = Shannon), fill = "grey") +
labs(x= "BMI category", y = "Alpha diversity (Shannon)")
Notice that the BMI categories are now sorted in the ‘correct’ order
along the x-axis. This is because we made it into a factor with levels
in that exact order. If we had not done that, ggplot()
would have sorted them alphabetically, which is not a good way to
display them. Categories who have some ordering should always
be displayed in that order!
This ordering of the categories is also very important for two of the methods we will use below.
In the data above we have collected metabarcoding data from the gut of some people. We may then group these people into categories, we saw this above. We may then be interested in: Are there any OTUs who are consistently different in abundance between these groups? Identifying such OTUs may be of interest in order to understand more about the interplay between host and gut microbiota.
In general we have sampled some microbial communities from various
treatments/conditions/sites, which means we can group our
samples by this knowledge. This means we have an OTU table
(otu_table
in phyloseq
) with (normalized) read
counts, and if the samples are the columns, we may group the columns.
The samples are grouped by our knowledge of the various
treatments/conditions/sites that our samples are from. We typically have
this type of information in our sample table (sample_table
in phyloseq
).
Let us now focus on one OTU at the time, and we denote this OTU number \(i\). Let us assume there are two groups only, and we denote them A and B. Formally, we may think like this: If we had infinite many samples from both groups, we could compute the expected abundance for OTU \(i\) for both groups by averaging over the infinite many samples. If these expected values are identical there is no differential abundance for this OTU between the two conditions. If we denote these expected abundances \(\mu_{i,A}\) and \(\mu_{i,B}\), respectively, we have the null hypothesis: \[H_0: \mu_{i,A} = \mu_{i,B} \] In reality we do not have infinite many samples, and we have to make some probabilistic statement based on the data we have, but we must be aware this is the hypothesis we now test.
Note that there will now be a new hypothesis test for each new OTU,
\(i=1,2,...\). We simply have to loop
over all OTUs, and perform the same test again and again based on the
data for each OTU. In the cases where we can reject \(H_0\) we have found something that seems to
differ consistently between the two conditions. These are the OTUs we
find interesting.
Notice that when testing for differential abundance, we have to normalize the raw read counts in some way. If not, the number of reads we have sequenced will influence the outcome of the test, which is nonsense. It is not like people are diseased because you sequenced their gut bacteria with more reads than you did for the healthy people!
This means the data in our OTU table must either already be normalized in some way, or the procedure we employ do this normalization as part of the method itself.
Assume we have the \(TSS\) normalized data, which means we have relative abundance data and the sum of abundances for each sample is 1.0. A rarefaction would also have a similar effect. A very simple approach to the differential abundance testing is then to employ the two-sample t-test for each OTU. You should be familiar with this from introductory statistics.
However, this is not commonly used, simply because it assumes the data are normal distributed. We know that data of this type (read counts) often is very far from normal, with many smallish values and some very big ones. But, this depends on you normalization, and we will not state that a t-test cannot be used. The test is actually quite robust to the assumption of normality.
An alternative to the t-test, and equally simple, is the Wilcoxon two-sample test. This is a non-parametric test meaning it makes no assumptions about the distribution of the data. Instead, the data are just sorted, and their ordering (rank) is used in the test. Note that if \(H_0\) is true, then the values from the two groups should come in a rather mixed order after sorting. If you simply sum the ranks for group A and B, they should come out fairly similar. This is what we use in the Wilcoxon test. If, say, all the abundances from group A are larger than all the values from group B, the sum of orders become very different, indicating we can reject \(H_0\) and claim we have found a differential abundance. The p-value come from computing how often we would see a difference in orders at least as large as the one we compute, given that \(H_0\) is true and that we sample at random the orders of the samples from the two groups.
The reason we do not always use this test, instead of the t-test, is that it has less power than the t-test. This means we need to have many samples in order to reject \(H_0\) with this approach. Also, it does not use the actual abundance values, only their ordering. This means huge variances are not ‘seen’ in the same way as for instance in a t-test. Also, for metabarcoding data there are some other biases who are ignored by this simple approach, and it may tend to come out with slightly too many differential abundant taxa. However, the Wilcoxon test is simple to understand and always something you should have in your ‘toolbox’. We use it here as an entry into this topic, simply because everything is very transparent and easy to follow.
wilcox.test()
We saw in an exercise above that the diversity seemed to be lower in
people from the category morbidobese
compared to the
overweight
. Let us see if we can identify some OTUs who
have consistently different abundances between these two categories.
Let us make the code required to run the Wilcoxon test on each OTU in
the gut data set, and use its relative abundance values from the
categories overweight
and morbidobese
only.
First, use the transform_sample_counts()
on the
gut.phy
object to get relative abundances in all samples
(this is the \(TSS\) normalization from
module 11). Next, use the subset_samples()
function to
extract only the samples where
bmi_category %in% c("overweight", "morbidobese")
. Store the
result in a new phyloseq
object.
Extract the sample.tbl
and otu.mat
from
this new phyloseq
object, see exercise above.
To perform the Wilcoxon test on the first OTU, the code is
tst <- wilcox.test(otu.mat[1,] ~ sample.tbl$bmi_category)
The p-value for this test is then found in
tst$p.value
.
Make the code to run the test for all OTUs (not only the first), and collect the p-values as a column in a table. This table should also have a corresponding column with the OTU-labels such that we know which OTU belongs to which p-value.
Here is a code skeleton you may start with:
# Assume we have the gut.phy from above
# We subset, using only the two groups of interest
gut_sub_norm.phy <- transform_sample_counts(___, function(x){x / sum(x)}) %>%
subset_samples(___)
# We collect the sample.tbl and otu.mat from the phyloseq object
sample.tbl <- sample_data(___) %>%
as.matrix() %>%
as.data.frame()
otu.mat <- otu_table(___) %>%
as.data.frame() %>%
as.matrix()
# Time to run the tests!
wilcox.tbl <- data.frame(OTU = rownames(___),
p_value = -1)
for(i in ___){
tst <- wilcox.test(___[i,] ~ sample.tbl$bmi_category)
wilcox.tbl$p_value[i] <- ___
}
# Assume we have the gut.phy from above
# We subset, using only the two groups of interest
gut_sub_norm.phy <- transform_sample_counts(gut.phy, function(x){x / sum(x)}) %>%
subset_samples(bmi_category %in% c("overweight", "morbidobese"))
# We collect the sample.tbl and otu.mat from the phyloseq object
sample.tbl <- sample_data(gut_sub_norm.phy) %>%
as.matrix() %>%
as.data.frame()
otu.mat <- otu_table(gut_sub_norm.phy) %>%
as.data.frame() %>%
as.matrix()
# Time to run the tests!
wilcox.tbl <- data.frame(OTU = rownames(otu.mat),
p_value = -1)
for(i in 1:nrow(otu.mat)){
tst <- wilcox.test(otu.mat[i,] ~ sample.tbl$bmi_category)
wilcox.tbl$p_value[i] <- tst$p.value
}
At least we get some small p-values (well below 0.05), so perhaps we have found some interesting OTUs here?
Whenever we make a series of hypothesis tests, and ‘search’ for interesting findings in a systematic way like we did above, it is customary to adjust the p-values before we make any statements about where we reject the null-hypothesis and claim we have a significant differential abundance.
The idea is simply that since the p-values are probabilities, and we draw a threshold at some probability value, there is always some probability of false ‘findings’. If we use the ordinary threshold of 0.05 (5%) then, even if the null-hypothesis is actually true (no differential abundance) for all the OTUs, we will on average find that for 5% of them the data just happen to look such that the test ends with a p-value below 0.05. If we have many OTUs, then 5% means we may end up with quite many such false cases. To protect us against this, we should be stricter than normal, but how strict?
It is customary to adjust the p-values and keep the same threshold (e.g. 0.05). The adjustment may be done in several ways, but we will not dig into this here. Instead, we opt for the adjustment called controlling the false discovery rate. The idea behind this adjustment may be understood as follows: If you use the raw p-values and the significant threshold at 5%, you must expect to get 5% false positives among all tests (OTUs). If you use the same threshold on the adjusted q-values, you must expect 5% false positives among those you claim to be positives. Clearly the latter gives fewer false positives.
This means we replace all p-values with the adjusted version, often referred to as q-values, and then say the OTUs we look for are those where the q-value is below 0.05 (or whatever threshold we choose). The q-values are always at least as large as the p-values, and typically fewer OTUs will come out with a q-value below 0.05 than a p-value below 0.05. Thus, we are stricter.
In R we have a function p.adjust()
that has implemented
a range of methods for adjusting the p-values. To use it you
p.adjust()
.method
option,
e.g. method="fdr"
to use the false discovery rate
adjustment.Please note that this procedure will only adjust the p-values you have. It is no guarantee against false findings. If the p-values are rubbish from the start, this adjustment will not change that fact!
Compute the q-values from the p-values we got above, and store these
in a new column q_value
in the same table as we made above.
It is also a good idea to sort the table by this q_value
column.
How many OTUs are differential abundant between categories
overweight
and morbidobese
if you use the 5%
threshold on the computed q-values?
Add to the previous script:
wilcox.tbl <- wilcox.tbl %>%
mutate(q_value = p.adjust(p_value, method = "fdr")) %>%
arrange(q_value)
cat("We find", sum(wilcox.tbl$q_value < 0.05), "OTUs have a differential abundance")
## We find 61 OTUs have a differential abundance
If an OTU has a small enough q-value we can claim it has a
differential abundance between overweight
and
morbidobese
. But, under which category is it more/less
abundant? We should also keep this information in our table. Add two new
columns where you add the average relative abundance for each of these
two groups for each OTU.
This may be solved in many ways, e.g. by pivot_longer()
and group_by()
. Here is an alternative way:
overweight.cols <- which(sample.tbl$bmi_category == "overweight")
morbidobese.cols <- which(sample.tbl$bmi_category == "morbidobese")
wilcox.tbl <- wilcox.tbl %>%
mutate(overweight_mean = rowMeans(otu.mat[,overweight.cols])) %>%
mutate(morbidobese_mean = rowMeans(otu.mat[,morbidobese.cols]))
You may also want to mutate
in a column
difference = overweight_mean - morbidobese_mean
so that its
signs quickly tells you in which category it is more abundant.
When we go hunting for differential abundance OTUs, we may easily find that many OTUs come out with very small q-values, but if we look at their abundances they are sometime very low abundant in both groups. We actually detect the difference between ‘very low’ and ‘super low’.
How interesting are these OTUs really? This is something we always have to ask ourselves. An organism with very low abundance will in most cases not be very important for the environment as a whole, but there are exceptions, and you need to dig more into this.
My point here is that finding a significant differential abundant OTU is a necessary but not a sufficient criterion for reporting it as interesting. For this reason, it is not uncommon to also have a threshold on the average abundance as well, and only report differential abundances for OTUs with abundance above this threshold.
DEseq2
approachTesting if OTUs have a differential abundance between two conditions is very similar to testing if genes have differential expression between two groups of samples. Gene expression data is not something we talk about here in BIN310, since BIN315 is about this topic. We can just say that gene expression is typically measured by sequencing of the mRNA, what is known as RNAseq data. These reads are then mapped to the known genes of the organism(s) and counted, and we get a read count table very similar to what we have for metabarcoding data. Instead of read counts for each OTU we have read counts for each gene, and this is arranged in a table similar to our OTU table. Just replace OTUs with genes, and the data structure is the same. The higher/lower gene expression is then similar to a higher/lower OTU abundance.
There are some commonly used tools for testing if genes are
differential expressed. We will not dig into these methods, this belongs
more to a course in gene expression data analysis (BIN315). But we will
take a look at how we can use one of these tools to also test for
differential abundance in metabarcoding data, more specifically the R
package DESeq2
. The people behind the phyloseq
package also has a suggested procedure for doing this testing that we
follow here now.
When using the DESeq2
approach it is essential that you
use non-normalized raw data as input since the procedure actually
implicitly makes its own type of normalization. Without going into the
details, we can say that here read count data are treated as data
following a negative binomial distribution, and the model is
fitted accordingly. A negative binomial is similar to a Poisson
distribution, but with larger variance. In a Poisson the variance equals
the expectation, they are one and the same thing, but in the negative
binomial the variance is a parameter independent of the expectation just
like in a normal distribution. Negative binomials are commonly used for
count data when the data variation is too large to be explained by a
Poisson.
We again use only the subset of samples from the two categories we used above:
# Assume we have gut.phy from above
gut_sub.phy <- subset_samples(gut.phy, bmi_category %in% c("overweight", "morbidobese"))
Next, we convert our phyloseq
object to a
DESeq2
object:
library(DESeq2)
gut_sub.dsq <- phyloseq_to_deseq2(gut_sub.phy, ~ bmi_category)
Notice that already here we specify the design of the linear
model to be fitted, i.e. we specify that the grouping we are interested
in is described by the bmi_category
variable in our
sample_data
.
Next, we need some geometric means in order to estimate some size factors. This is actually the \(RLE\) normalization that we saw in module 11!
gm_mean <- function(x, na.rm = TRUE){ # local function
exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x)) # for computing geometric mean
}
gmeans <- apply(counts(gut_sub.dsq), 1, gm_mean) # counts(gut_sub.dsq) is the read count matrix
gut_sub.dsq.norm <- estimateSizeFactors(gut_sub.dsq, geoMeans = gmeans) # the RLE normalization
The last step is the actual fitting of the linear model, and computations of all results
gut_sub.dsq.norm.fit <- DESeq(gut_sub.dsq.norm, fitType="local")
The results()
function will extract some results from
the complex gut_sub.dsq.norm.fit
object. It will return a
list
, that we immediately convert into a table:
deseq.tbl <- results(gut_sub.dsq.norm.fit, pAdjustMethod = "fdr") %>%
as.data.frame()
This table lists all OTUs, and each column is some result from the
DESeq2
procedure. The last two columns are raw p-values and
adjusted p-values. Notice we used the same p-value adjustment here as we
did earlier(the fdr
). The second column is the log fold
change or log-ratio. If this is (large) positive, it means this OTU is
more abundant in morbidobese
compared to
overweight
. If it is (large) negative, it is opposite. Note
the ordering of the categories! If you did the first exercise in this
module the bmi_category
should a factor
where
the level overweight
is listed before
morbidobese
. If you did not do that exercise, the ordering
will be opposite, since then DESeq2
will simply use the
alphabetical ordering! The DESeq2
will always compare all
other levels against first factor level. By explicitly converting the
grouping variable into a factor
yourself, you control which
group is the ‘reference-group’ (first level). We will see this again
below, with the ANCOMBC
approach.
We want to make a plot of the Wilcoxon q-values for each OTU against the DESeq2 q-values, and color this according to if it is below 0.05 for both or one of them. We also make the marker size depending on the average abundance for each OTU. In this way we can how the two methods agree/disagree.
First, from the res.tbl
above, select the
padj
column, and add a new column OTU
containing the row names of res.tbl
.Then join to the
wilcox.tbl
we made earlier. Join by the "OTU"
column.
Filter the rows (OTUs) to keep only those where at least one of the
methods indicate differential abundance, i.e. either padj
or q_value
, or both, is below 0.05.
Add a new column significance
indicating if both methods
agree on significance, if only Wilcoxon or only DESeq2 claims this. The
if_else()
function is nice for doing this.
Add also a column abundance
with the average of the
abundances. For each OTU. We already computed this for
overweight
and morbidobese
separately, now you
only need to average these two.
Finally, plot padj
versus q_value
as
points, color by significance
and let the size
be mapped to the abundance
. Here is a code skeleton:
both.tbl <- deseq.tbl %>%
select(___) %>%
mutate(OTU = ___) %>%
right_join(wilcox.tbl, by = ___) %>%
filter(padj < ___ | ___ < 0.05) %>%
mutate(significant = "both") %>%
mutate(significant = if_else(___ > 0.05, "DESeq2", significant)) %>%
mutate(significant = if_else(___ > 0.05, "Wilcoxon", significant)) %>%
mutate(significant = factor(significant, levels = c("Wilcoxon", "both", "DESeq2"))) %>%
mutate(abundance = (___ + overweight_mean) / 2)
fig <- ggplot(___) +
geom_point(aes(x = ___, y = ___, color = ___, size = ___)) +
geom_hline(aes(yintercept = 0.05)) + geom_vline(aes(xintercept = 0.05)) +
scale_x_log10() + scale_y_log10() +
labs(x = "Wilcoxon q-value", y = "DESeq2 q-value", size = "Mean relative\nabundance")
print(fig)
To what extent do these two approaches agree on what is a differential abundant OTU?
both.tbl <- deseq.tbl %>%
select(padj) %>%
mutate(OTU = rownames(deseq.tbl)) %>%
right_join(wilcox.tbl, by = "OTU") %>%
filter(padj < 0.05 | q_value < 0.05) %>%
mutate(significant = "both") %>%
mutate(significant = if_else(q_value > 0.05, "DESeq2", significant)) %>%
mutate(significant = if_else(padj > 0.05, "Wilcoxon", significant)) %>%
mutate(significant = factor(significant, levels = c("Wilcoxon", "both", "DESeq2"))) %>%
mutate(abundance = (morbidobese_mean + overweight_mean) / 2)
fig <- ggplot(both.tbl) +
geom_point(aes(x = q_value, y = padj, color = significant, size = abundance)) +
geom_hline(aes(yintercept = 0.05)) + geom_vline(aes(xintercept = 0.05)) +
scale_x_log10() + scale_y_log10() +
labs(x = "Wilcoxon q-value", y = "DESeq2 q-value", size = "Mean relative\nabundance")
print(fig)
If we consider the smallest q-values from both methods, these OTUs are claimed differential abundant in both cases. In addition to these, the Wilcoxon test gave more small q-values than the DESeq2 approach here.
ANCOMBC
The R package ANCOMBC
is one out of several R packages
purpose made for metabarcoding data analysis, and with emphasis on
differential abundance testing. If you were to pick only one tool for
this job, this is a good candidate.
Let us first use the tool to see how we can get some differential abundance testing results, and we use the exact same data as we did above. Extend the R script we have from above with:
library(ANCOMBC)
# Assume we have gut.phy from above
gut_sub.phy <- subset_samples(gut.phy, bmi_category %in% c("overweight", "morbidobese"))
ancom.obj = ancombc2(data = gut_sub.phy,
fix_formula = "bmi_category",
p_adj_method = "fdr",
verbose = TRUE)
ancom.tbl <- ancom.obj$res
First, notice the function we actually use is named
ancombc2()
. Here we used a minimum of input, there are many
input arguments to this function you may explore. The
fix_formula
is needed to indicate what you want to test. By
naming the column bmi_category
we state we want to fit a
linear model with the factor (2 levels) in this column as explanatory
variable. This is the same as testing if the abundance is different
between these two levels (morbidobese
and
overweight
). The p_adj_method
is simply the
method in p.adjust()
to use for adjusting the p-values,
like we saw above. Here we use the "fdr"
to make it
identical to what we did earlier. The verbose
you can turn
on/off (TRUE
/FALSE
) to get some printout
during computations.The computations take some time.
Let us inspect the table with results:
View(ancom.tbl)
What we would be interested in here are the columns
lfc_bmi_categoryoverweight
q_bmi_categoryoverweight
The lfc
is short for log fold-change, while the
q
value is the adjusted p-value. What is meant by
bmi_categoryoverweight
? It is the name of the
fix_formula
column with one factor level appended. The
column bmi_category
is a text column (unless you did the
very first exercise in this module). Then, ancombc2
will
convert it to a factor, with the two levels ("morbidobese"
and "overweight"
) sorted alphabetically, i.e. with
morbidobese
as the first level. The first factor
level is always used as a reference level by
ancombc2
, which is standard behavior for many functions in
R, and we saw it also for the DESeq2
above. Thus, the
q-values under q_bmi_categoryoverweight
are for testing if
the abundance changes when going from the reference level
(=morbidobese
) to the level overweight
.
lfc_bmi_categoryoverweight
lists how much the abundance
changes when going from the reference level (=morbidobese
)
to overweight
. It is a log-transformed fold-change,
i.e. you divide the overweight
abundance by the reference
(=morbidobese
) abundance, and then take the logarithm.
There are no columns for morbidobese
since all information
is in comparing against the reference.
NB!. If you did the first exercise in this module,
you specifically changed the bmi_category
to a factor, and
also listed the factor levels. Then overweight
will come
before morbidobese
and instead of alphabetical ordering
these levels are now used. Then, the reference becomes
overweight
, and you get output columns like
lfc_bmi_categorymorbidobese
instead. The q-values will be
the same, but the log-fold change value will have opposite signs.
In the results above it seems like the
lfc_bmi_categoryoverweight
for Collinsella is
1.02
. Does this mean the abundance for Collinsella
is larger or smaller in morbidobese
compared to
overweight
persons? How much larger/smaller? Use the
information about reference level and the number 1.02
to
answer this.
Then, make a small R code to actually plot the raw data, and see if this agrees with the results.
Let \(A_o\) be the expected
Collinsella abundance for overweight
persons, and
\(A_m\) similar for
morbidobese
. Since the latter is the reference here, we get
that the log fold-change is \[lfc =
\log\left(\frac{A_o}{A_m}\right) = 1.02\]
By using the exponential function on both sides and rearranging the
equation we get \[A_o = e^{1.02}\cdot A_m =
2.77\cdot A_m\] Thus, we can see that \(A_o\) is larger than \(A_m\) and we can state Collinsella
is almost three times more abundant in overweight
persons
compared to morbidobese
persons.
otu.tbl <- otu_table(gut_sub.phy) %>%
as.matrix() %>%
t() %>%
as.data.frame()
collinsella.tbl <- otu.tbl %>%
mutate(sample_id = rownames(otu.tbl)) %>%
select(sample_id, Collinsella)
sample.tbl <- sample_data(gut_sub.phy) %>%
as.matrix() %>%
as.data.frame() %>%
full_join(collinsella.tbl, by = "sample_id")
ggplot(sample.tbl) +
geom_boxplot(aes(x = bmi_category, y = Collinsella))
Yes, it looks like the Collinsella abundances are in general larger
for overweight
compared to morbidobese
persons. The actual number are of course here heavily influenced by
library sizes, so we cannot from the raw data figure alone say how much
more abundant. For this we need some estimates, like we get from
ancombc2
.
If you again use the adjusted p-value threshold at 0.05, which taxa
come out as differential abundant according to ANCOMBC
?
Compare this to what we saw for the Wilcoxon test and
DESeq2
from above.
Here we choose to display the results inn a Venn diagram, and use the
R package ggvenn
:
all.tbl <- wilcox.tbl %>%
select(OTU, wilcox_q = q_value)
all.tbl <- deseq.tbl %>%
select(deseq_q = padj) %>%
mutate(OTU = rownames(deseq.tbl)) %>%
right_join(all.tbl, by = "OTU")
all.tbl <- ancom.tbl %>%
select(ancom_q = q_bmi_categorymorbidobese) %>%
mutate(OTU = rownames(ancom.tbl)) %>%
right_join(all.tbl, by = "OTU")
library(ggvenn) # install from github
DA.lst <- list(Wilcoxon = all.tbl$OTU[all.tbl$wilcox_q < 0.05],
DESeq2 = all.tbl$OTU[all.tbl$deseq_q < 0.05],
ANCOMBC = all.tbl$OTU[all.tbl$ancom_q < 0.05])
ggvenn(DA.lst)
We observe that for 33 OTUs all three methods agree these are
differential abundant (DA). As many as 22 OTUs are only found DA by the
Wilcoxon test, indicating perhaps this has a tendency to produce
slightly too many false positives? If we only compare
DESeq2
to ANCOMBC
we find they agree on 37,
but 7 are unique to DEseq2
while 4 are unique to
ANCOMBC
. If we require that at least 2 out of 3 methods
must agree, then we have 33+4+4+2 = 43 cases of DA OTUs here.
ANCOMBC
Let us dig a little into how the ANCOMBC
method works. A
comprehensive understanding of this requires much more statistics than
what is the prerequisits for this course, and we will therefore only try
to establish a rather shallow overview.
There are two scientific papers describing the methods used: https://pubmed.ncbi.nlm.nih.gov/29163406/ and https://pubmed.ncbi.nlm.nih.gov/32665548/. The latter is the main paper describing the method we will use below, but it builds partly on the first paper.
In module 11 we also talked briefly about this, here is a recap and extension from the first paper above.
What is important in the first paper is the treatment of zeros in the data. Metabarcoding data often contain many zeros, i.e. some OTU has 0 read counts in some sample. Such zeros arise for three distinct reasons, and for differential abundance testing should be treated different depending on this:
NA
indicating we have no reading at all here.Why is this important? Well, for good reason we would like to log-transform the read counts during the analysis. But, then we cannot have any zeros in our data, since the logarithm of zero is not defined (‘minus infinity’).
Imagine our samples are grouped into two sample groups
(e.g. overweight
and morbidobese
from above),
i.e. we have \(g=2\) sample groups. The
fundamental assumption of the grouping and testing is that all samples
in the same sample group come from the same underlying, ‘true’,
composition, and any variation between them is random noise. We expect
the (normalized) abundances within the group to be similar, but not
identical. Consider the data for OTU number \(i\), i.e. the corresponding row in the OTU
table. Then we have read counts from the first sample group and some for
the second.
An outlier zero for some OTU is defined as a zero that occurs in a few samples within a sample group when the other values for the same OTU, and in the same sample group, are not nearly zero at all. If an outlier zero is present, that observation is simply discarded from the data set, and we ‘loose’ one sample for OTU number \(i\). We do not discard entire samples, this is done for each OTU separately.
A structural zero for OTU number \(i\) means all read counts within a sample group are actually zero. This is detected when most values within the sample group are zero, but a few are not. It is in some sense the opposite of the outlier zeros. In this case the testing becomes really simple: If both sample groups have structural zeros for an OTU, then there is no differential abundance, simply because there is no abundance for that OTU! If one group is structurally zero and the other is not, then we have a differential abundance, simply because the OTU is present in one group but absent in the other. Thus, such OTUs can now be concluded upon right away, and are they are taken out of the further analysis.
The sampling zeros are those we are left with who are neither outlier nor structural zeros. For these we impute a small positive value, and since we are dealing with read counts, the value 1 is natural. Thus, we add one extra read to all read counts, i.e. we ‘donate’ one extra read to all read counts, assuming those who are now zero would have got 1 read had we sequence a little deeper. This of course introduces a small bias in our data, but we have bigger biases to deal with than this, so we consider this a reasonable choice.
Once this procedure is done, we can now log-transform all read count values, resulting in data with a distribution which is closer to the normal distribution or other well known distributions we have a machinery for in statistical theory.
The name, ANCOMBC
, is an acronym for “ANalysis of
COMpositional data with Bias Correction”. We mentioned compositional
data briefly in module 11, related to the \(CLR\) transform.
By compositional data we refer to data where the relative composition (between OTUs) is informative, while the total sum in each sample is not. Such type of data occur in many fields of science, and here we see an example. If you sum all read counts for a sample, you find this library size varies a lot between samples. This has nothing to do with how many bacteria of various taxa there are in the samples from the start. First, we extract DNA from the environment, and this is not equally effective each time we do it. Second, we do PCR amplifications. The number of PCR-cycles may also vary, and a sample with few bacteria may end up with many amplicons and vice versa depending on the PCR step. Finally, the sequencing technology will also partly determine how many reads we get. Thus, there is no information left in the total read count that can tell us anything about how many bacteria we started out with. If the latter is important, you must do some additional efforts (e.g. qPCR). All the information we get from metabarcoding data lies in the relative abundances, i.e. how much we have of one bacteria relative to the others within the same sample.
In the last paper linked to above, there is a figure (Figure 2) that
may give you the impression that with ANCOMBC
you can
actually estimate the real abundance of the bacteria (microbial load) in
the environment. I would say this figure is a bad illustration. What
they in the paper name a sample fraction is not what is shown
in that figure. The sample fraction estimated and given as output by
ANCOMBC
is not a fraction (value between 0 and 1) at all,
as suggested in that figure, but a sample scaling with values
around 1.0, some larger and some smaller than 1.0. Instead, think of
this sample scaling as a way to explain the observed differences in
library size.
The BC
in the ANCOMBC
is short for bias
correction. What is this bias? In order to understand this, we need to
dive into the models behind ANCOMBC
. The stuff I now
present has been taken from the second paper linked to above, but I use
a slightly different notation.
Consider OTU number \(i\) out of a total of \(m\) OTUs, i.e. \(i=1,...,m\). Consider its read counts in sample group \(j=1,2\) i.e. we have two sample groups (1 and 2 instead of A and B). This can easily be extended to more than two sample groups, but for simplicity we stick to this now. Within sample group \(j\) we have \(n_j\) samples, i.e. sample \(jk\) is sample number \(k=1,...n_j\) inside sample group \(j\). For the read count data we may then write \[E(Y_{ijk}) = \gamma_{jk}\theta_{ij}\]
where \(\theta_{ij}\) is the expected abundance of OTU \(i\) in sample group \(j\) and \(\gamma_{jk}\) is some constant. The \(\theta_{ij}\) is the true underlying abundance of OTU number \(i\) in group \(j\), but this is unknown to us. Notice that
After log-transform we may write this as the basic
ANCOMBC
model equation: \[y_{ijk} = \delta_{jk} + \mu_{ij} +
e_{ijk}\] where \(y_{ijk} =
\log(Y_{ijk})\) is the logarithm of the read counts
(log-counts). You may think of \(\delta_{jk}=\log(\gamma_{jk})\) as the
log-sample-scaling (LSS) and \(\mu_{ij} = \log(\theta_{ij})\) as
log-abundance (even if this is not 100% correct).
The error term \(e_{ijk}\) is the random noise that creates variation in the observed data in addition to what can be explained by the other terms on the right hand side. As always, \(E(e_{ijk}) = 0\) so this term only contributes with variance, no change in expected values (bias). It is assumed to have the same variance regardless of OTU (index \(i\)) or samples (indexes \(jk\)).
The expected log-counts are therefore described by \[E(y_{ijk}) = \delta_{jk} + \mu_{ij}\] Note that the \(\mu_{ij}\) are the quantities we are interested in, since these are the true underlying log-abundances for OTU \(i\) in sample group \(j\). Note also that this is the same for all samples in sample group \(j\), but the LSS, \(\delta_{jk}\), may add/subtract to the observed log-counts for each sample \(k\).
Let us now consider some averages. Let \[\bar{y}_{ij.}=\frac{1}{n_j}\sum_{k=1}^{n_j}y_{ijk}\] be the average log-count for OTU \(i\) in sample group \(j\). Notice how we use the bar to indicate an average, and the dot indicates which index we average over. Since the \(y_{ijk}\) are our observed data, we can always compute such averages. From the model equation above it follows that \[E(\bar{y}_{ij.}) = \bar{\delta}_{j.} + \mu_{ij}\] Notice how the right hand side contains the \(\mu_{ij}\) we are interested in, plus the LSS averaged over the samples in sample group \(j\).
Similar, we define \(\bar{y}_{.jk}\) as the average log-count in sample \(jk\), averaged over all OTUs. Then, \[E(\bar{y}_{.jk}) = \delta_{jk} + \bar{\mu}_{.j}\] where the last term is the average abundance, averaged over all OTUs. From the last equation we see that \[\delta_{jk} = E(\bar{y}_{.jk}) - \bar{\mu}_{.j}\] and a least-square estimator for the LSS \(\delta_{jk}\) is \[\hat{\delta}_{jk} = \bar{y}_{.jk} - \bar{y}_{.j.}\] where we simply replace the expected value for \(\bar{y}_{.jk}\) by its actually observed value, and the average log-abundance by the average log-count, averaged over all OTUs and samples inside sample group \(j\).
Next, recall from the model equation above that \(E(y_{ijk}) = \delta_{jk} + \mu_{ij}\), i.e. if we just rearrange \(\mu_{ij} = E(y_{ijk}) - \delta_{jk}\). From this, it seems natural to estimate \(\mu_{ij}\) by \[\hat{\mu}_{ij} = \frac{1}{n_j}\sum_{k=1}^{n_j}(y_{ijk} - \hat{\delta}_{jk})=\frac{1}{n_j}\sum_{k=1}^{n_j}y_{ijk} - \frac{1}{n_j}\sum_{k=1}^{n_j}\hat{\delta}_{jk} = \bar{y}_{ij.}-\bar{\hat{\delta}}_{j.}\] If we then plug in for the estimator \(\hat{\delta}_{jk}\) from above we get \[\hat{\mu}_{ij} = \bar{y}_{ij.} - \bar{\hat{\delta}}_{j.} = \bar{y}_{ij.} - \frac{1}{n_j}\sum_{k=1}^{n_j}(\bar{y}_{.jk}-\bar{y}_{.j.}) = \bar{y}_{ij.} - \frac{1}{n_j}\sum_{k=1}^{n_j}\bar{y}_{.jk} + \frac{1}{n_j}\sum_{k=1}^{n_j}\bar{y}_{.j.} = \bar{y}_{ij.} - \bar{y}_{.j.} + \bar{y}_{.j.} = \bar{y}_{ij.}\] Thus, we estimate the true log-abundance for OTU \(i\) simply by averaging the observed log-counts over all samples in sample group \(j\).
If this was an unbiased estimator, we should now have that \(E(\hat{\mu}_{ij}) = \mu_{ij}\). This is not the case. It is instead \(E(\hat{\mu}_{ij}) = \mu_{ij} - \bar{\delta}_{j.}\). This means that when we specify the null hypothesis for OTU \(i\) \[H_0: \mu_{i1} = \mu_{i2}\] we cannot use the difference in estimates \(\hat{\mu}_{i1} - \hat{\mu}_{i2}\) directly as an indicator if this is true or not because its expected value is \(E(\hat{\mu}_{i1} - \hat{\mu}_{i2}) = \mu_{i1} - \mu_{i2} + \bar{\delta}_{i2} - \bar{\delta}_{i1}\), and the extra term \(\bar{\delta}_{i2} - \bar{\delta}_{i1}\) means that the test statistic in general become larger than it should, and we tend to reject \(H_0\) when we should in fact not (false positives, claiming we have differential abundance when in fact we have not). The rest of the method description is very much about how to estimate and correct for this bias by refining the least squares estimates above, but it is beyond our scope to dig into this.
This bias-correction will in general mean that we get fewer significant OTUs compared to methods ignoring this, like the Wilcoxon test we used directly on \(TSS\) normalized data. In the second paper mentioned above you can read more about this, and see some comparisons to other methods, but it is difficult to trust such comparisons unless they are done by some neutral third party. It is better to try to understand the models and algorithms, and make up your mind if this seams reasonable.
ANCOMBC
settingIn module 11 we saw the \(CLR\) transform for normalizing metabarcoding data. If \(Y_{ijk}\) is the read count for OTU \(i\) in sample \(jk\) as above, let \(z_{ijk}\) be the CLR-transformed data.
Write first how this can be expressed as the log-counts \(y_{ijk}\) and some of its averages.
What is the average CLR-value inside sample \(jk\), i.e. averaged over all OTUs?
Discuss these results in the light of the bias from the
ANCOMBC
model.
To \(CLR\) transform we would first need to compute the geometric mean for sample \(jk\): \[g_{jk} = (Y_{1jk}\cdot Y_{2jk}\cdots Y_{mjk})^{ 1/m}\]
Then we would compute our transformed observations as: \[z_{ijk} = \log\left( \frac{Y_{ijk}}{g_{jk}}\right)\]
If we use the rules for logarithms, we get \[z_{ijk} = \log(Y_{ijk}) + \log(g_{jk}) = y_{ijk} - \frac{1}{m}\sum_{i=1}^{m}\log(Y_{ijk}) = y_{ijk} - \bar{y}_{.jk}\] Thus, the \(CLR\) transformed value \(z_{ijk}\) is the log-count \(y_{ijk}\) minus the average log-count for the sample \(jk\), averaged over all OTUs.
The average \(CLR\) value inside sample \(jk\) is then \[\bar{z}_{.jk} = \frac{1}{m}\sum_{i=1}^{m}z_{ijk} = \frac{1}{m}\sum_{i=1}^{m}y_{ijk} - \frac{1}{m}\sum_{i=1}^{m}\bar{y}_{.jk} = \bar{y}_{.jk} - \bar{y}_{.jk} = 0\] Which is why it is called centered, the average \(CLR\)-value in a sample is always zero.
Does this cope with the bias that is the focus of
ANCOMBC
? The reason for this bias are the LSS \(\delta_{jk}\). From above we see that \[z_{ijk} = y_{ijk} - \bar{y}_{.jk}\] Thus,
the \(CLR\) values are the log-counts
minus ‘something’ that we may interpret as this bias. From further up we
saw that in the ANCOMBC
setting the estimate for the LSS
was \[\hat{\delta}_{jk} = \bar{y}_{.jk} -
\bar{y}_{.j.}\] Thus, if \(\hat{\delta}_{jk} = \bar{y}_{.jk}\) then
what we subtract in the \(CLR\) would
be the (estimate of) the ANCOMBC
bias.
Note that if we do the \(CLR\)
transform, we have no grouping of samples, each sample is treated
individually. In ANCOMBC
we have defined the sample groups,
and thereby can also make use of this information, resulting in a
slightly different normalization.
In all examples above we compared two groups of samples. In real life we may be interested in more complex comparisons, like
In the case of the Wilcoxon test, to cope with this we need to
replace it with the Kruskal-Wallis test, which is built on a
the same idea of using ranks rather than actual observed values. For
both the DESeq2
and the ANCOMBC
these
extensions are also possible. We will focus on the ANCOMBC
here now.
We can start with extending to a factor with more than 2 levels. Let
us simply re-run by using the bmi_category
as the grouping
variable, but use all data so that this is a factor with 5 levels (this
takes some time):
# Assume we have gut.phy from above
ancom_multi.obj = ancombc2(data = gut.phy,
fix_formula = "bmi_category",
p_adj_method = "fdr",
verbose = TRUE)
ancom_multi.tbl <- ancom_multi.obj$res
The resulting table is quite similar to what we saw previously, but
has many more columns. Since bmi_category
contains 5
different texts, it is converted into a factor with 5 levels. These are
ordered alphabetically unless you already converted to a factor
yourself, and gave the levels
a different ordering. With
alphabetical sorting the level lean
becomes level 1, and
hence the reference level. All the other 4 groups are now compared
against this, i.e. the q-values in the column
bmi_categoryoverweight
now indicates if the abundance in
the overweight
group is significantly different from in the
lean
group. This means the ordering of the groups become
more important! The first factor level is always treated as the
reference, and all others are compared against this. Thus, what we did
in the first exercise in this module now becomes important, since this
is how you decide what is the reference factor level.
How about if we still wanted to test if there is a difference between
overweight
and morbidobese
? In fact, all the
other pairwise comparisons in addition to those where we only compare
against lean
? We can get some results for these by
re-running the analysis, requesting pairwise testing. We will
try this, but use only a subset of 3 factor levels (lean
,
overweight
and morbidobese
) since these
computations take a lot more time:
# Assume we have gut.phy from above
gut_sub3.phy <- subset_samples(gut.phy, bmi_category %in% c("lean", "overweight", "morbidobese"))
ancom_pair.obj = ancombc2(data = gut_sub3.phy,
fix_formula = "bmi_category",
p_adj_method = "fdr",
group = "bmi_category", # define groups specifically here
pairwise = TRUE, # we want pairwise tests
verbose = TRUE)
ancom_pair_res.tbl <- ancom_pair.obj$res
ancom_pair_res_pair.tbl <- ancom_pair.obj$res_pair
Notice how we now need to specify the grouping by the
group
argument. The fix_formula
may be more
complex, and this is why we need to specify the grouping specifically
here. Also, we set pairwise = TRUE
.
If we look into the results table as before (in
ancom_pair.obj$res
) this is no different, we test against
the reference level lean
. The extra table in
ancom_pair.obj$res_pair
contains the results for the
additional comparisons as well. In this case the columns with
bmi_categoryoverweight_bmi_categorymorbidobese
will have
the results for comparing groups overweight
and
morbidobese
, none of them being the reference group.
How about a more complex model where we include more than one factor?
This is also straightforward in ancombc2
. Let us include
also the sex
column to see if some OTUs differ between male
and female as well:
ancom_2fac.obj = ancombc2(data = gut_sub3.phy,
fix_formula = "sex+bmi_category",
p_adj_method = "fdr",
verbose = TRUE)
ancom_2fac_res.tbl <- ancom_2fac.obj$res
Notice how we now changed the fix_formula
input, adding
two factor variables. This means we look for effects of both
sex
and bmi_category
on the abundances for
each OTU. If you are familiar with Analysis-of-variance (ANOVA) this is
exactly the same. We specify a linear model where we have main
effects of sex
and bmi_category
.
The results we get are of the same type as we have seen before, but
now we test for differential abundance both between female
and male
as well as between lean
and the the
other two bmi_category
s. The results will in general differ
from if you did this in two separate analyses, first using
sex
as grouping variable and then
bmi_overweight
. The reason is that the residuals will in
general become smaller now, and you fit the data better than if you did
two separate analyses. You can imagine that if we only grouped by the
sex
variable, then all variation which is due to the three
bmi_catgory
levels would be seen as ‘random noise’. By
including bmi_category
in the model as well, the ‘random
noise’ is now what is left over when both effects of sex
and bmi_category
has been subtracted. Less random noise
makes the picture clearer, and we get closer to seeing the actual
differences in abundances.
You may again also request pairwise testing, but you can only group
the data according to one variable. In this case it must be the
bmi_category
variable, since the sex
only 2
groups, and no additional pairs can be tested.
You may also add numerical variables, e.g. the Age
column in the current data. You may also add random effects in the
model, but since this is all beyond the prerequisites for this course we
will not study this here. Also, remember that by adding more and more
effects the result you get become more and more difficult to interpret.
There is no point in fitting a model that you do not understand.
Instead of focusing on the separate OTU or taxa, we may instead be interested in testing if there is any significant difference in the composition between groups of samples. For each sample we have a vector (row) of data indicating the (relative) abundance of the various OTUs in each sample. This is the composition or profile of the microbial community.
When we have data for several samples from the same treatment/condition/site these will typically be somewhat similar to each other, but not identical. The expected composition for that particular treatment/condition/site we may think of as an average of all the composition profiles we could have collected. Imagine we collected infinite many samples from some condition A, and then computed this expected composition. The composition for each sample are then variants of this, due to any kind of ‘noise’ that may be due to the biology itself or our way of sampling/sequencing/processing of the data. If we also did the same for some condition B, we could compare the expected profiles directly, and see if they are identical or not. Any tiny little difference would be enough to state they are not identical.
In reality we do not have these expected profiles, since we have only some, not infinite many, samples. We still want to test if there is a difference in the expected composition between the treatments/conditions/sites, but now based only on the data we have. This means we have to resort to some hypothesis testing and probabilistic statements again.
As always we must be aware of the null-hypothesis we are testing. If you report a p-value, and you do not know what you are actually testing, it becomes meaningless. Assume we have two conditions, A and B. Let \(\bf{\mu}_A\) and \(\bf{\mu}_B\) be the expected compositions for both treatments, respectively. Think of them both as vectors, the ‘average’ samples for the two conditions. Then our null-hypothesis is: \[ H_0: \bf \mu_A = \mu_B\]
The alternative, \(H_1\) is then
they are not identical, i.e. any (tiny) difference between them. If this
\(H_0\) is true, we expect the
composition profiles that we actually observe from condition A and B are
quite similar to each other. They need not be identical, we always see
some variation in composition profiles even within the same condition.
Thus, we need to compare the composition profiles from both conditions
and some clever way in order to shed some light on the null-hypothesis
above.
adonis2
testThe Analysis of Variance (ANOVA) you should have heard of in introductory statistics. In ANOVA we typically have a single response variable of observed data, and then we have one or more factors, i.e. categorical variable with two or more factor levels. These factors are typically something we have designed in some experimental design. A classical example is where we grow crops under various types of fertilizers. Then the response is the amount of crops we harvest in each case and the factor is the variable indicating which fertilizer type (each type is a new category) was used in each case. We have then done repeated experiments, having a number of samples for each category. In ANOVA we then compare variation in response between categories to those within categories, and from this we compute p-values.
The huge difference to our setting is that we do not have a single ‘response’, but rather entire vectors of many responses. If we were interested in the single OTUs, we are back to the differential abundance testing above. So how do we compare entire profiles?
Well, we saw some distance metrics, or beta-diversity measures, in module 11. They do exactly this. Using one of these, we get a distance matrix \(D\) with one row and column for each sample. If the samples from group A are distinctly different from the samples from group B, then we should observe that
In the test below we compute the between-group squared distances and compare this to the within-group squared distances. This is exactly the same idea as in an ANOVA F-test, except that there we use sum-of-squares instead of squared distances. But a sum-of-squares may be thought of as a squared euclidean distance. In the F-test we assume normal distributed data, and then the sum-of-square F-statistic become F-distributed. Since we now allow any distance metric, and the data may be far from normal, we have no idea of the true distribution of the statistic we compute. We may still compute such test-statistics, but in order to convert it to some p-value, we need to put this statistic into some probability density.
To resolve this the PERMANOVA instead uses a permutation test, which is also the reason for the name of this procedure (permutation based ANOVA). Here is the main idea:
If the null-hypothesis is true, and there is no real difference between group A and B, then samples may be re-assigned randomly (shuffled) to any group, and the between-versus-within statistics should come out more or less the same as before. This is the permutation, we simply re-assign our samples randomly to groups A and B. Some of the samples who actually belong to group A may now be in group B and vice versa. The number of samples in each group remains the same. After each re-assignment we re-compute the between-group and within-group distances, and repeat this many times. This forms a distribution, think of a histogram of these values. If the statistic with the original grouping becomes (much) different from the permuted ones, we conclude group A and B actually have significantly different compositions.
There is one pitfall in this test: If group A (or B) has huge within-group-distances compared to B (or A), then this test may also give us a small p-value, even if the ‘average’ composition is the same. We should therefore always also consider the within-group-distances separately for all groups, to verify these are not wildly different from each other.
This PERMANOVA procedure is implemented in the function
adonis2()
in the vegan
package. The
vegan
package should already be installed, since it is a
dependency for the phyloseq
, but if not, install it fram
CRAN.
Let us try this to test if we have a significant difference in
composition between the overweight
and
morbidobese
samples in our data.
First, we compute the Bray-Curtis distances for the data from above:
library(vegan)
# Assume we have gut.phy from above
gut_sub.phy <- subset_samples(gut.phy, bmi_category %in% c("overweight", "morbidobese"))
d.bc <- phyloseq::distance(gut_sub.phy, method = "bray")
Any distance metric may be used here, our choice of the Bray-Curtis is rather random.
Digression: Note the syntax phyloseq::distance()
above.
Functions named distance()
is found in several R packages,
and when we loaded the DESeq2
package above, the
phyloseq
version of this function was replaced by another.
By writing phyloseq::distance
we specify that we now want
the function distance()
from the phyloseq
package, and nothing else. You may sometimes need to specify this in R
code, if you load many packages.
Next, we need the sample_data
from our
phyloseq
object, this is for the grouping information only
(the bmi_category
column).
sample.tbl <- sample_data(gut_sub.phy) %>%
as.matrix() %>%
as.data.frame()
Then we can perform the test, using the function
adonis2()
in the vegan
package:
tst <- adonis2(d.bc ~ sample.tbl$bmi_category)
print(tst)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = d.bc ~ sample.tbl$bmi_category)
## Df SumOfSqs R2 F Pr(>F)
## sample.tbl$bmi_category 1 1.489 0.04049 9.0306 0.001 ***
## Residual 214 35.296 0.95951
## Total 215 36.786 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let us have a brief look at the results in the output table. Below I use the notation found in the original article describing the PERMANOVA test, see https://onlinelibrary.wiley.com/doi/10.1111/j.1442-9993.2001.01070.pp.x.
First, we compute the total sum of squared distances, i.e. we use all
distances. This we denote \(SS_T\), and
in the output above this has the value 36.786
(last value
under SumOfSqs
). The within squared distances, \(SS_W\), is 35.296
(second
value under SumOfSqs
). This results in the between group
sum of squared distances is \(SS_A = SS_T -
SS_W\), which is 1.489
here (first value under
SumOfSqs
). Does this mean the spread is larger within than
between? No, we have to divide by the degrees of freedom as well. Since
we have 2 groups, the number of parameters is \(a=2\). We also have \(N=216\) observations (samples). The degrees
of freedom are listed under Df
in the output. This means we
get the test statistic \[F = \frac{SS_A /
(a-1)}{SS_W / (N-a)} = \frac{1.489/1}{35.296/214} = 9.0306\] We
call the test statistic \(F\) even if
it is in general not F-distributed. It is still e measure of how large
the between-variation (numerator) is compared to the within-variation
(denominator). To get a p-value, we need to know how big the F-values
will usually get when \(H_0\) is true.
To unravel this we now make 999 permutations to get 999 new F-values,
and together with our original one, we now have 1000 such F-values. The
p-value is then simply the fraction of these who are at least as large
as the original one at 9.0306
. Here the p-value (under
Pr(>F)
) becomes 0.001
, which means the only
one among the 1000 F-values who was at least as large as the original
was the original itself. None of the permuted values were larger. This
is the smallest p-value we can get with 999 permutations. You may try
more permutations if you like, just to see if any permuted F-value could
ever become larger than the original at 9.0306
.
From the small p-value we may conclude that the groups differ in composition.
However, before we conclude, we should inspect the
within-group-distances, since we know that if the groups differ much in
variation, this may effect the outcome of this test. The
vegan
package also has a function for computing the spread
(dispersion) within each group, the betadisper()
:
within.disp <- betadisper(d.bc, sample.tbl$bmi_category)
print(within.disp)
##
## Homogeneity of multivariate dispersions
##
## Call: betadisper(d = d.bc, group = sample.tbl$bmi_category)
##
## No. of Positive Eigenvalues: 94
## No. of Negative Eigenvalues: 118
##
## Average distance to median:
## morbidobese overweight
## 0.4202 0.3882
##
## Eigenvalues for PCoA axes:
## (Showing 8 of 212 eigenvalues)
## PCoA1 PCoA2 PCoA3 PCoA4 PCoA5 PCoA6 PCoA7 PCoA8
## 8.387 5.600 4.177 3.568 2.978 1.741 1.204 1.054
We notice the ‘Average distance to median’ is quite similar for both groups. To actually test if this is a significant difference we may use
tst.w <- permutest(within.disp)
print(tst.w)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 1 0.02016 0.020164 1.8431 999 0.165
## Residuals 214 2.34110 0.010940
And we notice the p-value is large. Groups overweight
and morbidobese
have comparable within-group-distances, and
the reason for the small p-value in the adonis2()
test must
be due to a significant difference in composition between these
groups.
Again we may want to extend our analysis to comparing more than 2
groups and/or combine several factor variables. This is straightforward,
by extending the formula we use to specify the model in the
adonis2()
function.
bmi_category
s and
sex
Include all data from gut.phy
, and compute the
bray-Curtis distances in this case. Then, run the adonis2()
to test for difference in composition for both sex
and
bmi_category
. As for ANCOMBC
you simply add
the two factor variables in the formula.
There is no pairwise testing option here. Discuss what it means if
the variable bmi_category
(having 5 levels or groups) comes
out as significant (small p-value)?
d5.bc <- phyloseq::distance(gut.phy, method = "bray")
sample.tbl <- sample_data(gut.phy) %>%
as.matrix() %>%
as.data.frame()
tst <- adonis2(d5.bc ~ sample.tbl$bmi_category + sample.tbl$sex)
print(tst)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = d5.bc ~ sample.tbl$bmi_category + sample.tbl$sex)
## Df SumOfSqs R2 F Pr(>F)
## sample.tbl$bmi_category 5 5.454 0.03157 6.7456 0.001 ***
## sample.tbl$sex 1 0.907 0.00525 5.6108 0.001 ***
## Residual 1029 166.395 0.96318
## Total 1035 172.756 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A small p-value for a multi-category variable simply means at least one of the groups have a composition differing from at least one other group. It says nothing about which group differ from which. To get this information, you need to test them against each other by sub-setting the data and do the pairwise analysis we started out with.