1 Learning goals

We end by a module on how we may perform and understand the basics of some hypothesis testing on microbiota read count data.

  • Testing for differential abundant OTUs. This means we have samples from various groups and we are interested in OTUs who are systematically more or less abundant between groups. The groups may for example be we have sampled the microbial communities in the gut from some healthy and some diseased people, one sample for each person. Finding the differential abundant (DA) OTUs between healthy and diseased may help us either diagnose people or even understand what may be the reason for the disease.
  • Testing for differential composition using a PERMANOVA test. The setting is similar to above, but instead of testing for single OTUs being of differential abundance we ask if the entire composition changes between groups.



1.1 Software used in this module

In this last module we DO NOT USE ORION at all. There are two good reasons for this:

  • The data we handle are inside phyloseq objects, and these are small enough to be handled by most computers. No need for HPC anymore.
  • The R packages we need are a severe hassle to install on our current RStudio solution for BIN310.

Thus, you should install R and RStudio on your own computer (you probably already have this), and then install the following R packages from Bioconductor:

This may take quite some time, since there will be many dependencies. As usual, we also use the tidyverse packages a lot below (wrangling and plotting).





2 Human gut data

Let us use a data set from the human gut. Here the OTUs have been classified to the genus rank, and the genus names are as the ‘OTU names’. The data are found inside a phyloseq object that you may load into R by

library(tidyverse)
library(phyloseq)
load(url("https://arken.nmbu.no/~larssn/teach/bin310/gut.phy.RData"))



2.1 Know your data

Before you start analyzing (any) data, you must have a minimum of insight into the size and structure of the data set. Let us start by repeating some stuff from module 11 and get some insight into the data.

2.1.1 Exercise - the samples

Extract the sample_data from the phyloseq object, and store it in sample.tbl. What is the data type of sample.tbl?

The people behind the phyloseq package has decided to make ‘special’ data types for data structures where we could easily have used an ordinary table or matrix. This makes it a hassle to work with in some respects. Convert the sample.tbl to an ordinary data.frame. Hint: First convert it to a matrix, then to data.frame.

Inspect this table. How many samples are there?

There are 4 columns of data about the samples. What data types (numerical, text, factor?) are these? Hint: Use the summary() function.

We want to have

  • sex as a factor.
  • age as numeric.
  • nationality as a factor.
  • bmi_category as a factor with levels in this order: c("underweight", "lean", "overweight", "obese", "severeobese", "morbidobese")

Make these changes to the sample data table. Then, make an updated phyloseq object with this table as the sample_data.

Keep the sample.tbl as a separate object as well, we will use it again later.

2.1.2 Exercise solution

sample.tbl <- sample_data(gut.phy) %>%
  as.matrix() %>%
  as.data.frame() %>%
  mutate(sex = factor(sex)) %>%
  mutate(age = as.numeric(age)) %>%
  mutate(nationality = factor(nationality)) %>%
  mutate(bmi_category = factor(bmi_category, levels = c("underweight", "lean", "overweight", "obese", "severeobese", "morbidobese")))
gut.phy <- phyloseq(sample_data(sample.tbl),
                    otu_table(otu_table(gut.phy)),
                    tax_table(tax_table(gut.phy)))

2.1.3 Exercise - the read count table

Extract the read count table (OTU table) from the gut.phy object, and store in otu.mat. What data type is this?

Convert it to a matrix. How many OTUs are there in this data set?

Compute the library sizes for all the samples, and make a histogram of them (use sample_sums() or colSums()).

Keep the otu.mat as a separate object, we will use it again later.

2.1.4 Exercise solution

otu.mat <- otu_table(gut.phy) %>%
  as.data.frame() %>%
  as.matrix()
#View(otu.mat)
cat("There are", nrow(otu.mat), "rows, i.e. OTUs in this data matrix\n")
## There are 130 rows, i.e. OTUs in this data matrix
library.sizes <- sample_sums(gut.phy) # alternative 1
library.sizes <- colSums(otu.mat)     # alternative 2
ggplot() +
  geom_histogram(aes(x = library.sizes), bins = 25)

2.1.5 Exercise - alpha diversity

We have heard that the alpha diversity of the gut may be important for your health, i.e. a too low diversity is bad. Make a plot where you display the Shannon alpha diversity such that we can see if there is an association between this an the bmi_category variable.

2.1.6 Exercise solution

A straightforward richness plot may be used, but we also add a layer of boxplot to see where the majority of samples are in each category:

plot_richness(gut.phy, x = "bmi_category", measures = "Shannon") +
  geom_boxplot()

There is some tendency that diversity goes down as obesity increases.

This figure is sub-optimal since it displays both all data as point and then boxplots on top. In a report I would make it like this:

estimate_richness(gut.phy, measures = "Shannon") %>%
  bind_cols(sample.tbl) %>%
  ggplot() +
  geom_boxplot(aes(x = bmi_category, y = Shannon), fill = "grey") +
  labs(x= "BMI category", y = "Alpha diversity (Shannon)")

Notice that the BMI categories are now sorted in the ‘correct’ order along the x-axis. This is because we made it into a factor with levels in that exact order. If we had not done that, ggplot() would have sorted them alphabetically, which is not a good way to display them. Categories who have some ordering should always be displayed in that order!

This ordering of the categories is also very important for two of the methods we will use below.





3 Differential abundance testing

In the data above we have collected metabarcoding data from the gut of some people. We may then group these people into categories, we saw this above. We may then be interested in: Are there any OTUs who are consistently different in abundance between these groups? Identifying such OTUs may be of interest in order to understand more about the interplay between host and gut microbiota.

In general we have sampled some microbial communities from various treatments/conditions/sites, which means we can group our samples by this knowledge. This means we have an OTU table (otu_table in phyloseq) with (normalized) read counts, and if the samples are the columns, we may group the columns. The samples are grouped by our knowledge of the various treatments/conditions/sites that our samples are from. We typically have this type of information in our sample table (sample_table in phyloseq).

Let us now focus on one OTU at the time, and we denote this OTU number \(i\). Let us assume there are two groups only, and we denote them A and B. Formally, we may think like this: If we had infinite many samples from both groups, we could compute the expected abundance for OTU \(i\) for both groups by averaging over the infinite many samples. If these expected values are identical there is no differential abundance for this OTU between the two conditions. If we denote these expected abundances \(\mu_{i,A}\) and \(\mu_{i,B}\), respectively, we have the null hypothesis: \[H_0: \mu_{i,A} = \mu_{i,B} \] In reality we do not have infinite many samples, and we have to make some probabilistic statement based on the data we have, but we must be aware this is the hypothesis we now test.

Note that there will now be a new hypothesis test for each new OTU, \(i=1,2,...\). We simply have to loop over all OTUs, and perform the same test again and again based on the data for each OTU. In the cases where we can reject \(H_0\) we have found something that seems to differ consistently between the two conditions. These are the OTUs we find interesting.

3.1 A word on normalization

Notice that when testing for differential abundance, we have to normalize the raw read counts in some way. If not, the number of reads we have sequenced will influence the outcome of the test, which is nonsense. It is not like people are diseased because you sequenced their gut bacteria with more reads than you did for the healthy people!

This means the data in our OTU table must either already be normalized in some way, or the procedure we employ do this normalization as part of the method itself.



3.2 The Wilcoxon test

Assume we have the \(TSS\) normalized data, which means we have relative abundance data and the sum of abundances for each sample is 1.0. A rarefaction would also have a similar effect. A very simple approach to the differential abundance testing is then to employ the two-sample t-test for each OTU. You should be familiar with this from introductory statistics.

However, this is not commonly used, simply because it assumes the data are normal distributed. We know that data of this type (read counts) often is very far from normal, with many smallish values and some very big ones. But, this depends on you normalization, and we will not state that a t-test cannot be used. The test is actually quite robust to the assumption of normality.

An alternative to the t-test, and equally simple, is the Wilcoxon two-sample test. This is a non-parametric test meaning it makes no assumptions about the distribution of the data. Instead, the data are just sorted, and their ordering (rank) is used in the test. Note that if \(H_0\) is true, then the values from the two groups should come in a rather mixed order after sorting. If you simply sum the ranks for group A and B, they should come out fairly similar. This is what we use in the Wilcoxon test. If, say, all the abundances from group A are larger than all the values from group B, the sum of orders become very different, indicating we can reject \(H_0\) and claim we have found a differential abundance. The p-value come from computing how often we would see a difference in orders at least as large as the one we compute, given that \(H_0\) is true and that we sample at random the orders of the samples from the two groups.

The reason we do not always use this test, instead of the t-test, is that it has less power than the t-test. This means we need to have many samples in order to reject \(H_0\) with this approach. Also, it does not use the actual abundance values, only their ordering. This means huge variances are not ‘seen’ in the same way as for instance in a t-test. Also, for metabarcoding data there are some other biases who are ignored by this simple approach, and it may tend to come out with slightly too many differential abundant taxa. However, the Wilcoxon test is simple to understand and always something you should have in your ‘toolbox’. We use it here as an entry into this topic, simply because everything is very transparent and easy to follow.

3.2.1 Exercise - run the wilcox.test()

We saw in an exercise above that the diversity seemed to be lower in people from the category morbidobese compared to the overweight. Let us see if we can identify some OTUs who have consistently different abundances between these two categories.

Let us make the code required to run the Wilcoxon test on each OTU in the gut data set, and use its relative abundance values from the categories overweight and morbidobese only.

First, use the transform_sample_counts() on the gut.phy object to get relative abundances in all samples (this is the \(TSS\) normalization from module 11). Next, use the subset_samples() function to extract only the samples where bmi_category %in% c("overweight", "morbidobese"). Store the result in a new phyloseq object.

Extract the sample.tbl and otu.mat from this new phyloseq object, see exercise above.

To perform the Wilcoxon test on the first OTU, the code is

tst <- wilcox.test(otu.mat[1,] ~ sample.tbl$bmi_category)

The p-value for this test is then found in tst$p.value.

Make the code to run the test for all OTUs (not only the first), and collect the p-values as a column in a table. This table should also have a corresponding column with the OTU-labels such that we know which OTU belongs to which p-value.

Here is a code skeleton you may start with:

# Assume we have the gut.phy from above
# We subset, using only the two groups of interest
gut_sub_norm.phy <- transform_sample_counts(___, function(x){x / sum(x)}) %>% 
  subset_samples(___)

# We collect the sample.tbl and otu.mat from the phyloseq object
sample.tbl <- sample_data(___) %>% 
  as.matrix() %>% 
  as.data.frame()
otu.mat <- otu_table(___) %>% 
  as.data.frame() %>% 
  as.matrix()

# Time to run the tests!
wilcox.tbl <- data.frame(OTU = rownames(___),
                         p_value = -1)
for(i in ___){
  tst <- wilcox.test(___[i,] ~ sample.tbl$bmi_category)
  wilcox.tbl$p_value[i] <- ___
}

3.2.2 Exercise solution

# Assume we have the gut.phy from above
# We subset, using only the two groups of interest
gut_sub_norm.phy <- transform_sample_counts(gut.phy, function(x){x / sum(x)}) %>%
  subset_samples(bmi_category %in% c("overweight", "morbidobese"))

# We collect the sample.tbl and otu.mat from the phyloseq object
sample.tbl <- sample_data(gut_sub_norm.phy) %>%
  as.matrix() %>%
  as.data.frame()
otu.mat <- otu_table(gut_sub_norm.phy) %>%
  as.data.frame() %>%
  as.matrix()

# Time to run the tests!
wilcox.tbl <- data.frame(OTU = rownames(otu.mat),
                         p_value = -1)
for(i in 1:nrow(otu.mat)){
  tst <- wilcox.test(otu.mat[i,] ~ sample.tbl$bmi_category)
  wilcox.tbl$p_value[i] <- tst$p.value
}

At least we get some small p-values (well below 0.05), so perhaps we have found some interesting OTUs here?



3.3 Multiple testing

Whenever we make a series of hypothesis tests, and ‘search’ for interesting findings in a systematic way like we did above, it is customary to adjust the p-values before we make any statements about where we reject the null-hypothesis and claim we have a significant differential abundance.

The idea is simply that since the p-values are probabilities, and we draw a threshold at some probability value, there is always some probability of false ‘findings’. If we use the ordinary threshold of 0.05 (5%) then, even if the null-hypothesis is actually true (no differential abundance) for all the OTUs, we will on average find that for 5% of them the data just happen to look such that the test ends with a p-value below 0.05. If we have many OTUs, then 5% means we may end up with quite many such false cases. To protect us against this, we should be stricter than normal, but how strict?

It is customary to adjust the p-values and keep the same threshold (e.g. 0.05). The adjustment may be done in several ways, but we will not dig into this here. Instead, we opt for the adjustment called controlling the false discovery rate. The idea behind this adjustment may be understood as follows: If you use the raw p-values and the significant threshold at 5%, you must expect to get 5% false positives among all tests (OTUs). If you use the same threshold on the adjusted q-values, you must expect 5% false positives among those you claim to be positives. Clearly the latter gives fewer false positives.

This means we replace all p-values with the adjusted version, often referred to as q-values, and then say the OTUs we look for are those where the q-value is below 0.05 (or whatever threshold we choose). The q-values are always at least as large as the p-values, and typically fewer OTUs will come out with a q-value below 0.05 than a p-value below 0.05. Thus, we are stricter.

In R we have a function p.adjust() that has implemented a range of methods for adjusting the p-values. To use it you

  • First compute all p-values.
  • Then use all p-values as input to p.adjust().
  • Specify the method option, e.g. method="fdr" to use the false discovery rate adjustment.
  • The function give you the corresponding q-values as output.

Please note that this procedure will only adjust the p-values you have. It is no guarantee against false findings. If the p-values are rubbish from the start, this adjustment will not change that fact!

3.3.1 Exercise - adjust Wilcoxon p-values

Compute the q-values from the p-values we got above, and store these in a new column q_value in the same table as we made above. It is also a good idea to sort the table by this q_value column.

How many OTUs are differential abundant between categories overweight and morbidobese if you use the 5% threshold on the computed q-values?

3.3.2 Exercise solution

Add to the previous script:

wilcox.tbl <- wilcox.tbl %>%
  mutate(q_value = p.adjust(p_value, method = "fdr")) %>%
  arrange(q_value)
cat("We find", sum(wilcox.tbl$q_value < 0.05), "OTUs have a differential abundance")
## We find 61 OTUs have a differential abundance

3.3.3 Exercise - differing abundance

If an OTU has a small enough q-value we can claim it has a differential abundance between overweight and morbidobese. But, under which category is it more/less abundant? We should also keep this information in our table. Add two new columns where you add the average relative abundance for each of these two groups for each OTU.

3.3.4 Exercise solution

This may be solved in many ways, e.g. by pivot_longer() and group_by(). Here is an alternative way:

overweight.cols <- which(sample.tbl$bmi_category == "overweight")
morbidobese.cols <- which(sample.tbl$bmi_category == "morbidobese")
wilcox.tbl <- wilcox.tbl %>%
  mutate(overweight_mean = rowMeans(otu.mat[,overweight.cols])) %>%
  mutate(morbidobese_mean = rowMeans(otu.mat[,morbidobese.cols]))

You may also want to mutate in a column difference = overweight_mean - morbidobese_mean so that its signs quickly tells you in which category it is more abundant.



3.4 A word on relevance

When we go hunting for differential abundance OTUs, we may easily find that many OTUs come out with very small q-values, but if we look at their abundances they are sometime very low abundant in both groups. We actually detect the difference between ‘very low’ and ‘super low’.

How interesting are these OTUs really? This is something we always have to ask ourselves. An organism with very low abundance will in most cases not be very important for the environment as a whole, but there are exceptions, and you need to dig more into this.

My point here is that finding a significant differential abundant OTU is a necessary but not a sufficient criterion for reporting it as interesting. For this reason, it is not uncommon to also have a threshold on the average abundance as well, and only report differential abundances for OTUs with abundance above this threshold.



3.5 The DEseq2 approach

Testing if OTUs have a differential abundance between two conditions is very similar to testing if genes have differential expression between two groups of samples. Gene expression data is not something we talk about here in BIN310, since BIN315 is about this topic. We can just say that gene expression is typically measured by sequencing of the mRNA, what is known as RNAseq data. These reads are then mapped to the known genes of the organism(s) and counted, and we get a read count table very similar to what we have for metabarcoding data. Instead of read counts for each OTU we have read counts for each gene, and this is arranged in a table similar to our OTU table. Just replace OTUs with genes, and the data structure is the same. The higher/lower gene expression is then similar to a higher/lower OTU abundance.

There are some commonly used tools for testing if genes are differential expressed. We will not dig into these methods, this belongs more to a course in gene expression data analysis (BIN315). But we will take a look at how we can use one of these tools to also test for differential abundance in metabarcoding data, more specifically the R package DESeq2. The people behind the phyloseq package also has a suggested procedure for doing this testing that we follow here now.

When using the DESeq2 approach it is essential that you use non-normalized raw data as input since the procedure actually implicitly makes its own type of normalization. Without going into the details, we can say that here read count data are treated as data following a negative binomial distribution, and the model is fitted accordingly. A negative binomial is similar to a Poisson distribution, but with larger variance. In a Poisson the variance equals the expectation, they are one and the same thing, but in the negative binomial the variance is a parameter independent of the expectation just like in a normal distribution. Negative binomials are commonly used for count data when the data variation is too large to be explained by a Poisson.

We again use only the subset of samples from the two categories we used above:

# Assume we have gut.phy from above
gut_sub.phy <- subset_samples(gut.phy, bmi_category %in% c("overweight", "morbidobese"))

Next, we convert our phyloseq object to a DESeq2 object:

library(DESeq2)
gut_sub.dsq <- phyloseq_to_deseq2(gut_sub.phy, ~ bmi_category)

Notice that already here we specify the design of the linear model to be fitted, i.e. we specify that the grouping we are interested in is described by the bmi_category variable in our sample_data.

Next, we need some geometric means in order to estimate some size factors. This is actually the \(RLE\) normalization that we saw in module 11!

gm_mean <- function(x, na.rm = TRUE){                # local function
  exp(sum(log(x[x > 0]), na.rm = na.rm) / length(x)) # for computing geometric mean
}
gmeans <- apply(counts(gut_sub.dsq), 1, gm_mean)       # counts(gut_sub.dsq) is the read count matrix
gut_sub.dsq.norm <- estimateSizeFactors(gut_sub.dsq, geoMeans = gmeans)  # the RLE normalization

The last step is the actual fitting of the linear model, and computations of all results

gut_sub.dsq.norm.fit <- DESeq(gut_sub.dsq.norm, fitType="local")

The results() function will extract some results from the complex gut_sub.dsq.norm.fit object. It will return a list, that we immediately convert into a table:

deseq.tbl <- results(gut_sub.dsq.norm.fit, pAdjustMethod = "fdr") %>% 
  as.data.frame()

This table lists all OTUs, and each column is some result from the DESeq2 procedure. The last two columns are raw p-values and adjusted p-values. Notice we used the same p-value adjustment here as we did earlier(the fdr). The second column is the log fold change or log-ratio. If this is (large) positive, it means this OTU is more abundant in morbidobese compared to overweight. If it is (large) negative, it is opposite. Note the ordering of the categories! If you did the first exercise in this module the bmi_category should a factor where the level overweight is listed before morbidobese. If you did not do that exercise, the ordering will be opposite, since then DESeq2 will simply use the alphabetical ordering! The DESeq2 will always compare all other levels against first factor level. By explicitly converting the grouping variable into a factor yourself, you control which group is the ‘reference-group’ (first level). We will see this again below, with the ANCOMBC approach.

3.5.1 Exercise - compare to Wilcoxon test

We want to make a plot of the Wilcoxon q-values for each OTU against the DESeq2 q-values, and color this according to if it is below 0.05 for both or one of them. We also make the marker size depending on the average abundance for each OTU. In this way we can how the two methods agree/disagree.

First, from the res.tbl above, select the padj column, and add a new column OTU containing the row names of res.tbl.Then join to the wilcox.tbl we made earlier. Join by the "OTU" column.

Filter the rows (OTUs) to keep only those where at least one of the methods indicate differential abundance, i.e. either padj or q_value, or both, is below 0.05.

Add a new column significance indicating if both methods agree on significance, if only Wilcoxon or only DESeq2 claims this. The if_else() function is nice for doing this.

Add also a column abundance with the average of the abundances. For each OTU. We already computed this for overweight and morbidobese separately, now you only need to average these two.

Finally, plot padj versus q_value as points, color by significance and let the size be mapped to the abundance. Here is a code skeleton:

both.tbl <- deseq.tbl %>% 
  select(___) %>% 
  mutate(OTU = ___) %>% 
  right_join(wilcox.tbl, by = ___) %>% 
  filter(padj < ___ | ___ < 0.05) %>% 
  mutate(significant = "both") %>% 
  mutate(significant = if_else(___ > 0.05, "DESeq2", significant)) %>% 
  mutate(significant = if_else(___ > 0.05, "Wilcoxon", significant)) %>%
  mutate(significant = factor(significant, levels = c("Wilcoxon", "both", "DESeq2"))) %>% 
  mutate(abundance = (___ + overweight_mean) / 2)
fig <- ggplot(___) +
  geom_point(aes(x = ___, y = ___, color = ___, size = ___)) +
  geom_hline(aes(yintercept = 0.05)) + geom_vline(aes(xintercept = 0.05)) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Wilcoxon q-value", y = "DESeq2 q-value", size = "Mean relative\nabundance")
print(fig)

To what extent do these two approaches agree on what is a differential abundant OTU?

3.5.2 Exercise solution

both.tbl <- deseq.tbl %>%
  select(padj) %>%
  mutate(OTU = rownames(deseq.tbl)) %>%
  right_join(wilcox.tbl, by = "OTU") %>%
  filter(padj < 0.05 | q_value < 0.05) %>%
  mutate(significant = "both") %>%
  mutate(significant = if_else(q_value > 0.05, "DESeq2", significant)) %>%
  mutate(significant = if_else(padj > 0.05, "Wilcoxon", significant)) %>%
  mutate(significant = factor(significant, levels = c("Wilcoxon", "both", "DESeq2"))) %>%
  mutate(abundance = (morbidobese_mean + overweight_mean) / 2)
fig <- ggplot(both.tbl) +
  geom_point(aes(x = q_value, y = padj, color = significant, size = abundance)) +
  geom_hline(aes(yintercept = 0.05)) + geom_vline(aes(xintercept = 0.05)) +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Wilcoxon q-value", y = "DESeq2 q-value", size = "Mean relative\nabundance")
print(fig)

If we consider the smallest q-values from both methods, these OTUs are claimed differential abundant in both cases. In addition to these, the Wilcoxon test gave more small q-values than the DESeq2 approach here.



3.6 The ANCOMBC

The R package ANCOMBC is one out of several R packages purpose made for metabarcoding data analysis, and with emphasis on differential abundance testing. If you were to pick only one tool for this job, this is a good candidate.

Let us first use the tool to see how we can get some differential abundance testing results, and we use the exact same data as we did above. Extend the R script we have from above with:

library(ANCOMBC)
# Assume we have gut.phy from above
gut_sub.phy <- subset_samples(gut.phy, bmi_category %in% c("overweight", "morbidobese"))
ancom.obj = ancombc2(data = gut_sub.phy,
                     fix_formula = "bmi_category",
                     p_adj_method = "fdr",
                     verbose = TRUE)
ancom.tbl <- ancom.obj$res

First, notice the function we actually use is named ancombc2(). Here we used a minimum of input, there are many input arguments to this function you may explore. The fix_formula is needed to indicate what you want to test. By naming the column bmi_category we state we want to fit a linear model with the factor (2 levels) in this column as explanatory variable. This is the same as testing if the abundance is different between these two levels (morbidobese and overweight). The p_adj_method is simply the method in p.adjust() to use for adjusting the p-values, like we saw above. Here we use the "fdr" to make it identical to what we did earlier. The verbose you can turn on/off (TRUE/FALSE) to get some printout during computations.The computations take some time.

Let us inspect the table with results:

View(ancom.tbl)

What we would be interested in here are the columns

  • lfc_bmi_categoryoverweight
  • q_bmi_categoryoverweight

The lfc is short for log fold-change, while the q value is the adjusted p-value. What is meant by bmi_categoryoverweight? It is the name of the fix_formula column with one factor level appended. The column bmi_category is a text column (unless you did the very first exercise in this module). Then, ancombc2 will convert it to a factor, with the two levels ("morbidobese" and "overweight") sorted alphabetically, i.e. with morbidobese as the first level. The first factor level is always used as a reference level by ancombc2, which is standard behavior for many functions in R, and we saw it also for the DESeq2 above. Thus, the q-values under q_bmi_categoryoverweight are for testing if the abundance changes when going from the reference level (=morbidobese) to the level overweight. lfc_bmi_categoryoverweight lists how much the abundance changes when going from the reference level (=morbidobese) to overweight. It is a log-transformed fold-change, i.e. you divide the overweight abundance by the reference (=morbidobese) abundance, and then take the logarithm. There are no columns for morbidobese since all information is in comparing against the reference.

NB!. If you did the first exercise in this module, you specifically changed the bmi_category to a factor, and also listed the factor levels. Then overweight will come before morbidobese and instead of alphabetical ordering these levels are now used. Then, the reference becomes overweight, and you get output columns like lfc_bmi_categorymorbidobese instead. The q-values will be the same, but the log-fold change value will have opposite signs.

3.6.1 Exercise - understanding log-fold change

In the results above it seems like the lfc_bmi_categoryoverweight for Collinsella is 1.02. Does this mean the abundance for Collinsella is larger or smaller in morbidobese compared to overweight persons? How much larger/smaller? Use the information about reference level and the number 1.02 to answer this.

Then, make a small R code to actually plot the raw data, and see if this agrees with the results.

3.6.2 Exercise solution

Let \(A_o\) be the expected Collinsella abundance for overweight persons, and \(A_m\) similar for morbidobese. Since the latter is the reference here, we get that the log fold-change is \[lfc = \log\left(\frac{A_o}{A_m}\right) = 1.02\]

By using the exponential function on both sides and rearranging the equation we get \[A_o = e^{1.02}\cdot A_m = 2.77\cdot A_m\] Thus, we can see that \(A_o\) is larger than \(A_m\) and we can state Collinsella is almost three times more abundant in overweight persons compared to morbidobese persons.

otu.tbl <- otu_table(gut_sub.phy) %>%
  as.matrix() %>%
  t() %>%
  as.data.frame()
collinsella.tbl <- otu.tbl %>%
  mutate(sample_id = rownames(otu.tbl)) %>%
  select(sample_id, Collinsella)
sample.tbl <- sample_data(gut_sub.phy) %>%
  as.matrix() %>%
  as.data.frame() %>%
  full_join(collinsella.tbl, by = "sample_id")
ggplot(sample.tbl) +
  geom_boxplot(aes(x = bmi_category, y = Collinsella))

Yes, it looks like the Collinsella abundances are in general larger for overweight compared to morbidobese persons. The actual number are of course here heavily influenced by library sizes, so we cannot from the raw data figure alone say how much more abundant. For this we need some estimates, like we get from ancombc2.

3.6.3 Exercise - compare to previous results

If you again use the adjusted p-value threshold at 0.05, which taxa come out as differential abundant according to ANCOMBC? Compare this to what we saw for the Wilcoxon test and DESeq2 from above.

3.6.4 Exercise solution

Here we choose to display the results inn a Venn diagram, and use the R package ggvenn:

all.tbl <- wilcox.tbl %>%
  select(OTU, wilcox_q = q_value)
all.tbl <- deseq.tbl %>%
  select(deseq_q = padj) %>%
  mutate(OTU = rownames(deseq.tbl)) %>%
  right_join(all.tbl, by = "OTU")
all.tbl <- ancom.tbl %>%
  select(ancom_q = q_bmi_categorymorbidobese) %>%
  mutate(OTU = rownames(ancom.tbl)) %>%
  right_join(all.tbl, by = "OTU")

library(ggvenn)  # install from github
DA.lst <- list(Wilcoxon = all.tbl$OTU[all.tbl$wilcox_q < 0.05],
               DESeq2 = all.tbl$OTU[all.tbl$deseq_q < 0.05],
               ANCOMBC = all.tbl$OTU[all.tbl$ancom_q < 0.05])
ggvenn(DA.lst)

We observe that for 33 OTUs all three methods agree these are differential abundant (DA). As many as 22 OTUs are only found DA by the Wilcoxon test, indicating perhaps this has a tendency to produce slightly too many false positives? If we only compare DESeq2 to ANCOMBC we find they agree on 37, but 7 are unique to DEseq2 while 4 are unique to ANCOMBC. If we require that at least 2 out of 3 methods must agree, then we have 33+4+4+2 = 43 cases of DA OTUs here.



3.7 Understanding ANCOMBC

Let us dig a little into how the ANCOMBC method works. A comprehensive understanding of this requires much more statistics than what is the prerequisits for this course, and we will therefore only try to establish a rather shallow overview.

There are two scientific papers describing the methods used: https://pubmed.ncbi.nlm.nih.gov/29163406/ and https://pubmed.ncbi.nlm.nih.gov/32665548/. The latter is the main paper describing the method we will use below, but it builds partly on the first paper.

3.7.1 The zeros

In module 11 we also talked briefly about this, here is a recap and extension from the first paper above.

What is important in the first paper is the treatment of zeros in the data. Metabarcoding data often contain many zeros, i.e. some OTU has 0 read counts in some sample. Such zeros arise for three distinct reasons, and for differential abundance testing should be treated different depending on this:

  • Structural zero. This means the OTU is actually absent in the sample. No matter how deep we sequence, it will never appear in the data. These are the ‘real zeros’, i.e a zero read count is the correct value.
  • Sampling zeros. This means the OTU is actually present in the data, but we have failed to detect it. Think of it as if we have not sequenced deep enough, had we sequenced deeper we would have seen it. In these cases the correct value is not zero, but rather some small positive value indicating these OTUs are actually present in small amounts.
  • Outlier zeros. These have become zero because of some ‘technical’ error. In this case the zeros should be replaced by NA indicating we have no reading at all here.

Why is this important? Well, for good reason we would like to log-transform the read counts during the analysis. But, then we cannot have any zeros in our data, since the logarithm of zero is not defined (‘minus infinity’).

Imagine our samples are grouped into two sample groups (e.g. overweight and morbidobese from above), i.e. we have \(g=2\) sample groups. The fundamental assumption of the grouping and testing is that all samples in the same sample group come from the same underlying, ‘true’, composition, and any variation between them is random noise. We expect the (normalized) abundances within the group to be similar, but not identical. Consider the data for OTU number \(i\), i.e. the corresponding row in the OTU table. Then we have read counts from the first sample group and some for the second.

An outlier zero for some OTU is defined as a zero that occurs in a few samples within a sample group when the other values for the same OTU, and in the same sample group, are not nearly zero at all. If an outlier zero is present, that observation is simply discarded from the data set, and we ‘loose’ one sample for OTU number \(i\). We do not discard entire samples, this is done for each OTU separately.

A structural zero for OTU number \(i\) means all read counts within a sample group are actually zero. This is detected when most values within the sample group are zero, but a few are not. It is in some sense the opposite of the outlier zeros. In this case the testing becomes really simple: If both sample groups have structural zeros for an OTU, then there is no differential abundance, simply because there is no abundance for that OTU! If one group is structurally zero and the other is not, then we have a differential abundance, simply because the OTU is present in one group but absent in the other. Thus, such OTUs can now be concluded upon right away, and are they are taken out of the further analysis.

The sampling zeros are those we are left with who are neither outlier nor structural zeros. For these we impute a small positive value, and since we are dealing with read counts, the value 1 is natural. Thus, we add one extra read to all read counts, i.e. we ‘donate’ one extra read to all read counts, assuming those who are now zero would have got 1 read had we sequence a little deeper. This of course introduces a small bias in our data, but we have bigger biases to deal with than this, so we consider this a reasonable choice.

Once this procedure is done, we can now log-transform all read count values, resulting in data with a distribution which is closer to the normal distribution or other well known distributions we have a machinery for in statistical theory.

3.7.2 Compositional data

The name, ANCOMBC, is an acronym for “ANalysis of COMpositional data with Bias Correction”. We mentioned compositional data briefly in module 11, related to the \(CLR\) transform.

By compositional data we refer to data where the relative composition (between OTUs) is informative, while the total sum in each sample is not. Such type of data occur in many fields of science, and here we see an example. If you sum all read counts for a sample, you find this library size varies a lot between samples. This has nothing to do with how many bacteria of various taxa there are in the samples from the start. First, we extract DNA from the environment, and this is not equally effective each time we do it. Second, we do PCR amplifications. The number of PCR-cycles may also vary, and a sample with few bacteria may end up with many amplicons and vice versa depending on the PCR step. Finally, the sequencing technology will also partly determine how many reads we get. Thus, there is no information left in the total read count that can tell us anything about how many bacteria we started out with. If the latter is important, you must do some additional efforts (e.g. qPCR). All the information we get from metabarcoding data lies in the relative abundances, i.e. how much we have of one bacteria relative to the others within the same sample.

In the last paper linked to above, there is a figure (Figure 2) that may give you the impression that with ANCOMBC you can actually estimate the real abundance of the bacteria (microbial load) in the environment. I would say this figure is a bad illustration. What they in the paper name a sample fraction is not what is shown in that figure. The sample fraction estimated and given as output by ANCOMBC is not a fraction (value between 0 and 1) at all, as suggested in that figure, but a sample scaling with values around 1.0, some larger and some smaller than 1.0. Instead, think of this sample scaling as a way to explain the observed differences in library size.

3.7.3 Bias correction

The BC in the ANCOMBC is short for bias correction. What is this bias? In order to understand this, we need to dive into the models behind ANCOMBC. The stuff I now present has been taken from the second paper linked to above, but I use a slightly different notation.

Consider OTU number \(i\) out of a total of \(m\) OTUs, i.e. \(i=1,...,m\). Consider its read counts in sample group \(j=1,2\) i.e. we have two sample groups (1 and 2 instead of A and B). This can easily be extended to more than two sample groups, but for simplicity we stick to this now. Within sample group \(j\) we have \(n_j\) samples, i.e. sample \(jk\) is sample number \(k=1,...n_j\) inside sample group \(j\). For the read count data we may then write \[E(Y_{ijk}) = \gamma_{jk}\theta_{ij}\]

where \(\theta_{ij}\) is the expected abundance of OTU \(i\) in sample group \(j\) and \(\gamma_{jk}\) is some constant. The \(\theta_{ij}\) is the true underlying abundance of OTU number \(i\) in group \(j\), but this is unknown to us. Notice that

  • The sample scaling \(\gamma_{jk}\) is a constant that is specific to each sample. You can think of as a constant that takes as input the expected abundance of the OTU and gives us the read count for this OTU in sample \(jk\). It is of course unknown, and will in reality depend on things like DNA isolation affinity, PCR amplification and sequencing technology. Think of it as some ‘effect’ that causes the differences in library sizes between samples.
  • The abundance \(\theta_{ij}\) is the expected abundance for OTU \(i\) in sample group \(j\). Remember that we assume this is the same for all samples within the same sample group, and the only reason we observe different read counts is partly due to the sample scaling \(\gamma_{jk}\) and also some random noise.

After log-transform we may write this as the basic ANCOMBC model equation: \[y_{ijk} = \delta_{jk} + \mu_{ij} + e_{ijk}\] where \(y_{ijk} = \log(Y_{ijk})\) is the logarithm of the read counts (log-counts). You may think of \(\delta_{jk}=\log(\gamma_{jk})\) as the log-sample-scaling (LSS) and \(\mu_{ij} = \log(\theta_{ij})\) as log-abundance (even if this is not 100% correct).

The error term \(e_{ijk}\) is the random noise that creates variation in the observed data in addition to what can be explained by the other terms on the right hand side. As always, \(E(e_{ijk}) = 0\) so this term only contributes with variance, no change in expected values (bias). It is assumed to have the same variance regardless of OTU (index \(i\)) or samples (indexes \(jk\)).

The expected log-counts are therefore described by \[E(y_{ijk}) = \delta_{jk} + \mu_{ij}\] Note that the \(\mu_{ij}\) are the quantities we are interested in, since these are the true underlying log-abundances for OTU \(i\) in sample group \(j\). Note also that this is the same for all samples in sample group \(j\), but the LSS, \(\delta_{jk}\), may add/subtract to the observed log-counts for each sample \(k\).

Let us now consider some averages. Let \[\bar{y}_{ij.}=\frac{1}{n_j}\sum_{k=1}^{n_j}y_{ijk}\] be the average log-count for OTU \(i\) in sample group \(j\). Notice how we use the bar to indicate an average, and the dot indicates which index we average over. Since the \(y_{ijk}\) are our observed data, we can always compute such averages. From the model equation above it follows that \[E(\bar{y}_{ij.}) = \bar{\delta}_{j.} + \mu_{ij}\] Notice how the right hand side contains the \(\mu_{ij}\) we are interested in, plus the LSS averaged over the samples in sample group \(j\).

Similar, we define \(\bar{y}_{.jk}\) as the average log-count in sample \(jk\), averaged over all OTUs. Then, \[E(\bar{y}_{.jk}) = \delta_{jk} + \bar{\mu}_{.j}\] where the last term is the average abundance, averaged over all OTUs. From the last equation we see that \[\delta_{jk} = E(\bar{y}_{.jk}) - \bar{\mu}_{.j}\] and a least-square estimator for the LSS \(\delta_{jk}\) is \[\hat{\delta}_{jk} = \bar{y}_{.jk} - \bar{y}_{.j.}\] where we simply replace the expected value for \(\bar{y}_{.jk}\) by its actually observed value, and the average log-abundance by the average log-count, averaged over all OTUs and samples inside sample group \(j\).

Next, recall from the model equation above that \(E(y_{ijk}) = \delta_{jk} + \mu_{ij}\), i.e. if we just rearrange \(\mu_{ij} = E(y_{ijk}) - \delta_{jk}\). From this, it seems natural to estimate \(\mu_{ij}\) by \[\hat{\mu}_{ij} = \frac{1}{n_j}\sum_{k=1}^{n_j}(y_{ijk} - \hat{\delta}_{jk})=\frac{1}{n_j}\sum_{k=1}^{n_j}y_{ijk} - \frac{1}{n_j}\sum_{k=1}^{n_j}\hat{\delta}_{jk} = \bar{y}_{ij.}-\bar{\hat{\delta}}_{j.}\] If we then plug in for the estimator \(\hat{\delta}_{jk}\) from above we get \[\hat{\mu}_{ij} = \bar{y}_{ij.} - \bar{\hat{\delta}}_{j.} = \bar{y}_{ij.} - \frac{1}{n_j}\sum_{k=1}^{n_j}(\bar{y}_{.jk}-\bar{y}_{.j.}) = \bar{y}_{ij.} - \frac{1}{n_j}\sum_{k=1}^{n_j}\bar{y}_{.jk} + \frac{1}{n_j}\sum_{k=1}^{n_j}\bar{y}_{.j.} = \bar{y}_{ij.} - \bar{y}_{.j.} + \bar{y}_{.j.} = \bar{y}_{ij.}\] Thus, we estimate the true log-abundance for OTU \(i\) simply by averaging the observed log-counts over all samples in sample group \(j\).

If this was an unbiased estimator, we should now have that \(E(\hat{\mu}_{ij}) = \mu_{ij}\). This is not the case. It is instead \(E(\hat{\mu}_{ij}) = \mu_{ij} - \bar{\delta}_{j.}\). This means that when we specify the null hypothesis for OTU \(i\) \[H_0: \mu_{i1} = \mu_{i2}\] we cannot use the difference in estimates \(\hat{\mu}_{i1} - \hat{\mu}_{i2}\) directly as an indicator if this is true or not because its expected value is \(E(\hat{\mu}_{i1} - \hat{\mu}_{i2}) = \mu_{i1} - \mu_{i2} + \bar{\delta}_{i2} - \bar{\delta}_{i1}\), and the extra term \(\bar{\delta}_{i2} - \bar{\delta}_{i1}\) means that the test statistic in general become larger than it should, and we tend to reject \(H_0\) when we should in fact not (false positives, claiming we have differential abundance when in fact we have not). The rest of the method description is very much about how to estimate and correct for this bias by refining the least squares estimates above, but it is beyond our scope to dig into this.

This bias-correction will in general mean that we get fewer significant OTUs compared to methods ignoring this, like the Wilcoxon test we used directly on \(TSS\) normalized data. In the second paper mentioned above you can read more about this, and see some comparisons to other methods, but it is difficult to trust such comparisons unless they are done by some neutral third party. It is better to try to understand the models and algorithms, and make up your mind if this seams reasonable.

3.7.4 Exercise - the CLR-transformed data in the ANCOMBC setting

In module 11 we saw the \(CLR\) transform for normalizing metabarcoding data. If \(Y_{ijk}\) is the read count for OTU \(i\) in sample \(jk\) as above, let \(z_{ijk}\) be the CLR-transformed data.

Write first how this can be expressed as the log-counts \(y_{ijk}\) and some of its averages.

What is the average CLR-value inside sample \(jk\), i.e. averaged over all OTUs?

Discuss these results in the light of the bias from the ANCOMBC model.

3.7.5 Exercise solution

To \(CLR\) transform we would first need to compute the geometric mean for sample \(jk\): \[g_{jk} = (Y_{1jk}\cdot Y_{2jk}\cdots Y_{mjk})^{ 1/m}\]

Then we would compute our transformed observations as: \[z_{ijk} = \log\left( \frac{Y_{ijk}}{g_{jk}}\right)\]

If we use the rules for logarithms, we get \[z_{ijk} = \log(Y_{ijk}) + \log(g_{jk}) = y_{ijk} - \frac{1}{m}\sum_{i=1}^{m}\log(Y_{ijk}) = y_{ijk} - \bar{y}_{.jk}\] Thus, the \(CLR\) transformed value \(z_{ijk}\) is the log-count \(y_{ijk}\) minus the average log-count for the sample \(jk\), averaged over all OTUs.

The average \(CLR\) value inside sample \(jk\) is then \[\bar{z}_{.jk} = \frac{1}{m}\sum_{i=1}^{m}z_{ijk} = \frac{1}{m}\sum_{i=1}^{m}y_{ijk} - \frac{1}{m}\sum_{i=1}^{m}\bar{y}_{.jk} = \bar{y}_{.jk} - \bar{y}_{.jk} = 0\] Which is why it is called centered, the average \(CLR\)-value in a sample is always zero.

Does this cope with the bias that is the focus of ANCOMBC? The reason for this bias are the LSS \(\delta_{jk}\). From above we see that \[z_{ijk} = y_{ijk} - \bar{y}_{.jk}\] Thus, the \(CLR\) values are the log-counts minus ‘something’ that we may interpret as this bias. From further up we saw that in the ANCOMBC setting the estimate for the LSS was \[\hat{\delta}_{jk} = \bar{y}_{.jk} - \bar{y}_{.j.}\] Thus, if \(\hat{\delta}_{jk} = \bar{y}_{.jk}\) then what we subtract in the \(CLR\) would be the (estimate of) the ANCOMBC bias.

Note that if we do the \(CLR\) transform, we have no grouping of samples, each sample is treated individually. In ANCOMBC we have defined the sample groups, and thereby can also make use of this information, resulting in a slightly different normalization.



3.8 More complex comparisons

In all examples above we compared two groups of samples. In real life we may be interested in more complex comparisons, like

  • Compare more than 2 groups, i.e. group by a factor with 3 or more levels.
  • Combine factors, with 2 or more levels.
  • Add numerical variables in addition to factors.

In the case of the Wilcoxon test, to cope with this we need to replace it with the Kruskal-Wallis test, which is built on a the same idea of using ranks rather than actual observed values. For both the DESeq2and the ANCOMBC these extensions are also possible. We will focus on the ANCOMBC here now.

We can start with extending to a factor with more than 2 levels. Let us simply re-run by using the bmi_category as the grouping variable, but use all data so that this is a factor with 5 levels (this takes some time):

# Assume we have gut.phy from above
ancom_multi.obj = ancombc2(data = gut.phy,
                     fix_formula = "bmi_category",
                     p_adj_method = "fdr",
                     verbose = TRUE)
ancom_multi.tbl <- ancom_multi.obj$res

The resulting table is quite similar to what we saw previously, but has many more columns. Since bmi_category contains 5 different texts, it is converted into a factor with 5 levels. These are ordered alphabetically unless you already converted to a factor yourself, and gave the levels a different ordering. With alphabetical sorting the level lean becomes level 1, and hence the reference level. All the other 4 groups are now compared against this, i.e. the q-values in the column bmi_categoryoverweight now indicates if the abundance in the overweight group is significantly different from in the lean group. This means the ordering of the groups become more important! The first factor level is always treated as the reference, and all others are compared against this. Thus, what we did in the first exercise in this module now becomes important, since this is how you decide what is the reference factor level.

How about if we still wanted to test if there is a difference between overweight and morbidobese? In fact, all the other pairwise comparisons in addition to those where we only compare against lean? We can get some results for these by re-running the analysis, requesting pairwise testing. We will try this, but use only a subset of 3 factor levels (lean, overweight and morbidobese) since these computations take a lot more time:

# Assume we have gut.phy from above
gut_sub3.phy <- subset_samples(gut.phy, bmi_category %in% c("lean", "overweight", "morbidobese"))
ancom_pair.obj = ancombc2(data = gut_sub3.phy,
                     fix_formula = "bmi_category",
                     p_adj_method = "fdr",
                     group = "bmi_category",         # define groups specifically here
                     pairwise = TRUE,                # we want pairwise tests
                     verbose = TRUE)
ancom_pair_res.tbl <- ancom_pair.obj$res
ancom_pair_res_pair.tbl <- ancom_pair.obj$res_pair

Notice how we now need to specify the grouping by the group argument. The fix_formula may be more complex, and this is why we need to specify the grouping specifically here. Also, we set pairwise = TRUE.

If we look into the results table as before (in ancom_pair.obj$res) this is no different, we test against the reference level lean. The extra table in ancom_pair.obj$res_pair contains the results for the additional comparisons as well. In this case the columns with bmi_categoryoverweight_bmi_categorymorbidobese will have the results for comparing groups overweight and morbidobese, none of them being the reference group.

How about a more complex model where we include more than one factor? This is also straightforward in ancombc2. Let us include also the sex column to see if some OTUs differ between male and female as well:

ancom_2fac.obj = ancombc2(data = gut_sub3.phy,
                     fix_formula = "sex+bmi_category",
                     p_adj_method = "fdr",
                     verbose = TRUE)
ancom_2fac_res.tbl <- ancom_2fac.obj$res

Notice how we now changed the fix_formula input, adding two factor variables. This means we look for effects of both sex and bmi_category on the abundances for each OTU. If you are familiar with Analysis-of-variance (ANOVA) this is exactly the same. We specify a linear model where we have main effects of sex and bmi_category.

The results we get are of the same type as we have seen before, but now we test for differential abundance both between female and male as well as between lean and the the other two bmi_categorys. The results will in general differ from if you did this in two separate analyses, first using sex as grouping variable and then bmi_overweight. The reason is that the residuals will in general become smaller now, and you fit the data better than if you did two separate analyses. You can imagine that if we only grouped by the sex variable, then all variation which is due to the three bmi_catgory levels would be seen as ‘random noise’. By including bmi_category in the model as well, the ‘random noise’ is now what is left over when both effects of sex and bmi_category has been subtracted. Less random noise makes the picture clearer, and we get closer to seeing the actual differences in abundances.

You may again also request pairwise testing, but you can only group the data according to one variable. In this case it must be the bmi_category variable, since the sex only 2 groups, and no additional pairs can be tested.

You may also add numerical variables, e.g. the Age column in the current data. You may also add random effects in the model, but since this is all beyond the prerequisites for this course we will not study this here. Also, remember that by adding more and more effects the result you get become more and more difficult to interpret. There is no point in fitting a model that you do not understand.





4 Differential composition

Instead of focusing on the separate OTU or taxa, we may instead be interested in testing if there is any significant difference in the composition between groups of samples. For each sample we have a vector (row) of data indicating the (relative) abundance of the various OTUs in each sample. This is the composition or profile of the microbial community.

When we have data for several samples from the same treatment/condition/site these will typically be somewhat similar to each other, but not identical. The expected composition for that particular treatment/condition/site we may think of as an average of all the composition profiles we could have collected. Imagine we collected infinite many samples from some condition A, and then computed this expected composition. The composition for each sample are then variants of this, due to any kind of ‘noise’ that may be due to the biology itself or our way of sampling/sequencing/processing of the data. If we also did the same for some condition B, we could compare the expected profiles directly, and see if they are identical or not. Any tiny little difference would be enough to state they are not identical.

In reality we do not have these expected profiles, since we have only some, not infinite many, samples. We still want to test if there is a difference in the expected composition between the treatments/conditions/sites, but now based only on the data we have. This means we have to resort to some hypothesis testing and probabilistic statements again.

As always we must be aware of the null-hypothesis we are testing. If you report a p-value, and you do not know what you are actually testing, it becomes meaningless. Assume we have two conditions, A and B. Let \(\bf{\mu}_A\) and \(\bf{\mu}_B\) be the expected compositions for both treatments, respectively. Think of them both as vectors, the ‘average’ samples for the two conditions. Then our null-hypothesis is: \[ H_0: \bf \mu_A = \mu_B\]

The alternative, \(H_1\) is then they are not identical, i.e. any (tiny) difference between them. If this \(H_0\) is true, we expect the composition profiles that we actually observe from condition A and B are quite similar to each other. They need not be identical, we always see some variation in composition profiles even within the same condition. Thus, we need to compare the composition profiles from both conditions and some clever way in order to shed some light on the null-hypothesis above.

4.1 The PERMANOVA and the adonis2 test

The Analysis of Variance (ANOVA) you should have heard of in introductory statistics. In ANOVA we typically have a single response variable of observed data, and then we have one or more factors, i.e. categorical variable with two or more factor levels. These factors are typically something we have designed in some experimental design. A classical example is where we grow crops under various types of fertilizers. Then the response is the amount of crops we harvest in each case and the factor is the variable indicating which fertilizer type (each type is a new category) was used in each case. We have then done repeated experiments, having a number of samples for each category. In ANOVA we then compare variation in response between categories to those within categories, and from this we compute p-values.

The huge difference to our setting is that we do not have a single ‘response’, but rather entire vectors of many responses. If we were interested in the single OTUs, we are back to the differential abundance testing above. So how do we compare entire profiles?

Well, we saw some distance metrics, or beta-diversity measures, in module 11. They do exactly this. Using one of these, we get a distance matrix \(D\) with one row and column for each sample. If the samples from group A are distinctly different from the samples from group B, then we should observe that

  • The distance between samples inside group A or inside B are fairly small…
  • …compared to the distances between samples from A and B

In the test below we compute the between-group squared distances and compare this to the within-group squared distances. This is exactly the same idea as in an ANOVA F-test, except that there we use sum-of-squares instead of squared distances. But a sum-of-squares may be thought of as a squared euclidean distance. In the F-test we assume normal distributed data, and then the sum-of-square F-statistic become F-distributed. Since we now allow any distance metric, and the data may be far from normal, we have no idea of the true distribution of the statistic we compute. We may still compute such test-statistics, but in order to convert it to some p-value, we need to put this statistic into some probability density.

To resolve this the PERMANOVA instead uses a permutation test, which is also the reason for the name of this procedure (permutation based ANOVA). Here is the main idea:

If the null-hypothesis is true, and there is no real difference between group A and B, then samples may be re-assigned randomly (shuffled) to any group, and the between-versus-within statistics should come out more or less the same as before. This is the permutation, we simply re-assign our samples randomly to groups A and B. Some of the samples who actually belong to group A may now be in group B and vice versa. The number of samples in each group remains the same. After each re-assignment we re-compute the between-group and within-group distances, and repeat this many times. This forms a distribution, think of a histogram of these values. If the statistic with the original grouping becomes (much) different from the permuted ones, we conclude group A and B actually have significantly different compositions.

There is one pitfall in this test: If group A (or B) has huge within-group-distances compared to B (or A), then this test may also give us a small p-value, even if the ‘average’ composition is the same. We should therefore always also consider the within-group-distances separately for all groups, to verify these are not wildly different from each other.

This PERMANOVA procedure is implemented in the function adonis2() in the vegan package. The vegan package should already be installed, since it is a dependency for the phyloseq, but if not, install it fram CRAN.

Let us try this to test if we have a significant difference in composition between the overweight and morbidobese samples in our data.

First, we compute the Bray-Curtis distances for the data from above:

library(vegan)
# Assume we have gut.phy from above
gut_sub.phy <- subset_samples(gut.phy, bmi_category %in% c("overweight", "morbidobese"))
d.bc <- phyloseq::distance(gut_sub.phy, method = "bray")

Any distance metric may be used here, our choice of the Bray-Curtis is rather random.

Digression: Note the syntax phyloseq::distance() above. Functions named distance() is found in several R packages, and when we loaded the DESeq2 package above, the phyloseq version of this function was replaced by another. By writing phyloseq::distance we specify that we now want the function distance() from the phyloseq package, and nothing else. You may sometimes need to specify this in R code, if you load many packages.

Next, we need the sample_data from our phyloseq object, this is for the grouping information only (the bmi_category column).

sample.tbl <- sample_data(gut_sub.phy) %>% 
  as.matrix() %>% 
  as.data.frame()

Then we can perform the test, using the function adonis2() in the vegan package:

tst <- adonis2(d.bc ~ sample.tbl$bmi_category)
print(tst)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = d.bc ~ sample.tbl$bmi_category)
##                          Df SumOfSqs      R2      F Pr(>F)    
## sample.tbl$bmi_category   1    1.489 0.04049 9.0306  0.001 ***
## Residual                214   35.296 0.95951                  
## Total                   215   36.786 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let us have a brief look at the results in the output table. Below I use the notation found in the original article describing the PERMANOVA test, see https://onlinelibrary.wiley.com/doi/10.1111/j.1442-9993.2001.01070.pp.x.

First, we compute the total sum of squared distances, i.e. we use all distances. This we denote \(SS_T\), and in the output above this has the value 36.786 (last value under SumOfSqs). The within squared distances, \(SS_W\), is 35.296 (second value under SumOfSqs). This results in the between group sum of squared distances is \(SS_A = SS_T - SS_W\), which is 1.489 here (first value under SumOfSqs). Does this mean the spread is larger within than between? No, we have to divide by the degrees of freedom as well. Since we have 2 groups, the number of parameters is \(a=2\). We also have \(N=216\) observations (samples). The degrees of freedom are listed under Df in the output. This means we get the test statistic \[F = \frac{SS_A / (a-1)}{SS_W / (N-a)} = \frac{1.489/1}{35.296/214} = 9.0306\] We call the test statistic \(F\) even if it is in general not F-distributed. It is still e measure of how large the between-variation (numerator) is compared to the within-variation (denominator). To get a p-value, we need to know how big the F-values will usually get when \(H_0\) is true. To unravel this we now make 999 permutations to get 999 new F-values, and together with our original one, we now have 1000 such F-values. The p-value is then simply the fraction of these who are at least as large as the original one at 9.0306. Here the p-value (under Pr(>F)) becomes 0.001, which means the only one among the 1000 F-values who was at least as large as the original was the original itself. None of the permuted values were larger. This is the smallest p-value we can get with 999 permutations. You may try more permutations if you like, just to see if any permuted F-value could ever become larger than the original at 9.0306.

From the small p-value we may conclude that the groups differ in composition.

However, before we conclude, we should inspect the within-group-distances, since we know that if the groups differ much in variation, this may effect the outcome of this test. The vegan package also has a function for computing the spread (dispersion) within each group, the betadisper():

within.disp <- betadisper(d.bc, sample.tbl$bmi_category)
print(within.disp)
## 
##  Homogeneity of multivariate dispersions
## 
## Call: betadisper(d = d.bc, group = sample.tbl$bmi_category)
## 
## No. of Positive Eigenvalues: 94
## No. of Negative Eigenvalues: 118
## 
## Average distance to median:
## morbidobese  overweight 
##      0.4202      0.3882 
## 
## Eigenvalues for PCoA axes:
## (Showing 8 of 212 eigenvalues)
## PCoA1 PCoA2 PCoA3 PCoA4 PCoA5 PCoA6 PCoA7 PCoA8 
## 8.387 5.600 4.177 3.568 2.978 1.741 1.204 1.054

We notice the ‘Average distance to median’ is quite similar for both groups. To actually test if this is a significant difference we may use

tst.w <- permutest(within.disp)
print(tst.w)
## 
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
## 
## Response: Distances
##            Df  Sum Sq  Mean Sq      F N.Perm Pr(>F)
## Groups      1 0.02016 0.020164 1.8431    999  0.165
## Residuals 214 2.34110 0.010940

And we notice the p-value is large. Groups overweight and morbidobese have comparable within-group-distances, and the reason for the small p-value in the adonis2() test must be due to a significant difference in composition between these groups.



4.2 More complex models

Again we may want to extend our analysis to comparing more than 2 groups and/or combine several factor variables. This is straightforward, by extending the formula we use to specify the model in the adonis2() function.

4.2.1 Exercise - test for compositional changes between all bmi_categorys and sex

Include all data from gut.phy, and compute the bray-Curtis distances in this case. Then, run the adonis2() to test for difference in composition for both sex and bmi_category. As for ANCOMBC you simply add the two factor variables in the formula.

There is no pairwise testing option here. Discuss what it means if the variable bmi_category (having 5 levels or groups) comes out as significant (small p-value)?

4.2.2 Exercise solution

d5.bc <- phyloseq::distance(gut.phy, method = "bray")
sample.tbl <- sample_data(gut.phy) %>%
  as.matrix() %>%
  as.data.frame()
tst <- adonis2(d5.bc ~ sample.tbl$bmi_category + sample.tbl$sex)
print(tst)
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
## 
## adonis2(formula = d5.bc ~ sample.tbl$bmi_category + sample.tbl$sex)
##                           Df SumOfSqs      R2      F Pr(>F)    
## sample.tbl$bmi_category    5    5.454 0.03157 6.7456  0.001 ***
## sample.tbl$sex             1    0.907 0.00525 5.6108  0.001 ***
## Residual                1029  166.395 0.96318                  
## Total                   1035  172.756 1.00000                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A small p-value for a multi-category variable simply means at least one of the groups have a composition differing from at least one other group. It says nothing about which group differ from which. To get this information, you need to test them against each other by sub-setting the data and do the pairwise analysis we started out with.