1 Learning goals

  • Learn how to make use of the phyloseq R package for microbiota data analysis
  • Be able to calculate different diversity measures from microbiota data (alpha and beta diversity), understand the differences between them and what information they can provide.
  • Be able to visualize the data in informative plots, such as stacked barplots of abundance and standard ordination plots, and interpret these in light of the problem.
  • Understand some basic normalizations of microbiota data





2 The phyloseq R package

In the previous module we saw how we can get from a set of reads from many samples of amplicon data to a read count matrix with accompanying data tables. In this module we will have a look at how we can get meaningful and interpretable information out of such read count data.

In this module we will look at some diversity measures for the data, and also at how we can visualize these in a way that gives us some interesting information about the problem. Throughout the module we will look at the results from both dada2 and vsearch. Both should basically reflect the same “world”; if they differ, this is method dependent and not population dependent. Which is more correct? We leave the question open for reflection.

The analysis of read count table data is common, and this has resulted in a set of ‘standard’ approaches. This has also spawned software solutions (pipelines) where the exact same set of results/figures is computed each time. There is always a balance here, between making science reproducible and making it mundane. We will go through one very commonly used package, but emphasize that we only cover a small part of what can be done with count data, and that it is the aim of the study, not a pipeline, that should determine what to include the next time you do this.

We will have a look at the phyloseq package in R. This is also a software solution for ‘standardizing’ the analysis of such data, but since it is within the R language universe it is possible, and sometimes necessary, to go beyond the tools provided by this package in order to obtain the desired results. Keep in mind that learning how to ‘speak the R language’ is far more important than simply learning some standard use of this package (or any other package).

In the next module we will run some analyses on the same data to substantiate what we found this week. It involves statistics at a fairly advanced level, and could very well have been a separate course on its own. We will only focus on the most generic parts of it.

2.1 Installing the phyloseq package

First, we need to install the phyloseq package. We can find specifications for installation on the Bioconductor homepage Phyloseq. From this page we see that the following code will install Phyloseq. Remember, do this only once! It can take some time so be patient:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("phyloseq")

NOTE: When asked if you want to update packages, please answer NONE (n)! Otherwise, R will attempt to re-install some packages in versions that these servers for some reason cannot install, and you get some hassle.

2.2 Creating the phyloseq object

From Module 10 we have several files from both dada2 and vsearch. For both we have:

Three tables called readcounts.txt, sample_table.txt and taxonomy.txt, where the read count and taxonomy files have either dada2 or vsearch as prefix (the sample table is shared). We need to convert them to “fit” into a phyloseq object. The phyloseq package has a range of useful functions designed to operate specifically on such phyloseq objects. To begin with we need to convert these three tables into specific formats with row and column names.

  • A readcount.dta must be a numeric matrix with one row for each sequence cluster (OTU or ASV) and one column for each sample. Row names and column names are the identifiers for clusters and samples, respectively.
  • A sample.dta must be a data frame with one row for each sample. The row names must contain the sample identifiers (sample_id), linking each row to a column of readcount.dta.
  • A taxon.dta must be a character matrix with one row for each cluster. The row names must match the cluster identifiers (row names) of readcount.dta.

Thus, by converting our txt files to either a matrix or a data frame, we may use some of the pre-made functions in the phyloseq package. Let us construct a phyloseq object from the dada2 output from the last module:

library(tidyverse)
library(phyloseq)

### We load results from module 10

# OTU/ASV table
readcount <-  read_delim("/mnt/courses/BIN310/module11/dada2_readcounts.txt") # Reading the txt file.

### Convert readcount to a numeric matrix, moving the first column (ASV identifiers) into row names
readcount.dta <- data.matrix(readcount[,-1])
rownames(readcount.dta) <- readcount$ASV


# Sample table
sampletable <- read_delim("/mnt/courses/BIN310/module11/sample_table.txt")


### Move the sample_id column into the row names and keep the rest as a data frame
sample.dta <- data.frame(sampletable[,-1], row.names=sampletable$sample_id)

# Taxonomy table
taxonomy <- read_delim("/mnt/courses/BIN310/module11/dada2_taxonomy.txt")

### Convert taxonomy to a matrix containing only the taxonomy columns, with Header as row names:
taxonomy %>% 
  select(-Header, -Sequence) %>% 
  as.matrix() -> taxon.dta
rownames(taxon.dta) <- taxonomy$Header 

# Now, lets create the phyloseq object
physeq.obj <- phyloseq(otu_table(readcount.dta, taxa_are_rows = T),
                       sample_data(sample.dta),
                       tax_table(taxon.dta))
print(physeq.obj)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 33 taxa and 15 samples ]
## sample_data() Sample Data:       [ 15 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 33 taxa by 7 taxonomic ranks ]

If something went wrong, here are some details to check:

In the phyloseq terminology the read count matrix is called an otu_table, and we use the function otu_table() to create this. This must be a numeric matrix (not a table!) with read counts, and the name otu_table is at best confusing. Note that we have to indicate if taxa are in the rows or in the columns, i.e. both configurations are allowed.

Next, the table with sample data, and the function for creating it, is named sample_data in phyloseq. This must be defined as a data frame, and our sampletable can be used, with one modification. The sample identifiers we have in the sample_id column must be row names of this table. We therefore create sample.dta which has correct row names and is defined as a data.frame.

Finally the tax_table is also a matrix(!) and must have the cluster identifiers in the row names. It should only contain the taxonomy, not the sequences that we have in our taxonomy table.

There is one more element we may add to our physeq.obj. For some analyses we need a phylogenetic tree based on the cluster sequences. This must be an object of the type phylo from the ape package. We may come back to this in the next module.
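Below is a minimal sketch of how such a tree could be added, assuming you have a Newick file (the file name asv_tree.nwk is hypothetical) whose tip labels match the taxa names in physeq.obj:

library(ape)   # provides read.tree() and the phylo class

tree <- read.tree("asv_tree.nwk")                      # hypothetical Newick file
physeq.obj.tree <- merge_phyloseq(physeq.obj, tree)    # adds the tree as a new component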

We have now managed to get our data into a phyloseq object and can use the functions within this package. Try writing help(package = "phyloseq") in the Console to get an overview. You are welcome to play around with these functions. Let us start by plotting and summarizing the data.



2.3 Simple plots

Barplots are a nice way to visualize community data, and of course phyloseq has a function for this:

p.bar <- plot_bar(physeq.obj, fill = "Genus")
print(p.bar)

Notice that the bars have different heights, since this is based on read counts, not relative abundances. The returned ggplot object may be manipulated further by adding more layers:

p.bar2 <- p.bar + facet_grid(~Phylum)
print(p.bar2)



2.4 Subsetting and summarizing the data

There are some phyloseq functions for subsetting and summarizing the data.

Here is how we can sum the total read count for each OTU:

total.abundance <- taxa_sums(physeq.obj)

It is simply rowSums() of the otu_table (which is a matrix…).
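We can verify this ourselves by coercing the otu_table to an ordinary matrix and summing the rows:

readcount.check <- rowSums(as(otu_table(physeq.obj), "matrix"))
all.equal(total.abundance, readcount.check)   # should be TRUE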

We can subset the data into another phyloseq object. Here we use prune_taxa() to select only a subset of the OTUs. We need the identifiers for the OTUs we want to keep. In this case we keep the 10 most abundant ones:

most.abundant.10 <- names(sort(total.abundance, decreasing = T))[1:10]
physeq.obj.top10 <- prune_taxa(most.abundant.10, physeq.obj)

print(physeq.obj.top10)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 10 taxa and 15 samples ]
## sample_data() Sample Data:       [ 15 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 10 taxa by 7 taxonomic ranks ]

prune_taxa() simply keeps the OTUs given as its argument, in our case the most abundant ones. There is also a subset_taxa() function in phyloseq, doing exactly what it says: subsetting the data based on taxonomic input:

physeq.obj.bacteroides <- subset_taxa(physeq.obj, Genus == "Bacteroides")

There is also a prune_samples() for subsetting the samples instead of the OTUs. Note that all parts of the data structure (physeq.obj) are affected by this subsetting. It should be noted that if you first run a psmelt() (see further down in the document) and then use the standard filter() function, you could do similar subsettings, as sketched below. However, the phyloseq data structure is needed for some analyses, and the subsetting of the melted table is only really useful for plotting.
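Here is a small sketch of what that melt-and-filter approach could look like (psmelt() is introduced properly further down); note that the result is an ordinary data frame, not a phyloseq object:

physeq.obj %>% 
  psmelt() %>%                           # long table with one row per OTU/sample combination
  filter(Genus == "Bacteroides") -> bacteroides.tbl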

The agglomeration of read counts means we merge and sum read counts from several OTUs. The function tax_glom() may be used to merge all taxa in the same clade to a single taxon, e.g. merge all genera under the same family. Their read counts are then summed:

physeq.obj.family <- tax_glom(physeq.obj, taxrank = "Family")

Inspect the otu_table and tax_table for this new object. You need to name a rank in the taxonomy where you want the agglomeration.

Agglomerations are useful. Sometimes we want to make our analyses at a higher taxonomic rank than the lowest (Genus), and then use tax_glom(). Also, if you merge several data sets (see merge_phyloseq()), the OTUs’ centroid sequences will in general differ slightly, even if they actually represent the same taxon. This has to do with the clustering method used. Then, by merging very close OTUs, using tip_glom() and a small value for h, you may get a more relevant set of OTUs for the community.



2.5 More on phyloseq

We only briefly touch upon all possibilities in phyloseq here. Search the net for more information. Here are a couple of places to start:



2.6 Exercises

2.6.1 Exercise - phyloseq object for vsearch

Use the code above and the files from vsearch from last module to make a new phyloseq object of the vsearch results, call it physeq.objV. You find the files in the same folder as the dada2 files we read above.

Examine the two different phyloseq objects, what is different and what is similar?

2.6.2 Exercise solution

### We load results from module 10, but now with vsearch instead of dada2

# OTU/ASV table
readcount <- read_delim("/mnt/courses/BIN310/module11/vsearch_readcounts.txt") %>% 
  rename(OTU = `#OTU ID`)

### Convert readcount to a numeric matrix, moving the first column (OTU identifiers) into row names
readcount.dta <- data.matrix(readcount[,-1])
rownames(readcount.dta) <- readcount$OTU


# Sample table - is the same as for dada2 and we use this directly

# Taxonomy table
taxonomy <- read_delim("/mnt/courses/BIN310/module11/vsearch_taxonomy.txt")

### Convert taxonomy to a matrix containing only the taxonomy columns, with the OTU identifiers as row names:
taxonomy %>%
  select(-Header, -Sequence) %>%
  as.matrix() -> taxon.dta
rownames(taxon.dta) <- readcount$OTU #readcount column one has the right names

# Now, lets create the phyloseq object
physeq.objV <- phyloseq(otu_table(readcount.dta, taxa_are_rows = T),
                       sample_data(sample.dta),
                       tax_table(taxon.dta))
print(physeq.objV)
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 25 taxa and 15 samples ]
## sample_data() Sample Data:       [ 15 samples by 4 sample variables ]
## tax_table()   Taxonomy Table:    [ 25 taxa by 7 taxonomic ranks ]

2.6.3 Exercise - subset and plot for vsearch

Make barplots for the phyloseq object from vsearch, physeq.objV, first for all genera and then only for the five most abundant OTUs.

2.6.4 Exercise solution

p.barV <- plot_bar(physeq.objV, fill = "genus") #In the tax.file genus is now written with lower case "g"

total.abundance <- taxa_sums(physeq.objV) #calculate abundance

most.abundant.5 <- names(sort(total.abundance, decreasing = T))[1:5] #use the 5 top abundant OTU
physeq.obj.top5 <- prune_taxa(most.abundant.5, physeq.objV) #prune the phyloseq object

p.barV5 <- plot_bar(physeq.obj.top5, fill = "genus") #In the tax.file genus is now written with lower case "g"
print(p.barV5)

2.6.5 Exercise - several plots in the same figure

You can have multiple figures side by side in R Markdown by specifying fig.show="hold" along with the out.width chunk option. Set out.width="50%" and plot the two barplots (not pruned, and without splitting on phylum) side by side for easier comparison. In plain RStudio, ggplot objects like these can be arranged in one figure with e.g. the patchwork or gridExtra packages (par() only applies to base graphics).

Compare the two plots, and discuss with a fellow student (or with yourself) the differences and why the results are not identical.

2.6.6 Exercise solution

p.bar + ggtitle("dada2")
p.barV + ggtitle("vsearch")





3 Analysis of data

We need to keep in mind that the picture we look at here is only a sample of the true microbial community. We are unable, both physically and technically, to obtain all taxa and their composition in an environment; that would only be the case if we grew the bacteria in a closed environment in the laboratory. We sample from a bigger environment and then use these samples to “guess something” about the true environment. Hence, the samples are not of particular interest in themselves, except that they (hopefully) reflect the composition of the environment sampled. This is equivalent to the theory from any statistics course.

We will now look at some measures that can be used for analysis of the data. These measures are usually used to reflect the environment the samples come from and to draw conclusions, but keep in mind that it is only a sample, depending on the sampling itself, the technology that sequenced it and all the pre-processing steps we have done to get here.

In this module we will consider two diversity measures: I. Alpha diversity - diversity within a sample, and II. Beta diversity - distance between samples.

3.1 Alpha diversity - diversity within a sample

The alpha diversity of a sample indicates how large the diversity is within that sample. A large diversity means the sample contains many different taxa; conversely, a small diversity means the sample is dominated by a very few taxa. There is a range of statistics we may compute to indicate diversity. Here we plot three of them for the dada2 phyloseq object:

p.div <- plot_richness(physeq.obj, color = "condition", measures = c("Observed","Shannon","Simpson"))
## Warning in estimate_richness(physeq, split = TRUE, measures = measures): The data you have provided does not have
## any singletons. This is highly suspicious. Results of richness
## estimates (for example) are probably unreliable, or wrong, if you have already
## trimmed low-abundance taxa from the data.
## 
## We recommended that you find the un-trimmed data and retry.
print(p.div)

  • Observed is simply the number of distinct taxa (OTUs/ASVs) observed in a sample.
  • The Shannon index reflects how many different taxa there are and how evenly they are distributed within a sample.
  • The Simpson index measures the relative abundance of taxa and gives more weight to common or dominant taxa.

The methods implemented here are from the vegan R package, and give a warning if we have no singletons. A singleton is a case where a taxon is observed once, i.e. 1 read. The vegan package was never designed for read count data, but for ecological studies where the taxa are species, and every ‘read’ is one observation of that species (walk in the forest and observe species). In such tables the number of observations is in general low, with singletons being common. Have a look at the Help for the function diversity() from the vegan package for more details.
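If you want the numbers behind such a plot, you may compute the same statistics yourself with functions from vegan. Here is a sketch, assuming the vegan package is available (it is installed together with phyloseq):

library(vegan)
count.mat <- t(as(otu_table(physeq.obj), "matrix"))    # vegan wants samples in rows
shannon <- diversity(count.mat, index = "shannon")
simpson <- diversity(count.mat, index = "simpson")
observed <- rowSums(count.mat > 0)                     # number of OTUs/ASVs seen in each sample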

All the plots generated by the phyloseq package are based on ggplot and are returned as such objects. This means we may manipulate them further using functions from the ggplot2 package.

More on alpha-diversity here:



3.2 Beta-diversity - distance between samples

In ecology a beta-diversity reflects the difference in composition between communities. We may also think of it as a distance, even if not all beta-diversity statistics are strictly metrics in the mathematical sense. How different is one sample from another? This is the type of question we try to shed some light on by computing beta diversity statistics.

Consider two samples from our read count matrix, \(Y_{*j}\) and \(Y_{*k}\).

3.2.1 Minkowski distances

The Manhattan distance between two compositions is as follows:

\[Mnt(j,k) = \sum_{i=1}^n |Y_{i,j} - Y_{i,k}|\] but slightly more common is the euclidean distance

\[Ecd(j,k) = \sqrt{\sum_{i=1}^n |Y_{i,j} - Y_{i,k}|^2}\] The only difference is the squaring and the square root. The Manhattan distance is the sum of absolute differences along each coordinate, and has its name from the street grid of Manhattan in New York City, while the euclidean distance is what we think of as distance in daily life (the length of the straight line between two points).

Illustration:



Both are actually special cases of the more general Minkowski distance: \[Mnk(j,k) = (\sum_{i=1}^n |Y_{i,j} - Y_{i,k}|^q)^{1/q}\] By choosing \(q=1\) you get the Manhattan distance, and with \(q=2\) the euclidean distance. All the Minkowski variants result in distances that are somewhat similar to each other, and we typically stick to the euclidean representative, unless we have some good reason not to.
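These distances are easy to compute in base R with dist(), which expects the objects (here samples) in the rows. A small sketch on our raw read counts:

Y <- t(as(otu_table(physeq.obj), "matrix"))             # samples in rows
manhattan.dist <- dist(Y, method = "manhattan")
euclidean.dist <- dist(Y, method = "euclidean")
minkowski.dist <- dist(Y, method = "minkowski", p = 3)  # q = 3 in the formula above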

If you have CLR transformed data, the euclidean distance is a natural choice of distance between samples.

3.2.2 Other distances

There is a long list of distance or distance-like statistics that are used for read count data.

One example is the UniFrac distances, but in order to compute these you need a phylogenetic tree describing how all OTUs are related, which we have not included in this module.

The Bray-Curtis is not really a distance, since it does not fulfill all mathematical requirements, but it is still used a lot as a measure of dissimilarity. Here is the formula: \[Bcu(j,k) = \frac{\sum_{i=1}^n |Y_{i,j} - Y_{i,k}|}{\sum_{i=1}^n (Y_{i,j} + Y_{i,k})}\]

Notice the numerator is the Manhattan distance we saw above. Here we divide it by the sum of all abundances in the two samples. This means the Bray-Curtis gives values between \(0.0\) (the samples are identical) and \(1.0\) (the samples consist of completely different OTUs). Notice that since we sum the differences in abundances, it is a good idea to normalize the data prior to a Bray-Curtis computation, see Normalizations for more input on this.
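As a small check of the formula, here is the Bray-Curtis value between the two first samples, computed both directly and with vegdist() from the vegan package:

Y <- as(otu_table(physeq.obj), "matrix")
y1 <- Y[, 1]
y2 <- Y[, 2]
bc.manual <- sum(abs(y1 - y2)) / sum(y1 + y2)                            # the formula above
bc.vegan <- as.numeric(vegan::vegdist(rbind(y1, y2), method = "bray"))   # should agree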

3.2.3 Exercise - distances in phyloseq

The phyloseq package has a wrapper function for computing a range of distance matrices between all pairs of samples. It typically works like this:

dist.obj <- distance(<phyloseq object>, method = <"name of method">)

The returned dist.obj is a standard distance object in R that stores only the lower triangle of the distance matrix. We may convert it into a distance matrix by as.matrix(dist.obj).

Compute the euclidean distance and the manhattan distance between samples both for dada2 and vsearch, using method = "euclidean" and method = "manhattan" with physeq.obj and physeq.objV (non-normalized data), respectively. Store in different dist-objects. Then make a hierarchical clustering based on the computed distance object, and plot, like this:

plot(hclust(<dist.obj>), hang = -1)

We expect the samples from the three conditions (A, B and C) to form distinct groups. Do they?

3.2.4 Exercise solution

Ecd <- distance(physeq.obj, method = "euclidean")
#plot(hclust(Ecd), hang = -1)

EcdV <- distance(physeq.objV, method = "euclidean")
#plot(hclust(EcdV), hang = -1)

Md <- distance(physeq.obj, method = "manhattan")
#plot(hclust(Md), hang = -1)

MdV <- distance(physeq.objV, method = "manhattan")
#plot(hclust(MdV), hang = -1)



3.3 Ordination

The distances from above are most commonly used when doing some ordination of the data. This means plotting the samples as points in a two-dimensional space (x-y-diagram), where we get some information about how similar or different they are compared to each other. Think of each sample as a vector of coordinates in some high-dimensional space, with one axis (coordinate) for each OTU. By ordination we try to ‘compress’ this into a low (=2) dimensional space, while still preserving as much of the information as possible. Such plots will in general not show all relations between the samples, but hopefully enough for us to see some groupings or gradients. Ordination is by nature exploratory.

You may be familiar with Principal Component Analysis (PCA) from elsewhere. This is one example of an ordination. If we want to make a PCA-plot of our samples, we would take the transposed (samples in rows, OTUs in columns) otu_table and feed this to prcomp(). Then we would plot the samples as dots along the two first principal axes. If you have CLR normalized data (see the Normalizations section below), a PCA is a natural way to do ordination of the data. Let us now look at some alternatives to PCA.
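Here is a sketch of such a PCA on the raw read counts, just to show the mechanics (for a proper analysis you would typically CLR transform first):

Y <- t(as(otu_table(physeq.obj), "matrix"))   # samples in rows, OTUs in columns
pca <- prcomp(Y)

as_tibble(pca$x, rownames = "sample_id") %>% 
  left_join(as_tibble(sample.dta, rownames = "sample_id"), by = "sample_id") %>% 
  ggplot() +
  geom_point(aes(x = PC1, y = PC2, color = condition), size = 3)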

3.3.1 Multidimensional scaling

For read count table data, the Principal Coordinate Analysis (PCoA) is popular. PCoA is the same as (classical) multidimensional scaling (MDS). What is this? A major difference to PCA is this:

  • A PCA is computed from the data table itself, samples in rows and OTUs (variables) in columns.
  • A PCoA is computed from a distance matrix, that has been computed from the data table.

Imagine we have measured the flight distances between the cities Oslo, New York, Durban and Sydney. Here is the distance matrix (numbers in km):

cn <- c("Oslo", "New York", "Durban", "Sydney")
D <- matrix(c(0, 5914, 10153, 15950,
              5914, 0, 13308, 15989,
              10153, 13308, 0, 10546,
              15950, 15989, 10546, 0), nrow = 4, byrow = T)
colnames(D) <- cn
rownames(D) <- cn
print(D)
##           Oslo New York Durban Sydney
## Oslo         0     5914  10153  15950
## New York  5914        0  13308  15989
## Durban   10153    13308      0  10546
## Sydney   15950    15989  10546      0

Unless you belong to the flat-earth society, you realize these are distances measured along the outside of a sphere. Still, is it possible to display, in a two-dimensional plane, how these cities are located relative to each other? Thus, we want to

  • Place the cities into a flat two-dimensional (x-y-axis) diagram.
  • Locate them such that the (euclidean) distances that we actually see are as close as possible to the real distances between all pairs.

We realize it may not be possible to re-construct all distances in this flat diagram, but we should try to get as close as possible. This is exactly what MDS is all about. Here is some R code for doing this in the classical way:

X <- cmdscale(D)
as_tibble(X, rownames = "City") %>% 
  ggplot() +
  geom_text(aes(x = V1, y = V2, label = City))
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
## `.name_repair` is omitted as of tibble 2.0.0.
## ℹ Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

You may find the cities have been placed a little strangely compared to how we usually think about their locations. The axes have no meaning in terms of north-south or east-west; it is only the relative locations of the cities we can re-construct. How well are the distances actually re-constructed here? Here is the answer:

D.hat <- as.matrix(round(dist(X)))
print(D.hat)
##           Oslo New York Durban Sydney
## Oslo         0     5284   9849  15916
## New York  5284        0  13307  15907
## Durban    9849    13307      0  10449
## Sydney   15916    15907  10449      0

3.3.2 MDS in phyloseq

Let us make a PCoA plot of our raw data, using the Bray-Curtis dissimilarity. First we compute the distances and then the ordination object:

bc <- distance(physeq.obj, method = "bray")
ord.obj <- ordinate(physeq.obj, method = "PCoA", distance = bc)

The first plot we make should always be a scree plot. This indicates how much of the total distance information we may capture along axes 1, 2, … up to the number of samples. We typically plot only the first two axes, but it is very important to get some information about how much we do not see when only looking through this ‘window’. Here is the result:

plot_scree(ord.obj)

Most of the information lies in the first two axes, but not everything. The steeper this scree plot is, the more information we capture by plotting in two dimensions.

Let us plot the ordination and look for groupings:

plot_ordination(physeq.obj, ord.obj, color = "condition")

From this plot we conclude that, when measured by Bray-Curtis, the samples largely group into three distinct groups by "condition", with some exceptions for conditions A and B, telling us there is something unexplained that is not condition. This means the groups of samples have somewhat similar compositions within a condition, and very different compositions between the conditions.

3.3.3 NMDS

A version of MDS that is popular for read count data is the Non-metric MDS (NMDS). What is the difference to the metric MDS?

Here is a short video on YouTube that explains some major ideas of NMDS

An important concept for any MDS is the stress. When finding the new coordinates of our samples in the x-y-plane, (N)MDS does so by minimizing this stress. We start out with the original \(p\times p\) distance matrix \(D\), where \(p\) is the number of samples. Each sample should become a point in our new x-y-diagram, and we start by placing them randomly. Let \(X\) be the \(p\times 2\) dimensional matrix of these coordinates. Then we can compute the euclidean distance matrix \(\hat{D}\) from these locations of the samples. We now want to shuffle the points around (change values in \(X\)) such that \(\hat{D}\) becomes as similar as possible to \(D\). What is meant by ‘as similar as possible’? This is where the stress comes in.

In the metric MDS a stress function is typical \[Stress(X) = \frac{\sum_{i=1}^p \sum_{j=1}^p (d_{ij} - \hat{d}_{ij})^2}{\sum_{i=1}^p \sum_{j=1}^p d_{ij}^2}\]

where \(d_{ij}\) is the cell in row \(i\) and column \(j\) of \(D\), and similarly for \(\hat{D}\). The stress is simply the sum of squared differences in distances, divided by the sum of squared distances.
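To make this concrete, here is a small sketch computing this stress for the city example above, comparing the original D to the euclidean distances re-computed from the cmdscale() coordinates X:

stress.metric <- function(D, D.hat){
  sum((D - D.hat)^2) / sum(D^2)
}
stress.metric(D, as.matrix(dist(X)))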

In the non-metric version this stress-function is changed to \[Stress(X) = \frac{\sum_{i=1}^p \sum_{j=1}^p (d_{ij} - f(\hat{d}_{ij}))^2}{\sum_{i=1}^p \sum_{j=1}^p d_{ij}^2}\] where the function \(f(\hat{d}_{ij})\) is a monotonic mapping of the real distances against the observed ones. In practice this means we place more emphasis on the ordering of the distances, i.e. those distances that are the smallest (largest) in the original \(D\) should also be the smallest (largest) in \(\hat{D}\). If this is the case we are happy, regardless of whether some of the (extreme) distances deviate a lot from each other. An NMDS will typically be less sensitive to the extreme distances, in both directions.

Once we have found our best \(\hat{D}\) we compute the final stress. This should ideally be zero, and the larger it gets, the poorer we managed to approximate the original \(D\) with our \(\hat{D}\). A rough rule of thumb says that if this stress is above \(0.20\) we cannot trust the ordination plot to show us the true relations between the samples.

Let us make the NMDS version of the PCoA we did above:

set.seed(3)  # set seed for random generator
nmds.obj <- ordinate(physeq.obj, method = "NMDS", distance = bc)
## Run 0 stress 0.05569319 
## Run 1 stress 0.05569296 
## ... New best solution
## ... Procrustes: rmse 0.00016019  max resid 0.0003688632 
## ... Similar to previous best
## Run 2 stress 0.06893738 
## Run 3 stress 0.09056592 
## Run 4 stress 0.06893739 
## Run 5 stress 0.05569303 
## ... Procrustes: rmse 0.0007301191  max resid 0.001689554 
## ... Similar to previous best
## Run 6 stress 0.09056588 
## Run 7 stress 0.09056609 
## Run 8 stress 0.1239057 
## Run 9 stress 0.1391017 
## Run 10 stress 0.06070222 
## Run 11 stress 0.05569287 
## ... New best solution
## ... Procrustes: rmse 0.0005064185  max resid 0.001171525 
## ... Similar to previous best
## Run 12 stress 0.05569327 
## ... Procrustes: rmse 0.0003912291  max resid 0.0009035739 
## ... Similar to previous best
## Run 13 stress 0.05569326 
## ... Procrustes: rmse 0.0003843763  max resid 0.0008882769 
## ... Similar to previous best
## Run 14 stress 0.06893739 
## Run 15 stress 0.055693 
## ... Procrustes: rmse 0.0005406105  max resid 0.001250971 
## ... Similar to previous best
## Run 16 stress 0.09056591 
## Run 17 stress 0.05569297 
## ... Procrustes: rmse 0.000166026  max resid 0.0003817002 
## ... Similar to previous best
## Run 18 stress 0.09056597 
## Run 19 stress 0.0556929 
## ... Procrustes: rmse 0.0004404844  max resid 0.001019052 
## ... Similar to previous best
## Run 20 stress 0.05569297 
## ... Procrustes: rmse 0.0001658132  max resid 0.0003794088 
## ... Similar to previous best
## *** Best solution repeated 7 times

Notice from the output that there is a search for the best positioning of the samples in the new x-y-space. After some searching, the algorithm should hopefully converge on a solution. The NMDS fitting does not always converge, which is often a bad sign. It may, however, also struggle simply because the stress becomes very low, which just means the data are very ‘clean’. When we make the plot:

pp_nmds <- plot_ordination(physeq.obj, nmds.obj, color = "condition",
                      title = str_c("NMDS, stress=", format(nmds.obj$stress, digits = 3)))
pp_nmds + geom_point(alpha = 0.2, size = 10)

…it paints much the same picture as above, only “flipped”. Condition C clearly groups together, with low within-condition variation, while conditions A and B both have some observations with somewhat different compositions. Look over the results we have seen earlier in the module; are you able to recognize which samples these might be? The stress is low here, which is not surprising given the scree plot we made before. A lack of convergence is something you should always look out for, even if most real-life data sets will not have the problem of too low stress as we saw here (more commonly the opposite).

Remember to always indicate the stress-value in an NMDS plot, since this piece of information is needed to know how reliable the locations are.

3.3.4 Exercise - Diversities - comparison of plots

Use the steps above, to calculate beta-diversities for the vsearch results with Bray-Curtis dissimilarity, for comparison with the dada2 results.

3.3.5 Exercise solution

#Beta-diversities:
bc <- distance(physeq.objV, method = "bray")

#PCoA
ord.obj <- ordinate(physeq.objV, method = "PCoA", distance = bc)

scree <- plot_scree(ord.obj) #Screeplot
print(scree)

pcoa <- plot_ordination(physeq.objV, ord.obj, color = "condition")
print(pcoa)

#NMDS
set.seed(3)  # set seed for random generator
nmds.obj <- ordinate(physeq.objV, method = "NMDS", distance = bc)
## Run 0 stress 0.03824316 
## Run 1 stress 0.08732245 
## Run 2 stress 0.03824314 
## ... New best solution
## ... Procrustes: rmse 4.217962e-05  max resid 0.0001087576 
## ... Similar to previous best
## Run 3 stress 0.03824316 
## ... Procrustes: rmse 5.503178e-05  max resid 0.0001424725 
## ... Similar to previous best
## Run 4 stress 0.03824314 
## ... Procrustes: rmse 5.628404e-06  max resid 1.073182e-05 
## ... Similar to previous best
## Run 5 stress 0.09977718 
## Run 6 stress 0.03824314 
## ... Procrustes: rmse 1.886513e-06  max resid 4.007864e-06 
## ... Similar to previous best
## Run 7 stress 0.03824314 
## ... Procrustes: rmse 4.197161e-06  max resid 1.068885e-05 
## ... Similar to previous best
## Run 8 stress 0.03824315 
## ... Procrustes: rmse 2.505667e-05  max resid 6.42225e-05 
## ... Similar to previous best
## Run 9 stress 0.03824314 
## ... Procrustes: rmse 1.859144e-05  max resid 4.763153e-05 
## ... Similar to previous best
## Run 10 stress 0.03824314 
## ... Procrustes: rmse 7.073933e-06  max resid 1.826221e-05 
## ... Similar to previous best
## Run 11 stress 0.03824315 
## ... Procrustes: rmse 2.010389e-05  max resid 5.062851e-05 
## ... Similar to previous best
## Run 12 stress 0.03824316 
## ... Procrustes: rmse 4.663333e-05  max resid 0.0001197091 
## ... Similar to previous best
## Run 13 stress 0.03824314 
## ... Procrustes: rmse 8.496855e-06  max resid 2.149958e-05 
## ... Similar to previous best
## Run 14 stress 0.03824314 
## ... Procrustes: rmse 1.701685e-05  max resid 4.378684e-05 
## ... Similar to previous best
## Run 15 stress 0.03824314 
## ... Procrustes: rmse 2.961512e-06  max resid 6.496092e-06 
## ... Similar to previous best
## Run 16 stress 0.03824315 
## ... Procrustes: rmse 2.753349e-05  max resid 7.151864e-05 
## ... Similar to previous best
## Run 17 stress 0.03824314 
## ... Procrustes: rmse 9.23827e-06  max resid 2.359375e-05 
## ... Similar to previous best
## Run 18 stress 0.03824314 
## ... Procrustes: rmse 9.383536e-06  max resid 2.337688e-05 
## ... Similar to previous best
## Run 19 stress 0.08732186 
## Run 20 stress 0.1053096 
## *** Best solution repeated 16 times
pp_nmdsV <- plot_ordination(physeq.objV, nmds.obj, color = "condition",
                      title = str_c("NMDS, stress=", format(nmds.obj$stress, digits = 3)))
pp_nmdsV + geom_point(alpha = 0.2, size = 10)

3.3.6 Exercise - directly compare distances between methods

We have data processed in two ways, either with dada2 or vsearch. How does this influence the data, e.g. the dissimilarity between the samples? In ordination plots we only see an approximation to the distances. If we want to compare the methods, we should look directly at the distances themselves, not just some ordination plot.

In our data there are 5 samples from each condition. It would be natural to expect these 5 samples to be quite similar to each other, given that they come from the same condition. But is this affected by whether we used dada2 or vsearch for processing the data?

To quantify this, start with the dada2 data, and compute distances between all pairs of samples using the Bray-Curtis dissimilarity. Then we collect only those Bray-Curtis values that are between samples from the same condition, and store these in a table where we also have columns indicating the condition and the method. Here is a code skeleton for doing this:

# Assume you have phyloseq objects from above
sample.dta <- sample_data(___)
cond <- unique(___$condition)

# All Bray-Curtis values, stored as a matrix
bc.mat <- distance(___, method = ___) %>% 
  as.matrix()

# Extracting those from within conditions
bc_dada2.tbl <- NULL
for(i in 1:___){
  idx <- which(sample.dta$___ == cond[i])
  bc_dada2.tbl <- tibble(bray_curtis = as.numeric(as.dist(bc.mat[idx,idx])),
                   condition = ___,
                   method = "___") %>% 
    bind_rows(bc_dada2.tbl)
}

Then you do the same for the vsearch data. Finally, put the two tables together (bind_rows()), and make a plot of the Bray-Curtis distances for each condition, colored by method:

ggplot(___) +
  geom_jitter(aes(x = ___, y = ___, color = ___), 
              width = 0.1, size = 4, alpha = 0.5)

Do you find any systematic difference between the methods?

3.3.7 Exercise solution

sample.dta <- sample_data(physeq.obj)
cond <- unique(sample.dta$condition)

# The dada2
# All Bray-Curtis values, stored as a matrix
bc.mat <- distance(physeq.obj, method = "bray") %>%
  as.matrix()
# Extracting those from within conditions
bc_dada2.tbl <- NULL
for(i in 1:length(cond)){
  idx <- which(sample.dta$condition == cond[i])
  bc_dada2.tbl <- tibble(bray_curtis = as.numeric(as.dist(bc.mat[idx,idx])),
                   condition = cond[i],
                   method = "dada2") %>%
    bind_rows(bc_dada2.tbl)
}

# The vsearch
# All Bray-Curtis values, stored as a matrix
bc.mat <- distance(physeq.objV, method = "bray") %>%
  as.matrix()
# Extracting those from within conditions
bc_vsearch.tbl <- NULL
for(i in 1:length(cond)){
  idx <- which(sample.dta$condition == cond[i])
  bc_vsearch.tbl <- tibble(bray_curtis = as.numeric(as.dist(bc.mat[idx,idx])),
                   condition = cond[i],
                   method = "vsearch") %>%
    bind_rows(bc_vsearch.tbl)
}

# Putting into same table and plotting
bind_rows(bc_dada2.tbl, bc_vsearch.tbl) %>%
  ggplot() +
  geom_jitter(aes(x = condition, y = bray_curtis, color = method),
              width = 0.1, size = 4, alpha = 0.5)

We cannot claim to see a clear difference between the methods here. This is good news, since then the variation between samples from the same condition does not depend on how we processed the data.



3.4 Melting the phyloseq object

It is often desirable to take a closer look at certain taxa, for example by looking at the relative abundance. Even if phyloseq has some plotting functions, we may end up wanting something special, and we would like to exploit our expertise in the general use of ggplot. An approach is then to ‘melt’ the complex physeq.obj into a big table:

ps.tbl <- psmelt(physeq.obj)

Inspect this table. Notice there is a column Abundance containing all the read counts from the otu_table, and the remaining columns are from the additional tables. We may now use this to create our own plots, e.g. a bar plot with relative abundances

pp_ra <- ggplot(ps.tbl) +
  geom_col(aes(x = Sample, y = Abundance, fill = Genus), position = "fill") +
  theme(axis.text.x = element_text(angle = 90)) +
  guides(fill = guide_legend(keyheight = 0.1))
print(pp_ra)

We can now divide this by the different conditions:

pp_rac <- ggplot(ps.tbl) +
  geom_col(aes(x = Sample, y = Abundance, fill = Genus), position = "fill") +
  facet_grid(~condition, scales = "free", space = "free") +
  theme(axis.text.x = element_text(angle = 90)) +
  guides(fill = guide_legend(keyheight = 0.1))
print(pp_rac)

Can you see something familiar?

3.4.1 Exercise - Relative abundance plot

Use the vsearch results, and the phyloseq object physeq.objV, to make relative abundance barplots at the phylum level, both overall and divided by condition, similar to the genus-level plots above.


3.4.2 Exercise solution

ps.tbl <- psmelt(physeq.objV)

pp_raV <- ggplot(ps.tbl) +
  geom_col(aes(x = Sample, y = Abundance, fill = phylum), position = "fill") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("The overall relative abundance for Phyla") +
  guides(fill = guide_legend(keyheight = 0.1))
pp_raV

pp_racV <- ggplot(ps.tbl) +
  geom_col(aes(x = Sample, y = Abundance, fill = phylum), position = "fill") +
  facet_grid(~condition, scales = "free", space = "free") +
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("The realtive abundance for Phyla divded by condition") +
  guides(fill = guide_legend(keyheight = 0.1))
pp_racV





4 Normalizations

An important aspect we have not discussed so far, but which is usually important to consider when dealing with complex count data like ours, is normalization. All read count matrices suffer from the fact that library sizes differ between samples. The library size is simply the total number of reads in a sample. It reflects our sequencing effort, and is a purely technical issue with no relation to the composition of the underlying community. The data in our physeq.obj are all such raw read counts, which is why we saw the variable bar heights above.

By normalization we refer to any attempts to remove this library size effect for subsequent analyses. This is a topic where strong opinions have been communicated over recent years!

First some notation. Consider the read count matrix \(X\) (otu_table in phyloseq) with \(p\) samples in the columns and \(n\) OTUs in the rows. The read count for OTU \(i\) and sample \(j\) we denote \(X_{i,j}\). The read counts for sample \(j\) are denoted \[X_{*j} = (X_{1,j}, X_{2,j},...,X_{n,j})\]

where we have data for \(n\) OTUs. Think of this vector as column \(j\) in the otu_table. The normalized version of \(X_{*j}\) we denote \(Y_{*j}\). It contains the same number of values, but these differ from those in \(X_{*j}\).

4.1 The TSS

By dividing all read counts in a sample by the total for that sample, we get relative abundances, i.e. the sum of all abundances is 1.0 for all samples. This is one way of normalizing the data to get rid of the library size effect, and is often referred to as the TSS (total sum scaling) normalization. This means \[Y_{*j} = \frac{X_{*j}}{\sum_{i=1}^{n}{X_{i,j}}}\] We will, however, lose some information by doing this. A sample that has been sequenced extensively will in general contain more taxa. This is not because the community actually contains more taxa, but because we have explored it more intensely. Thus, by not taking this into consideration we may reach some bad conclusions. Some alpha diversity statistics are sensitive to this.

4.2 The rarefying

An alternative way of normalizing is rarefying. This basically means randomly re-sampling all samples down to the same depth, the smallest library size, i.e. the sum of read counts becomes identical for all samples. Thus, it is very similar to TSS, but results in read counts instead of relative values. All samples sum to the same read count value.

There is no simple formula describing this, but we can easily implement it in R. Let us assume the smallest library size of our samples is S. Then:

for(j in 1:ncol(X)){
  Y[,j] <- rmultinom(1, size = S, prob = X[,j])
}

where we assume X and Y are existing (read count) matrices of proper dimensions. This procedure corresponds to re-sampling with replacement. This means the read count for some taxon may be larger in \(Y_{*j}\) than it was in \(X_{*j}\), especially if S is close to the total sum in \(X_{*j}\). It is also possible to do re-sampling without replacement, which we will look at in an exercise; a small sketch is given below.
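As a sketch of the without-replacement version, we may list every read as one observation of its OTU, draw S of them, and count again (assuming the counts in Xj are integers):

rarefy_no_replace <- function(Xj, S){
  reads <- rep(seq_along(Xj), times = Xj)    # one entry per read, labelled by its OTU index
  kept <- sample(reads, size = S, replace = FALSE)
  tabulate(kept, nbins = length(Xj))         # count the kept reads per OTU
}
# Y[, j] <- rarefy_no_replace(X[, j], S)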

It should be noted that a ‘method’ called Common Sum Scaling (COM) is simply:

  • Compute TSS
  • Multiply all values by the smallest library size (total read count) among the samples
  • Round to the nearest integer

This produces ‘rarefied’ data by simply scaling the TSS values. It seems to me pointless, and we could stick to the TSS without multiplying all values by the same constant.

Here is an article by the people behind the phyloseq package that dismisses rarefying altogether, as a rather stupid thing to do: https://www.ncbi.nlm.nih.gov/pubmed/24699258. The reason is that we throw away data. In addition, if we do the random sampling, we also get additional noise.

Even if this approach is debated, it is still used, and at least for some analyses it is relevant. By throwing away data, you make all analyses harder for yourself, but this is also considered a ‘safe’ error to make in science. It lowers the probability of both true positive and false positive findings, which is considered better than the opposite. Here is an article where rarefying is compared to other normalizations. Even the phyloseq package has a function rarefy_even_depth()!

4.3 The CSS

The Cumulative Sum Scaling is similar to TSS, but instead of dividing by the total sum in each sample, we sum only the smaller values. The reasoning here is that some taxa tend to be over-sampled, i.e. they have more reads than they should, and we should not include the largest read counts when doing the normalization.

For sample \(j\), assume we have sorted the read counts in ascending order. Then the formula is: \[Y_{*j} = \frac{X_{*j}}{\sum_{i=1}^{l}{X_{i,j}}}\] where the crucial point is we sum to \(l<n\) in the denominator, i.e. we do not divide by the total read count, but some smaller value.

Which value do we use for \(l\)? If we use \(l=n\) we are back to TSS. First compute the median read count over all samples, \(m\). Then, compute the \(l\)-quantile for each sample, \(q_{jl}\). This is the \(l\)th largest value, i.e. if \(l=0.75\) we pick the value at position \(0.75\cdot n\) among the \(n\) sorted values. If we compare these quantiles to the median \(m\), we typically observe that as long as \(l\) is around \(0.5\), the differences to the median are all near \(0\). As we increase \(l\) the differences to the median grow, and they also become very different from each other (larger variation). Thus, we stop when \(l\) produces differences to the median that reach a tolerated variation set by the user. Note that if all samples are fairly similar to each other with respect to composition, we may end up using \(l=n\), i.e. a simple TSS normalization.

Note that a CSS normalization will in general not eliminate the library size effect, only lower it.
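Just to illustrate the idea, here is a sketch of a CSS-like normalization of one sample with a fixed quantile (the real implementation, in the metagenomeSeq package, chooses the quantile adaptively as described above):

CSS_fixed <- function(Xj, l = 0.5){
  q <- quantile(Xj[Xj > 0], probs = l)   # the l-quantile of the nonzero counts
  Xj / sum(Xj[Xj <= q])                  # divide by the sum of the smaller counts only
}
# physeq.css <- transform_sample_counts(physeq.obj, CSS_fixed)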

4.4 The RLE

The Relative Log Expression is an idea from gene expression (RNAseq) data analysis. It is described in the articles behind the R package DESeq2, which is used for analyzing gene expression read count data. Our read count data are not from gene expression studies, but are still read counts, and the idea may be used. We will revisit the DESeq2 in module 12.

To understand this we need some more notation. The geometric mean (as opposed to the arithmetic mean) of some vector with \(n\) elements is the product of the values, and then the \(n\)th root of this, i.e.

\[g(x) = (x_1\cdot x_2\cdots x_n)^{1/n}\] Next, let \(X_{i*}\) be the read count values for OTU number \(i\) across all \(p\) samples in our read count matrix (this is a row in the otu_table). For each of these vectors we compute the geometric mean \[g(X_{i*}) = (X_{i,1}\cdot X_{i,2}\cdots X_{i,p})^{1/p}\] Note that if you take the logarithm on both sides, the right hand side is simply the arithmetic mean (average) of the log-transformed read counts. Think of \(g(X_{i*})\) as some sort of average (log) read count for OTU number \(i\).

Next, we want to compute the size factor for sample \(j\). First we compute \[S_j = median(\frac{X_{1,j}}{g(X_{1*})}, \frac{X_{2,j}}{g(X_{2*})},..., \frac{X_{n,j}}{g(X_{n*})} )\] Note that the elements we take the median of are the read counts for each OTU scaled by the ‘average’ (geometric mean) read count for that OTU in the entire data set. Thus, if sample \(j\) has a large library size, \(S_j\) becomes large, and smaller if the number of reads is low. The size factor is then

\[Sz_j = \frac{S_j}{g(S_1, S_2,...,S_p)}\] where the denominator is the geometric mean of the \(S_j\) values (one from each of the \(p\) samples). Again, the \(Sz_j\) reflects the library size, a larger \(Sz_j\) means sample \(j\) contains many reads compared to the rest (and vice versa). It is also a value around 1, above 1 if the sample is large and below 1 if the sample is small compared to the rest. Then (finally) the normalization is

\[Y_{*j} = \frac{\frac{1}{p}\sum_{k=1}^{p}Sz_k}{Sz_j} X_{*j}\] We observe that the normalized read counts are the original ones multiplied by a factor. If the size factor \(Sz_j\) is larger than the average over the samples, the factor we multiply with above is less than 1, i.e. read counts are reduced. Conversely, if the library size is small, \(Sz_j\) is less than average, and we scale the read counts up.
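Here is a sketch of the whole RLE computation on a read count matrix X with OTUs in rows and samples in columns, assuming all counts are positive (zeros must be imputed first, see the next subsection):

geom_mean <- function(x) exp(mean(log(x)))

rle_normalize <- function(X){
  g <- apply(X, 1, geom_mean)            # 'average' read count per OTU
  S <- apply(X / g, 2, median)           # per-sample median of the scaled counts
  Sz <- S / geom_mean(S)                 # size factors, centered around 1
  sweep(X, 2, mean(Sz) / Sz, FUN = "*")  # scale each sample by its factor
}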

4.5 Imputing zeros

There is one issue we should mention here. In the last method we computed the geometric mean of some vector: \[g(x) = (x_1\cdot x_2\cdots x_n)^{1/n}\] Notice that if one of the elements is \(0\), then the entire geometric mean is \(0\)! In our read count matrices the value \(0\) occurs quite frequently, i.e. some OTU is not observed in some sample. This seems to be devastating to the entire procedure (some of the \(S_j\)’s become infinite!).

Zeros occur in a read count matrix for at least two potential reasons:

  • An OTU has zero reads because this OTU is simply not present in the sample
  • An OTU has zero counts because we haven’t sequenced deep enough to detect it

The first is denoted an essential zero while the second is a count zero. In the latter case, we should actually not treat the value as zero, but as some small value indicating presence, but in too small quantities to be detected (yet). If you have several comparable samples, you will typically find that in the samples where you have the most reads, you also tend to have more OTUs. This is an indication of count zeros.

In such cases we impute the zeros, i.e. we replace them by some positive value indicating presence, but at very low abundance. The simplest approach is to add pseudo counts. This means we add a small and fixed value to all cells in the matrix, e.g. we add \(1\) read count extra to all cells (also the nonzero cells).
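In phyloseq this is a one-liner using transform_sample_counts():

physeq.pseudo <- transform_sample_counts(physeq.obj, function(Xj) Xj + 1)  # add a pseudo-count of 1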

A more sophisticated approach is to use the data to impute a more ‘correct’ value in each cell. If we look at all data we may typically find that some zero OTUs should have a larger imputed value than others. The R package zCompositions has some nice methods for doing this.

NOTE: Imputed data are no longer raw data. Imputing data is typically a good idea prior to plotting and explorative analyses. For some analyses you should be careful about imputing data. Some of the alpha diversity and distance methods need information about the presence/absence of the OTUs, and this is obviously distorted by any imputation. On the other hand, if our zeros are mostly count zeros (not essential zeros), any analysis method relying on presence/absence type of information is perhaps not a good idea anyway?



4.6 The CLR

The Centered Log Ratio transform also uses geometric means, but is much simpler than the RLE. The formula is simply \[Y_{*j} = \log( \frac{X_{*j}}{g(X_{*j})} )\] This is often referred to as Aitchison’s transform of compositional data in the literature, see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5695134/. By compositional data we refer to data where the relative composition (between OTUs) is informative, while the total sum in each sample is not. The CLR method can be seen as an extension of the TSS. It is straightforward to show that if you first do TSS on your data, and then CLR, you get the exact same result as doing CLR right away. We say the CLR is scale-invariant. After CLR the data are no longer ‘trapped inside a simplex’, but can take all kinds of values, also negative. The data are now log-ratios, and the log of a ratio smaller than 1 is negative.
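We can verify the scale-invariance numerically with a tiny sketch, using the CLR written as the log counts minus their mean:

clr <- function(x) log(x) - mean(log(x))    # same as log(x / geometric mean of x)
x <- c(5, 20, 75, 400)                      # some made-up positive counts
all.equal(clr(x), clr(x / sum(x)))          # TSS first, then CLR, gives the same result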

The fact that read count data are compositional is not always acknowledged, and here is a paper where this is pointed out and discussed. There is a range of methods developed for compositional data, and this is slowly finding its way into the bioinformatics of read count table analysis. This figure is copied from the paper linked to above:

The essence to remember is that the number of reads we sequence is limited/fixed by our procedure/technology, and not by the composition of the taxa in the community. Thus, when the proportion of taxon A increases from sample 1 to 2, it is not necessarily because there is more of taxon A, but could be because there is less of taxon B. This negative correlation structure imposes some challenges to methods that would work fine in other, non-compositional, cases. The CLR transform has been shown to have some nice properties for compositional data.

4.6.1 Exercise - normalize by TSS

In the phyloseq package we find the function transform_sample_counts(). This we may use to normalize the read count values for each sample. It takes two essential arguments

  • The phyloseq object with the data.
  • A function to be used on each sample in the otu_table.

Perform the TSS normalization this way. It means we need a function that takes as input a vector of read counts (the \(X_{*j}\) above) and then computes the normalized vector (\(Y_{*j}\)) by simply dividing by its sum. Here is a small R function doing this

TSS <- function(Xj){
  Yj <- Xj / sum(Xj)
  return(Yj)
}

Add the code that normalizes physeq.obj and physeq.objV, and stores the results in physeq.tss and physeq.tssV, respectively. Then, use the function otu_table() to extract the normalized read count matrices, and compute the sum for each sample vector to verify the values are TSS normalized.

4.6.2 Exercise solution

physeq.tss <- transform_sample_counts(physeq.obj, TSS)
colSums(otu_table(physeq.tss))
## sample_1A sample_2A sample_3A sample_4A sample_5A sample_1B sample_2B sample_3B 
##         1         1         1         1         1         1         1         1 
## sample_4B sample_5B sample_1C sample_2C sample_3C sample_4C sample_5C 
##         1         1         1         1         1         1         1
physeq.tssV <- transform_sample_counts(physeq.objV, TSS)
colSums(otu_table(physeq.tssV))
## sample_1A sample_1B sample_1C sample_2A sample_2B sample_2C sample_3A sample_3B 
##         1         1         1         1         1         1         1         1 
## sample_3C sample_4A sample_4B sample_4C sample_5A sample_5B sample_5C 
##         1         1         1         1         1         1         1

4.6.3 Exercise - normalize by rarefying

Use the rarefy_even_depth() function to compute rarefied read counts, where we sample without replacement, and store this in physeq.rar and physeq.rarV. Again, compute the sum of values in all samples in the normalized otu_table.

4.6.4 Exercise solution

physeq.rar <- rarefy_even_depth(physeq.obj, replace = F)
colSums(otu_table(physeq.rar))
## sample_1A sample_2A sample_3A sample_4A sample_5A sample_1B sample_2B sample_3B 
##      1901      1901      1901      1901      1901      1901      1901      1901 
## sample_4B sample_5B sample_1C sample_2C sample_3C sample_4C sample_5C 
##      1901      1901      1901      1901      1901      1901      1901
physeq.rarV <- rarefy_even_depth(physeq.objV, replace = F)
colSums(otu_table(physeq.rarV))
## sample_1A sample_1B sample_1C sample_2A sample_2B sample_2C sample_3A sample_3B 
##     10412     10412     10412     10412     10412     10412     10412     10412 
## sample_3C sample_4A sample_4B sample_4C sample_5A sample_5B sample_5C 
##     10412     10412     10412     10412     10412     10412     10412

4.6.5 Exercise - normalize by CLR

Make a function that normalizes a sample by the CLR. Make adding pseudo-counts an option, and use 1 as default value. Then, use the transform_sample_counts() from above to normalize, and store this in physeq.clr and physeq.clrV, respectively. Again, compute the sum for each normalized sample.

Hint: Instead of computing the geometric mean, compute directly its logarithm. Use the rules for logarithms from introductory mathematics.

4.6.6 Exercise solution

CLR <- function(Xj, q = 1){
  lXj <- log(Xj + q)
  lg <- sum(lXj) / length(lXj)
  Yj <- lXj - lg
  return(Yj)
}
physeq.clr <- transform_sample_counts(physeq.obj, CLR)
colSums(otu_table(physeq.clr))
##     sample_1A     sample_2A     sample_3A     sample_4A     sample_5A 
##  1.332268e-15  5.329071e-15  7.105427e-15 -1.065814e-14 -8.437695e-15 
##     sample_1B     sample_2B     sample_3B     sample_4B     sample_5B 
##  1.887379e-15  5.773160e-15 -5.773160e-15 -3.996803e-15  4.440892e-16 
##     sample_1C     sample_2C     sample_3C     sample_4C     sample_5C 
##  1.332268e-15 -6.217249e-15 -1.332268e-15  5.995204e-15  1.199041e-14
physeq.clrV <- transform_sample_counts(physeq.objV, CLR)
colSums(otu_table(physeq.clrV))
##     sample_1A     sample_1B     sample_1C     sample_2A     sample_2B 
## -4.440892e-15 -1.110223e-14  1.110223e-14 -4.440892e-15 -4.440892e-15 
##     sample_2C     sample_3A     sample_3B     sample_3C     sample_4A 
##  4.884981e-15  3.552714e-15  9.769963e-15 -1.332268e-14 -1.421085e-14 
##     sample_4B     sample_4C     sample_5A     sample_5B     sample_5C 
##  8.881784e-16  4.440892e-16  3.108624e-15 -6.661338e-15  1.776357e-15