I have had the 2014 paper “Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissable” by McMurdie and Holmes sitting on my desk for a while now. Yesterday I finally got around to reading it and was immediately a little skeptical as a result of the hyperbole with which they criticized the common practice of subsampling* libraries of 16S rRNA gene reads during microbial community analysis. The logic of that practice proceeds like this:
To develop 16S rRNA gene data (or any other marker gene data) describing a microbial community we generally collect an environmental sample, extract the DNA, amplify it, and sequence the amplified material. The extract might contain the DNA from 1 billion microbial cells present in the environment. Only a tiny fraction (<< 1 %) of these DNA molecules actually get sequenced; a “good” sequence run might contain only tens of thousands of sequences per sample. After quality control and normalizing for multiple copies of the 16S rRNA gene it will contain far fewer. Thus the final dataset contains a very small random sample of the original population of DNA molecules.
Most sequence-based microbial ecology studies involve some kind of comparison between samples or experimental treatments. It makes no sense to compare the abundance of taxa between two datasets of different sizes, as the dissimilarity between the datasets will appear much greater than it actually is. One solution is to normalize by dividing the abundance of each taxa by the total reads in the dataset to get their relative abundance. In theory this works great, but has the disadvantage that it does not take into account that a larger dataset has sampled the original population of DNA molecules deeper. Thus more rare taxa might be represented. A common practice is to reduce the amount of information present in the larger dataset by subsampling to the size of the smaller dataset. This attempts to approximate the case where both datasets undertake the same level of random sampling.
McMurdie and Holmes argue that this approach is indefensible for two common types of analysis; identifying differences in community structure between multiple samples or treatments, and identifying differences in abundance for specific taxa between samples and treatments. I think the authors do make a reasonable argument for the latter analysis; however, at worst the use of subsampling and/or normalization simply reduces the sensitivity of the analysis. I suspect that dissimilarity calculations between treatments or samples using realistic datasets are much less sensitive to reasonable subsampling than the authors suggest. I confess that I am (clearly) very far from being a statistician and there is a lot in McMurdie and Holmes, 2014 that I’m still trying to digest. I hope that our colleagues in statistics and applied mathematics continue optimizing these (and other) methods so that microbial ecology can improve as a quantitative science. There’s no need however, to get everyone spun up without a good reason. To try and understand if there is a good reason, at least with respect to my data and analysis goals, I undertook some further exploration. I would strongly welcome any suggestions, observations, and criticisms to this post!
My read of McMurdi and Homes is that the authors object to subsampling because it disregards data that is present that could be used by more sophisticated (i.e. parametric) methods to estimate the true abundance of taxa. This is true; data discarded from larger datasets does have information that can be used to estimate the true abundance of taxa among samples. The question is how much of a difference does it really make? McMurdi and Holmes advocate using methods adopted from transcriptome analysis. These methods are necessary for transcriptomes because 1) the primary objective of the study is not usually to see if one transcriptome is different from another, but which genes are differentially expressed and 2) I posit that the abundance distribution of transcript data is different than the taxa abundance distribution. An example of this can be seen in the plots below.
In both the cases the data is log distributed, with a few very abundant taxa or PFAMs and many rare ones. What constitutes “abundant” in the transcript dataset however, is very different than “abundant” in the community structure dataset. The transcript dataset is roughly half the size (n = 9,000 vs. n = 21,000), nonetheless the most abundant transcript has an abundance of 209. The most abundant OTU has an abundance of 6,589, and there are several very abundant OTUs. Intuitively this suggests to me that the structure of the taxon dataset is much more reproducible via subsampling than the structure of the transcript dataset, as the most abundant OTUs have a high probability of being sampled. The longer tail of the transcript data contributes to this as well, though of course this tail is controlled to a large extent by the classification scheme used (here PFAMs).
To get an idea of how reproducible the underlying structure was for the community structure and transcript data I repeatedly subsampled both (with replacement) to 3000 and 1300 observations, respectively. For the community structure data this is about the lower end of the sample size I would use in an actual analysis – amplicon libraries this small are probably technically biased and should be excluded (McMurdi and Holmes avoid this point at several spots in the paper). The results of subsampling are shown in these heatmaps, where each row is a new subsample. For brevity only columns summing to an abundance > 100 are shown.
In these figures the warmer colors indicate a higher number of observations for that OTU/PFAM. The transcript data is a lot less reproducible than the community structure data; the Bray-Curtis dissimilarity across all iterations maxes out at 0.11 for community structure and 0.56 for the transcripts. The extreme case would be if the data were normally distributed (i.e. few abundant and few rare observations, many intermediate observations). Here’s what subsampling does to normally distributed data (n = 21,000, mean = 1000, sd = 200):
If you have normally distributed data don’t subsample!
For the rest of us it seems that for the test dataset used here community structure is at least somewhat reproducible via subsampling. There are differences between iterations however, what does this mean in the context of the larger study?
The sample was drawn from a multiyear time series from a coastal marine site. The next most similar sample (in terms of community composition and ecology) was, not surprisingly, a sample taken one week later. By treating this sample in an identical fashion, then combining the two datasets, it was possible to evaluate how easy it is to tell the two samples apart after subsampling. In this heatmap the row colors red and black indicate iterations belonging to the two samples:
As this heatmap shows, for these two samples there is perfect fidelity. Presumably with very similar samples this would start to break down, determining how similar samples need to be before they cannot be distinguished at a given level of subsampling would be a useful exercise. The authors attempt to do this in Simlation A/Figure 5 in the paper, but it isn’t clear to me why their results are so poor – particularly given very different sample types and a more sophisticated clustering method than I’ve applied here.
As a solution – necessary for transcript data, normally distributed data, or for analyses of differential abundance, probably less essential for comparisons of community structure – the authors propose a mixture model approach that takes in account variance across replicates to estimate “real” OTU abundance. Three R packages that can do this are mentioned; edgeR, DESeq, and metagenomeSeq. The problem with these methods – as I understand them – is that they require experimental replicates. According to the DESeq authors, technical replicates should be summed, and samples should be assigned to treatment pools (e.g. control, treatment 1, treatment 2…). Variance is calculated within each pool and this is used to to model differences between pools. This is great for a factor-based analysis, as is common in transcriptome analysis or human microbial ecology studies. If you want to find a rare, potentially disease-causing strain differently present between a healthy control group and a symptomatic experimental group for example, this is a great way to go about it.
There are many environmental studies for which these techniques are not useful however, as it may be impractical to collect experimental replicates. For example it is both undesirable and impractical to conduct triplicate or even duplicate sampling in studies focused on high-resolution spatial or temporal sampling. Sequencing might be cheap now, but time and reagents are not. Some of the methods I’m working with are designed to aggregate samples into higher-level groups – at which point these methods could be applied by treating within-group samples as “replicates” – but this is only useful if we are interested in testing differential abundance between groups (and doesn’t solve the problem of needing to get samples grouped in the first place).
These methods can be used to explore differential abundance in non-replicated samples, however, they are grossly underpowered when used without replication. Here’s an analysis of differential abundance between the sample in the first heatmap above and its least similar companion from the same (ongoing) study using DESeq. You can follow along with any standard abundance table where the rows are samples and the columns are variables.
library(DESeq) ## DESeq wants to oriented opposite how community abundance ## data is normally presented (e.g. to vegan) data.dsq <- t(data) ## DESeq requires a set of conditions which are factors. Ideally ## this would be control and treatment groups, or experimental pools ## or some such, but we don't have that here. So the conditions are ## unique column names (which happen to be dates). conditions <- factor(as.character(colnames(data.dsq))) ## As a result of 16S rRNA gene copy number normalization abundance ## data is floating point numbers, convert to integers. data.dsq <- ceiling(data.dsq, 0) ## Now start working with DESeq. data.ct <- newCountDataSet(as.data.frame(data.dsq), conditions = conditions) data.size <- estimateSizeFactors(data.ct) ## This method and sharing mode is required for unreplicated samples. data.disp <- estimateDispersions(data.size, method = 'blind', sharingMode="fit-only") ## And now we can execute a test of differential abundance for the ## two samples used in the above example. test <- nbinomTest(data.disp, '2014-01-01', '2014-01-06') test <- na.omit(test) ## Plot the results. plot(test$baseMeanA, test$baseMeanB, #log = 'xy', pch = 19, ylim = c(0, max(test$baseMeanB)), xlim = c(0, max(test$baseMeanA)), xlab = '2014-01-01', ylab = '2009-12-14') abline(0, 1) ## If we had significant differences we could then plot them like ## this: points(test$baseMeanA[which(test$pval < 0.05)], test$baseMeanB[which(test$pval < 0.05)], pch = 19, col = 'red')
As we would expect there are quite a few taxa present in high abundance in one sample and not the other, however, none of the associated p-values are anywhere near significant. I’m tempted to try to use subsampling to create replicates, which would allow an estimate of variance across subsamples and access to greater statistical power. This is clearly not as good as real biological replication, but we have to work within the constraints of our experiments, time, and funding…
*You might notice that I’ve deliberately avoided using the terms “microbiome” and “rarefying” here. In one of his comics Randall Munroe asserted that the number of made up words in a book is inversely proportional to the quality of the book, similarly I strongly suspect that the validity of a sub-discipline is inversely proportional to the number of jargonistic words that community makes up to describe its work. As a member of said community, what’s wrong with subsampling and microbial community??