Extraordinary Years Now the Normal Years: Scientists Survey Radical Melt in Arctic - Washington Post
I have had the 2014 paper “Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissable” by McMurdie and Holmes sitting on my desk for a while now. Yesterday I finally got around to reading it and was immediately a little skeptical as a result of the hyperbole with which they criticized the common practice of subsampling* libraries of 16S rRNA gene reads during microbial community analysis. The logic of that practice proceeds like this:
To develop 16S rRNA gene data (or any other marker gene data) describing a microbial community we generally collect an environmental sample, extract the DNA, amplify it, and sequence the amplified material. The extract might contain the DNA from 1 billion microbial cells present in the environment. Only a tiny fraction (<< 1 %) of these DNA molecules actually get sequenced; a “good” sequence run might contain only tens of thousands of sequences per sample. After quality control and normalizing for multiple copies of the 16S rRNA gene it will contain far fewer. Thus the final dataset contains a very small random sample of the original population of DNA molecules.
Most sequence-based microbial ecology studies involve some kind of comparison between samples or experimental treatments. It makes no sense to compare the abundance of taxa between two datasets of different sizes, as the dissimilarity between the datasets will appear much greater than it actually is. One solution is to normalize by dividing the abundance of each taxa by the total reads in the dataset to get their relative abundance. In theory this works great, but has the disadvantage that it does not take into account that a larger dataset has sampled the original population of DNA molecules deeper. Thus more rare taxa might be represented. A common practice is to reduce the amount of information present in the larger dataset by subsampling to the size of the smaller dataset. This attempts to approximate the case where both datasets undertake the same level of random sampling.
McMurdie and Holmes argue that this approach is indefensible for two common types of analysis; identifying differences in community structure between multiple samples or treatments, and identifying differences in abundance for specific taxa between samples and treatments. I think the authors do make a reasonable argument for the latter analysis; however, at worst the use of subsampling and/or normalization simply reduces the sensitivity of the analysis. I suspect that dissimilarity calculations between treatments or samples using realistic datasets are much less sensitive to reasonable subsampling than the authors suggest. I confess that I am (clearly) very far from being a statistician and there is a lot in McMurdie and Holmes, 2014 that I’m still trying to digest. I hope that our colleagues in statistics and applied mathematics continue optimizing these (and other) methods so that microbial ecology can improve as a quantitative science. There’s no need however, to get everyone spun up without a good reason. To try and understand if there is a good reason, at least with respect to my data and analysis goals, I undertook some further exploration. I would strongly welcome any suggestions, observations, and criticisms to this post!
My read of McMurdi and Homes is that the authors object to subsampling because it disregards data that is present that could be used by more sophisticated (i.e. parametric) methods to estimate the true abundance of taxa. This is true; data discarded from larger datasets does have information that can be used to estimate the true abundance of taxa among samples. The question is how much of a difference does it really make? McMurdi and Holmes advocate using methods adopted from transcriptome analysis. These methods are necessary for transcriptomes because 1) the primary objective of the study is not usually to see if one transcriptome is different from another, but which genes are differentially expressed and 2) I posit that the abundance distribution of transcript data is different than the taxa abundance distribution. An example of this can be seen in the plots below.
In both the cases the data is log distributed, with a few very abundant taxa or PFAMs and many rare ones. What constitutes “abundant” in the transcript dataset however, is very different than “abundant” in the community structure dataset. The transcript dataset is roughly half the size (n = 9,000 vs. n = 21,000), nonetheless the most abundant transcript has an abundance of 209. The most abundant OTU has an abundance of 6,589, and there are several very abundant OTUs. Intuitively this suggests to me that the structure of the taxon dataset is much more reproducible via subsampling than the structure of the transcript dataset, as the most abundant OTUs have a high probability of being sampled. The longer tail of the transcript data contributes to this as well, though of course this tail is controlled to a large extent by the classification scheme used (here PFAMs).
To get an idea of how reproducible the underlying structure was for the community structure and transcript data I repeatedly subsampled both (with replacement) to 3000 and 1300 observations, respectively. For the community structure data this is about the lower end of the sample size I would use in an actual analysis – amplicon libraries this small are probably technically biased and should be excluded (McMurdi and Holmes avoid this point at several spots in the paper). The results of subsampling are shown in these heatmaps, where each row is a new subsample. For brevity only columns summing to an abundance > 100 are shown.
In these figures the warmer colors indicate a higher number of observations for that OTU/PFAM. The transcript data is a lot less reproducible than the community structure data; the Bray-Curtis dissimilarity across all iterations maxes out at 0.11 for community structure and 0.56 for the transcripts. The extreme case would be if the data were normally distributed (i.e. few abundant and few rare observations, many intermediate observations). Here’s what subsampling does to normally distributed data (n = 21,000, mean = 1000, sd = 200):
If you have normally distributed data don’t subsample!
For the rest of us it seems that for the test dataset used here community structure is at least somewhat reproducible via subsampling. There are differences between iterations however, what does this mean in the context of the larger study?
The sample was drawn from a multiyear time series from a coastal marine site. The next most similar sample (in terms of community composition and ecology) was, not surprisingly, a sample taken one week later. By treating this sample in an identical fashion, then combining the two datasets, it was possible to evaluate how easy it is to tell the two samples apart after subsampling. In this heatmap the row colors red and black indicate iterations belonging to the two samples:
As this heatmap shows, for these two samples there is perfect fidelity. Presumably with very similar samples this would start to break down, determining how similar samples need to be before they cannot be distinguished at a given level of subsampling would be a useful exercise. The authors attempt to do this in Simlation A/Figure 5 in the paper, but it isn’t clear to me why their results are so poor – particularly given very different sample types and a more sophisticated clustering method than I’ve applied here.
As a solution – necessary for transcript data, normally distributed data, or for analyses of differential abundance, probably less essential for comparisons of community structure – the authors propose a mixture model approach that takes in account variance across replicates to estimate “real” OTU abundance. Three R packages that can do this are mentioned; edgeR, DESeq, and metagenomeSeq. The problem with these methods – as I understand them – is that they require experimental replicates. According to the DESeq authors, technical replicates should be summed, and samples should be assigned to treatment pools (e.g. control, treatment 1, treatment 2…). Variance is calculated within each pool and this is used to to model differences between pools. This is great for a factor-based analysis, as is common in transcriptome analysis or human microbial ecology studies. If you want to find a rare, potentially disease-causing strain differently present between a healthy control group and a symptomatic experimental group for example, this is a great way to go about it.
There are many environmental studies for which these techniques are not useful however, as it may be impractical to collect experimental replicates. For example it is both undesirable and impractical to conduct triplicate or even duplicate sampling in studies focused on high-resolution spatial or temporal sampling. Sequencing might be cheap now, but time and reagents are not. Some of the methods I’m working with are designed to aggregate samples into higher-level groups – at which point these methods could be applied by treating within-group samples as “replicates” – but this is only useful if we are interested in testing differential abundance between groups (and doesn’t solve the problem of needing to get samples grouped in the first place).
These methods can be used to explore differential abundance in non-replicated samples, however, they are grossly underpowered when used without replication. Here’s an analysis of differential abundance between the sample in the first heatmap above and its least similar companion from the same (ongoing) study using DESeq. You can follow along with any standard abundance table where the rows are samples and the columns are variables.library(DESeq) ## DESeq wants to oriented opposite how community abundance ## data is normally presented (e.g. to vegan) data.dsq <- t(data) ## DESeq requires a set of conditions which are factors. Ideally ## this would be control and treatment groups, or experimental pools ## or some such, but we don't have that here. So the conditions are ## unique column names (which happen to be dates). conditions <- factor(as.character(colnames(data.dsq))) ## As a result of 16S rRNA gene copy number normalization abundance ## data is floating point numbers, convert to integers. data.dsq <- ceiling(data.dsq, 0) ## Now start working with DESeq. data.ct <- newCountDataSet(as.data.frame(data.dsq), conditions = conditions) data.size <- estimateSizeFactors(data.ct) ## This method and sharing mode is required for unreplicated samples. data.disp <- estimateDispersions(data.size, method = 'blind', sharingMode="fit-only") ## And now we can execute a test of differential abundance for the ## two samples used in the above example. test <- nbinomTest(data.disp, '2014-01-01', '2014-01-06') test <- na.omit(test) ## Plot the results. plot(test$baseMeanA, test$baseMeanB, #log = 'xy', pch = 19, ylim = c(0, max(test$baseMeanB)), xlim = c(0, max(test$baseMeanA)), xlab = '2014-01-01', ylab = '2009-12-14') abline(0, 1) ## If we had significant differences we could then plot them like this: points(test$baseMeanA[which(test$pval < 0.05)], test$baseMeanB[which(test$pval < 0.05)], pch = 19, col = 'red')
As we would expect there are quite a few taxa present in high abundance in one sample and not the other, however, none of the associated p-values are anywhere near significant. I’m tempted to try to use subsampling to create replicates, which would allow an estimate of variance across subsamples and access to greater statistical power. This is clearly not as good as real biological replication, but we have to work within the constraints of our experiments, time, and funding…
*You might notice that I’ve deliberately avoided using the terms “microbiome” and “rarefying” here. In one of his comics Randall Munroe asserted that the number of made up words in a book is inversely proportional to the quality of the book, similarly I strongly suspect that the validity of a sub-discipline is inversely proportional to the number of jargonistic words that community makes up to describe its work. As a member of said community, what’s wrong with subsampling and microbial community??
The paprica pipeline was designed to infer the genomic content and genomic characteristics of a set of 16S rRNA gene reads. To enable this the paprica database organizes this information by phylogeny for many of the completed genomes in Genbank. In addition to metabolic inference this provides an opportunity to explore how genome content and genomic characteristics are organized phylogenetically. The following is a brief analysis of some genomic features using the paprica database and R. If you aren’t familiar with the paprica database this exercise will also familiarize you with some of its content and its organization.
The paprica pipeline and database can be obtained from Github here. In this post I’ll be using the database associated with version 0.3.1. The necessary files from the bacteria database (one could also conduct this analysis on the much smaller archaeal database) can be read into R as such:## Read in the pathways associated with the terminal nodes on the reference tree path <- read.csv('paprica/ref_genome_database/bacteria/terminal_paths.csv', row.names = 1) path[is.na(path)] <- 0 ## Read in the data associated with all completed genomes in Genbank data <- read.csv('paprica/ref_genome_database/bacteria/genome_data.final.csv', row.names = 1) ## During database creation genomes with duplicate 16S rRNA genes were removed, ## so limit to those that were retained data <- data[row.names(data) %in% row.names(path),] ## "path" is ordered by clade, meaning it is in top to bottom order of the reference tree, ## however, "data" is not, so order it here data <- data[order(data$clade),]
One fun thing to do at this point is to look at the distribution of metabolic pathways across the database. To develop a sensible view it is best to cluster the pathways according to which genomes they are found in.## The pathway file in the database is binary, so we use Jaccard for distance library('vegan') path.dist <- vegdist(t(path), method = 'jaccard') # distance between pathways (not samples!) path.clust <- hclust(path.dist)
The heatmap function is a bit cumbersome for this large matrix, so the visualization can be made using the image function.## Set a binary color scheme image.col <- colorRampPalette(c('white', 'blue'))(2) ## Image will order matrix in ascending order, which is not what we want here! image(t(data.matrix(path))[rev(path.clust$order),length(row.names(path)):1], col = image.col, ylab = 'Genome', xlab = 'Pathway', xaxt = 'n', yaxt = 'n') box()
There are a couple of interesting things to note in this plot. First, we can see the expected distribution of core pathways present in nearly all genomes, and the interesting clusters of pathways that are unique to a specific lineage. For clarity row names have been omitted from the above plot, but from within R you can pull out the taxa or pathways that interest you easily enough. Second, there are some genomes that have very few pathways. There are a couple of possible reasons for this that can be explored with a little more effort. One possibility is that these are poorly annotated genomes, or at least the annotation didn’t associate many or any coding sequences with either EC numbers or GO terms – the two pieces of information Pathway-Tools uses to predict pathways during database construction. Another possibility is that these genomes belong to obligate symbionts (either parasites or beneficial symbionts). Obligate symbionts often have highly streamlined genomes and few complete pathways. We can compare the number of pathways in each genome to other genome characteristics for additional clues.
A reasonable assumption is that the number of pathways in each genome should scale with the size of the genome. Large genomes with few predicted pathways might indicate places where the annotation isn’t compatible with the pathway prediction methods.## Plot the number of pathways as a function of genome size plot(rowSums(path) ~ data$genome_size, ylab = 'nPaths', xlab = 'Genome size') ## Plot P. ubique as a reference point select <- grep('Pelagibacter ubique HTCC1062', data$organism_name) points(rowSums(path)[select] ~ data$genome_size[select], pch = 19, col = 'red')
That looks pretty good. For the most part more metabolic pathways were predicted for larger genomes, however, there are some exceptions. The red point gives the location of Pelagibacter ubique HTCC1062. This marine bacterium is optimized for life under energy-limited conditions. Among its adaptations are a highly efficient and streamlined genome. In fact it has the smallest genome of any known free-living bacterium. All the points below it on the x-axis are obligate symbionts; these are dependent on their host for some of their metabolic needs. There are a few larger genomes that have very few (or even no) pathways predicted. These are the genomes with bad, or at least incompatible annotations (or particularly peculiar biochemistry).
The other genome parameters in paprica are the number of coding sequences identified (nCDS), the number of genetic elements (nge), the number of 16S rRNA gene copies (n16S), GC content (GC), and phi; a measure of genomic plasticity. We can make another plot to show the distribution of these parameters with respect to phylogeny.## Grab only the data columns we want data.select <- data[,c('n16S', 'nge', 'ncds', 'genome_size', 'phi', 'GC')] ## Make the units somewhat comparable on the same scale, a more ## careful approach would log-normalize some of the units first data.select.norm <- decostand(data.select, method = 'standardize') data.select.norm <- decostand(data.select.norm, method = 'range') ## Plot with a heatmap heat.col <- colorRampPalette(c('blue', 'white', 'red'))(100) heatmap(data.matrix(data.select.norm), margins = c(10, 20), col = heat.col, Rowv = NA, Colv = NA, scale = NULL, labRow = 'n', cexCol = 0.8)
Squinting at this plot it looks like GC content and phi are potentially negatively correlated, which could be quite interesting. These two parameters can be plotted to get a better view:plot(data.select$phi ~ data.select$GC, xlab = 'GC', ylab = 'phi')
Okay, not so much… but I think the pattern here is pretty interesting. Above a GC content of 50 % there appears to be no relationship, but these parameters do seem correlated for low GC genomes. This can be evaluated with linear models for genomes above and below 50 % GC.gc.phi.above50 <- lm(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)]) gc.phi.below50 <- lm(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)]) summary(gc.phi.above50) summary(gc.phi.below50) plot(data.select$phi ~ data.select$GC, xlab = 'GC', ylab = 'phi', type = 'n') points(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)], col = 'blue') points(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)], col = 'red') abline(gc.phi.above50, col = 'blue') abline(gc.phi.below50, col = 'red') legend('bottomleft', bg = 'white', legend = c('GC >= 50', 'GC < 50'), col = c('blue', 'red'), pch = 1)
As expected there is no correlation between genomic plasticity and GC content for the high GC genomes (R2 = 0) and a highly significant correlation for the low GC genomes (albeit with weak predictive power; R2 = 0.106, p = 0). So what’s going on here? Low GC content is associated with particular microbial lineages but also with certain ecologies. The free-living low-energy specialist P. ubique HTCC1062 has a low GC content genome for example, as do many obligate symbionts regardless of their taxonomy (I don’t recall if it is known why this is). Both groups are associated with a high degree of genomic modification, including genome streamlining and horizontal gene transfer.