I’m very excited to report that our latest paper – Microbial communities can be described by metabolic structure: A general framework and application to a seasonally variable, depth-stratified microbial community from the coastal West Antarctic Peninsula – was just published in the journal PLoS ONE. The paper builds on two very distinct bodies of work: a growing literature on microbial community structure and function along the climatically sensitive West Antarctic Peninsula, and a family of new techniques to predict community metabolic function from 16S rRNA gene libraries, which we are calling metabolic inference.
The motivation for metabolic inference lies in the large amount of time that it takes to manually curate a likely set of functions for even a small collection of 16S rRNA genes. In today’s world, where most analyses of microbial community structure consist of many thousands of reads representing hundreds of taxa, it is simply impossible to dig through the literature on each strain to see what metabolic role each is likely to be playing. Ideally a researcher would use metagenomics or metatranscriptomics to get at this information directly, but it is not advisable or desirable in most cases to sequence hundreds of metagenomes or metatranscriptomes (necessary for the kind of temporal or spatial resolution many of us want these days). Metabolic inference provides a convenient alternative.
The basic concept behind all metabolic inference techniques (e.g. PICRUSt, tax4fun, PAPRICA) is hidden state prediction (HSP) (you can find a nice paper on HSP here). In 16S rRNA gene analysis metabolic potential is a hidden state. The metabolic inference techniques propose different ways to predict this hidden state based on the information available.
Our small contribution to this effort was to develop a method (PAPRICA – PAthway PRediction by phylogenetIC plAcement) that uses phylogenetic placement to conduct the metabolic inference instead of an OTU (operational taxonomic unit) based approach. Our approach provides a more intuitive connection between the 16S rRNA analysis and the HSP (or at least it does in my mind) and can increase the accuracy of the inference for taxa that have a lot of sequenced genomes.
Most analyses of large 16S rRNA datasets rely on an OTU based approach. In a typical OTU analysis an investigator aligns 16S rRNA reads, constructs a distance matrix of the alignments, and clusters the reads at some predetermined distance. By tradition the default distance has become a dissimilarity of 0.03. This approach has some advantages. By clustering reads into discrete units it is easy to quantify the presence or absence of different OTUs, and it allows microbial ecologists to avoid problems with defining prokaryotic species (which defy most of the criteria used to define species in more complex organisms). To conduct a metabolic inference on an OTU based analysis it is possible to simply reconstruct the likely metabolism for a predefined set of OTUs based on the OTU assignments of published genomes. This works great, but it limits the resolution of the inference to the selected OTU definition (i.e. 0.03). For some taxa, such as Escherichia coli (and plenty of more interesting environmental bugs), there are many sequenced genomes that have very similar 16S rRNA gene sequences. PAPRICA provides a way to improve the resolution of the metabolic inference for these taxa.
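To make the OTU idea concrete, here is a minimal sketch of clustering aligned reads at a 0.03 dissimilarity cutoff. It is a deliberately simplified greedy scheme, not the algorithm used by any particular OTU-picking tool, and the function names and toy reads are made up for illustration.

```python
def dissimilarity(a, b):
    """Fraction of mismatched positions between two aligned, equal-length reads."""
    if len(a) != len(b):
        raise ValueError("reads must be aligned to the same length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def cluster_otus(reads, cutoff=0.03):
    """Greedy clustering: a read joins the first OTU whose seed read is
    within the cutoff; otherwise it becomes the seed of a new OTU."""
    otus = []  # list of (seed_read, member_reads)
    for read in reads:
        for seed, members in otus:
            if dissimilarity(read, seed) <= cutoff:
                members.append(read)
                break
        else:
            otus.append((read, [read]))
    return otus


# Toy aligned reads, 100 positions each (hypothetical data).
seed_read = "ACGT" * 25
near_read = "TT" + seed_read[2:]        # 2 mismatches out of 100
far_read = "T" * 10 + seed_read[10:]    # 8 mismatches out of 100

otus = cluster_otus([seed_read, near_read, far_read])
for seed, members in otus:
    print(len(members), "read(s) in OTU seeded by", seed[:12] + "...")
```

The first two reads collapse into a single OTU even though they differ, which is exactly the loss of resolution described above: everything inside the cutoff gets the same inferred metabolism.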
Our approach was to build a phylogenetic tree of the 16S rRNA genes from each completed genome. For each internal node on the reference tree we determine a “consensus genome”, defined as all genes shared by all members of the clade originating from the node, and predict the metabolic pathways present in the consensus and complete genomes using Pathway-Tools. To conduct the actual analysis we use pplacer to place our query reads on the reference tree and assign the metabolic pathways for each point of placement to the query reads. One advantage to this approach is that the resolution changes depending on the genome sequence coverage of the reference tree. For families, genera, and even species for which lots of genomes have been sequenced, resolution is high. For regions of the tree where there are not many sequenced genomes resolution is poor, but the method will still give you the best of what’s available.
PAPRICA provides some additional helpful pieces of information. We built in a confidence scoring metric that takes into account both predicted genomic plasticity and the size of the consensus genome relative to the mean size for the clade (deeper branching clades will have a bigger difference). PAPRICA also predicts the genome size and the number of 16S rRNA gene copies associated with each 16S rRNA gene, both of which have a strong connection to the ecological role of a bacterium.
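The consensus genome idea can be sketched as a set intersection over the tips of each clade. This is an illustration under simplifying assumptions, not PAPRICA's actual code: the tree is a nested tuple, each genome is just a set of gene names, and all names are hypothetical. In the real workflow the placement step is handled by pplacer and pathway prediction by Pathway-Tools.

```python
def consensus_genomes(tree, genomes):
    """Map each clade (a frozenset of tip names) to the set of genes
    shared by every genome in that clade."""
    consensus = {}

    def visit(node):
        if isinstance(node, str):
            # Tip: a completed genome, which is its own "consensus".
            tips = frozenset([node])
            genes = set(genomes[node])
        else:
            # Internal node: intersect the consensus of all child clades.
            results = [visit(child) for child in node]
            tips = frozenset().union(*(t for t, _ in results))
            genes = set.intersection(*(g for _, g in results))
        consensus[tips] = genes
        return tips, genes

    visit(tree)
    return consensus


# Hypothetical reference genomes and tree: ((A, B), C).
genomes = {
    "genome_A": {"gene1", "gene2", "gene3"},
    "genome_B": {"gene1", "gene2", "gene4"},
    "genome_C": {"gene1", "gene5"},
}
tree = (("genome_A", "genome_B"), "genome_C")

consensus = consensus_genomes(tree, genomes)
shallow = consensus[frozenset({"genome_A", "genome_B"})]
root = consensus[frozenset(genomes)]
print(sorted(shallow))  # genes shared by A and B
print(sorted(root))     # genes shared by all three genomes
```

A read placed near the tips inherits a large, specific gene set, while a read placed at a deep node inherits only the small set common to the whole clade, which is why resolution tracks how densely a region of the tree has been sequenced.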
For our initial application of PAPRICA we selected a previously published 16S rRNA gene sequence dataset from the West Antarctic Peninsula (our primary region of interest). One thing that we were very interested in looking at was whether we could describe differences between microbial communities organized along ecological gradients (e.g. inshore vs. offshore, or surface vs. deep water) in terms of metabolic structure in place of the more traditional 16S rRNA gene (i.e. taxonomic) structure. Using PAPRICA to convert the 16S rRNA gene sequences into collections of metabolic pathways we found that we could reconstruct the same inter-sample relationships identified by an analysis of taxonomic structure. This means that a microbial ecologist can, if they choose, disregard the messy and sometimes uninformative taxonomic structure data and go directly to metabolic structure without losing information. Applying common multivariate statistical approaches (PCA, MDS, etc.) to metabolic structure data yields information like which pathways are driving the variance between sites, and which are correlated with what environmental parameters. This information is much more relevant to most research questions than the distribution of different microbial taxa. It is worth noting that while inter-sample relationships are well preserved in metabolic structure, the absolute distance between samples is much less than for taxonomic structure. This might have some implications for the functional resilience of microbial communities, which we get into a little bit in the paper.
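As a toy illustration of comparing samples by metabolic structure instead of taxonomic structure, the sketch below converts taxon counts into pathway counts and computes a Bray-Curtis dissimilarity in both spaces. The taxa, pathways, and counts are invented, and Bray-Curtis is my choice for the example; it is not necessarily the distance measure used in the paper.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two {feature: count} samples."""
    keys = set(u) | set(v)
    numerator = sum(abs(u.get(k, 0) - v.get(k, 0)) for k in keys)
    denominator = sum(u.get(k, 0) + v.get(k, 0) for k in keys)
    return numerator / denominator if denominator else 0.0


def metabolic_structure(taxon_counts, taxon_pathways):
    """Sum taxon counts over the pathways inferred for each taxon."""
    pathway_counts = {}
    for taxon, count in taxon_counts.items():
        for pathway in taxon_pathways[taxon]:
            pathway_counts[pathway] = pathway_counts.get(pathway, 0) + count
    return pathway_counts


# Hypothetical inferred pathways per taxon.
taxon_pathways = {
    "taxon_1": {"pwy_a", "pwy_b"},
    "taxon_2": {"pwy_a", "pwy_c"},
    "taxon_3": {"pwy_a", "pwy_b", "pwy_d"},
}

# Hypothetical samples: taxon_1 in one is replaced by taxon_3 in the other.
surface = {"taxon_1": 10, "taxon_2": 5}
deep = {"taxon_3": 10, "taxon_2": 5}

taxonomic_distance = bray_curtis(surface, deep)
metabolic_distance = bray_curtis(
    metabolic_structure(surface, taxon_pathways),
    metabolic_structure(deep, taxon_pathways),
)
print("taxonomic:", round(taxonomic_distance, 3))
print("metabolic:", round(metabolic_distance, 3))
```

Because the two swapped taxa share most of their pathways, the samples sit much closer together in pathway space than in taxon space. The toy data were built to show that effect, so treat it as a picture of the mechanism and not as evidence for it.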
PAPRICA was an outgrowth of a couple of other papers that I’m working on. At some point the bioinformatic methods reached a point where separate publication was justified. As a result, and reflecting the fact that I’m much more an ecologist than a computational biologist, PAPRICA is not nearly as streamlined as PICRUSt (which is even available through an online interface). I’ve spent quite a bit of time, however, trying to make the scripts user friendly and transportable. Anyone should be able to get them to work without too much difficulty. If you decide to give PAPRICA a try and run into any hitches please let me know, either by posting an issue on GitHub or emailing me directly! Suggestions for improvement are also welcome.
HUGE THANKS to all the volunteers who worked so hard to make this project such a great success. It was a pleasure working with you and getting to know you all. Also mega thanks to all the landowners who were kind enough, and trusting enough, to let us put a source on their property. None of this could have happened without your generosity and spirit of curiosity. Thanks so much.
Ashley Nauer and Kent Anderson wire up a shot.
Donna Shillington, LDEO
The shot team filled in this hole the next day.
Armadillo patrols one of the shot sites.
“21757. Still kickin”
“We’re not coming back unless we have all of them!”
“We had a helper at site 20431!”
“Hello Donna Rach and I are crushing it right now”
“Daily check in, we’re making good time so we should see the puppies soon enough”
“Recovered a Texan at stop 20858. This one doesn’t seem to be working correctly, whenever I press it it just tells me things like “The Cowboys are America’s team” and “Bush was an American hero”. Weird.”
“Stop 20804. Everything’s fine, except some guy came out of the woods and bit Brent. All he’s saying now is “brains” and is acting super creepy. I’ll keep an eye on it and only use the shovel if necessary”
“Just beat the downpour and headed for base”
“Stop 20879. Found the Texan disconnected from the geophone on top of where we buried it with pieces of bag around it, looked everywhere for the geophone. Found it about 5 m down the hill near the tree line with bite marks all along it. Either an animal dug it up or a very hungry confused thief”
“Team4 is Done! I repeat again, 4 is done! Heading back to the sweet onion city! ☺”
“Team gruesome twosome on our way back to the hub”
“We are gonna skip installing 21520 because both sides of the streets are well maintained yards and there’s not a great place to put a Texan”
“We’re done! Just kidding haha. We’re on our second!”
“All geophones buried --- I am beat. Where’s a can of spinach when ya need one, lol”
“It's a long way to the top if you want to study rocks”
"Sunrise at station 21779"
“We’re dirty but doing well!”
“Still digging. Still have not reached China. Will attempt again on next hole”
“We just deployed our last station, 20224. Can we go to Jekyll Island?”
Donna Shillington, LDEO
Natalie Accardo - Columbia University, LDEO
The SUGAR2 deployment team hails from all across the United States
covering more than 15 states and 21 different universities/institutions.
the science and overview lecture.
near our hotel in Vidalia, Georgia.
Students and PASSCAL personnel take over the instrument center
filling 2,000 Texans with D-cell batteries.
The "battery party" comes to an end as the last Texans are filled and
the boxes are rearranged for easy late-night programming by the PASSCAL team.
Freshly delivered pallets of boxes holding all the science equipment
The PASSCAL team re-arranged the boxes into a T for their own devious reasons :)
The trusty Silverado loaded down with 2000 pounds of batteries! (Dan for scale).
Natalie Accardo - LDEO
Steve Harder, Dan Lizarralde, Ashley Nauer, and Galen Kaip
Galen Kaip prepares the source charges (white tubes) on the truck bed as
the drillers complete a shot hole.
The source team carefully lowers the prepared seismic charges into the complete shot hole.
Ashley Nauer (red hat) stands waiting with shovel in hand to fill the remaining height of
the hole with sand and gravel.
The drill team monitors the process of spudding, the very first stage of drilling the
shot hole, for SUGAR line 2.
The source team and drill team push on late into the night to ensure the completion of the
final shot for the entire SUGAR experiment.
Map of SUGAR lines, showing two possible locations of the ancient suture (red dotted lines)
Donna Shillington, LDEO