Chasing Microbes in Antarctica

Subscribe to Chasing Microbes in Antarctica feed
Bowman Lab: Marine microbial ecology
Updated: 48 min 7 sec ago

Microbial community segmentation with R

Thu, 02/02/2017 - 18:39

In my previous post I discussed our recent paper in ISME J, in which we used community structure and flow cytometry data to predict bacterial production.  The insinuation is that if you know community structure, and have the right measure of physiology, you can make a decent prediction of any microbial ecosystem function.  The challenge is that community structure data, which often has hundreds or thousands of dimensions (taxa, OTUs, etc.), is not easily used in straightforward statistical models.   Our workaround is to reduce the community structure data from many dimensions to a single categorical variable represented by a number.  We call this process segmentation.

You could carry out this dimension reduction with pretty much any clustering algorithm; you’re simply grouping samples with like community structure characteristics on the assumption that like communities will have similar ecosystem functions.  We  use the emergent self organizing map (ESOM), a neural network algorithm, because it allows new data to be classified into an existing ESOM.  For example, imagine that you are collecting a continuous time series of microbial community structure data.  You build an ESOM to segment your first few years of data, subsequent samples can be quickly classified into the existing model.  Thus the taxonomic structure, physiological, and ecological characteristics of the segments are stable over time.  There are other benefits to use an ESOM.  One is that with many samples (far more than we had in our study), the ESOM is capable of resolving patterns that many other clustering techniques cannot.

There are many ways to construct an ESOM.  I haven’t tried a Python-based approach, although I’m keen to explore those methods.  For the ISME J paper I used the Kohonen package in R, which has a nice publication that describes some applications and is otherwise reasonably well documented.  To follow this tutorial you can download our abundance table here.  Much of the inspiration, and some of the code for this analysis, follows the (retail) customer segmentation example given here.

For this tutorial you can download a table of the closest estimated genomes and closest completed genomes (analogous to an abundance table) here.  Assuming you’ve downloaded the data into your working directory, fire up Kohonen and build the ESOM.

## Kohonen needs a numeric matrix edge.norm <- as.matrix(read.csv('community_structure.csv', row.names = 1)) ## Load the library library('kohonen') ## Define a grid. The bigger the better, but you want many fewer units in the grid ## than samples. 1:5 is a good ballpark, here we are minimal. som.grid <- somgrid(xdim = 5, ydim=5, topo="hexagonal") ## Now build the ESOM! It is worth playing with the parameters, though in ## most cases you will want the circular neighborhood and toroidal map structure. som.model.edges <- som(edge.norm,                  grid = som.grid,                  rlen = 100,                  alpha = c(0.05,0.01),         = TRUE,                  n.hood = "circular",                  toroidal = T)

Congratulations!  You’ve just constructed your first ESOM.  Pretty easy.  You’ve effectively clustered the samples into the 25 units that define the ESOM.  You can visualize this as such:

plot(som.model.edges, type = 'mapping', pch = 19)

There are the 25 map units, with the toroid split and flattened into 2D.  Each point is a sample (row in the abundance table), positioned in the unit that best reflects its community structure.  I’m not going to go into any depth on the ESOM algorithm, which is quite elegant, but the version implemented in the Kohonen package is based on Euclidean distance.  How well each map unit represents the samples positioned within it is represented by the distance between the map unit and each sample.  This can be visualized with:

plot(som.model.edges, type = 'quality', pch = 19, = topo.colors)

Units with shorter distances in the plot above are better defined by the samples in those units than units with long distances.  What distance is good enough depends on your data and objectives.

The next piece is trickier because there’s a bit of an art to it.  At this point each sample has been assigned to one of the 25 units in the map.  In theory we could call each map unit a “segment” and stop here.  It’s beneficial however, to do an additional round of clustering on the map units themselves.  Particularly on large maps (which clearly this is not) this will highlight major structural features in the data.  Both k-means and hierarchical clustering work fairly well, anecdotally k-means seems to work better with smaller maps and hierarchical with larger maps, but you should evaluate for your data.  Here we’ll use k-means.  K-means requires that you specify the number of clusters in advance, which is always a fun chicken and egg problem.  To solve it we use the within-clusters sum of squares method:

wss.edges <- (nrow(som.model.edges$codes)-1)*sum(apply(som.model.edges$codes,2,var)) for (i in 2:15) {   wss.edges[i] <- sum(kmeans(som.model.edges$codes, centers=i)$withinss) } plot(wss.edges,      pch = 19,      ylab = 'Within-clusters sum of squares',      xlab = 'K')

Here’s where the art comes in.  Squint at the plot and try to decide the inflection point.  I’d call it 8, but you should experiment with whatever number you pick to see if it makes sense downstream.

We can make another plot of the map showing which map units belong to which clusters:

k <- 8 som.cluster.edges <- kmeans(som.model.edges$codes, centers = k) plot(som.model.edges,      main = '',      type = "property",      property = som.cluster.edges$cluster, = topo.colors) add.cluster.boundaries(som.model.edges, som.cluster.edges$cluster)

Remember that the real shape of this map is a toroid and not a square.  The colors represent the final “community segmentation”; the samples belong to map units, and the units belong to clusters.  In our paper we termed these clusters “modes” to highlight the fact that there are real ecological properties associated with them, and that (unlike clusters) they support classification.  To get the mode of each sample we need to index the sample-unit assignments against the unit-cluster assignments.  It’s a little weird until you get your head wrapped around it:

som.cluster.edges$cluster[som.model.edges$unit.classif] [1] 5 7 7 5 2 7 5 3 7 5 2 6 1 1 1 7 5 4 7 7 5 7 7 7 7 7 7 1 4 4 4 4 7 7 7 6 6 6 6 1 1 1 7 5 5 5 1 1 1 5 5 7 7 4 8 7 7 4 7 8 [61] 7 7 7 7 6 5 6 7 7 7 6 4 6 5 4 4 6 2 1 1 1 1 1 4 1 4 4 4

A really important thing to appreciate about these modes is that they are not ordered or continuous.  Mode 4 doesn’t necessarily have more in common with mode 5 say, than with mode 1.  For this reason it is important to treat the modes as factors in any downstream analysis (e.g. in linear modeling).  For our analysis I had a dataframe with bacterial production, chlorophyll concentration, and bacterial abundance, and predicted genomic parameters from paprica.  By saving the mode data as a new variable in the dataframe, and converting the dataframe to a zoo timeseries, it was possible to visualize the occurrence of modes, model the data, and test the pattern of modes for evidence of succession.  Happy segmenting!


Always more

New paper published in ISME Journal

Fri, 01/20/2017 - 19:36

I’m happy to report that a paper I wrote during my postdoc at the Lamont-Doherty Earth Observatory was published online today in the ISME Journal.  The paper, Bacterial community segmentation facilitates the prediction of ecosystem function along the coast of the western Antarctic Peninsula, uses a novel technique to “segment” the microbial community present in many different samples into a few groups (“modes”) that have specific functional, ecological, and genomic attributes.  The inspiration for this came when I stumbled across this blog entry on an approach used in marketing analytics.  Imagine that a retailer has a large pool of customers that it would like to pester with ads tailored to purchasing habits.  It’s too cumbersome to develop an individualized ad based on each customer’s habits, and it isn’t clear what combination of purchasing-habit parameters accurately describe meaningful customer groups.  Machine learning techniques, in this case emergent self-organizing maps (ESOMs), can be used to sort the customers in a way that optimizes their similarity and limits the risk of overtraining the model (including parameters that don’t improve the model).

In a 2D representation of an ESOM, the customers most like one another will be organized in geographically coherent regions of the map.  Hierarchical or k-means clustering can be superimposed on the map to clarify the boundaries between these regions, which in this case might represent customers that will respond similarly to a targeted ad.  But what’s really cool about this whole approach is that, unlike with NMDS or PCA or other multivariate techniques based on ordination, new customers can be efficiently classified into the existing groups.  There’s no need to rebuild the model unless a new type of customer comes along, and it is easy to identify when this occurs.

Back to microbial ecology.  Imagine that you have a lot of samples (in our case a five year time series), and that you’ve described community structure for these samples with 16S rRNA gene amplicon sequencing.  For each sample you have a table of OTUs, or in our case closest completed genomes and closest estimated genomes (CCGs and CEGs) determined with paprica.  You know that variations in community structure have a big impact on an ecosystem function (e.g. respiration, or nitrogen fixation), but how to test the correlation?  There are statistical methods in ecology that get at this, but they are often difficult to interpret.  What if community structure could be represented as a simple value suitable for regression models?

Enter microbial community segmentation.  Following the customer segmentation approach described above, the samples can be segmented into modes based on community structure with an Emergent Self Organizing Map and k-means clustering.  Here’s what this looks like in practice:

From Bowman et al. 2016.  Segmentation of samples based on bacterial community structure.  C-I show the relative abundance of CEGs and CCGs in each map unit.  This value was determined iteratively while the map was trained, and reflects the values for samples located in each unit (B).

This segmentation reduces the data for each sample from many dimensions (the number of CCG and CEG present in each samples) to 1.  This remaining dimension is a categorical variable with real ecological meaning that can be used in linear models.  For example, each mode has certain genomic characteristics:

From Bowman et al. 2016.  Genomic characteristics of modes (a and b), and metabolic pathways associated with taxa that account for most of the variations in composition between modes (d).

In panel a above we see that samples belonging to modes 5 and 7 (dominated by the CEG Rhodobacteraceae and CCG Dokdonia MED134, see Fig. 2 above) have the greatest average number of 16S rRNA gene copies.  Because this is a characteristic of fast growing, copiotrophic bacteria, we might also associate these modes with high levels of bacterial production.

Because the modes are categorical variables we can insert them right into linear models to predict ecosystem functions, such as bacterial production.  Combined with bacterial abundance and a measure of high vs. low nucleic acid bacteria, mode accounted for 76 % of the variance in bacterial production for our samples.  That’s a strong correlation for environmental data.  What this means in practice is; if you know the mode, and you have some flow cytometry data, you can make a pretty good estimate of carbon assimilation by the bacterial community.

For more on what you can do with modes (such as testing for community succession) check out the article!  I’ll post a tutorial on how to segment microbial community structure data into modes using R in a separate post.  It’s easier than you think…

paprica v0.4.0

Sun, 01/08/2017 - 15:04

I’m happy to announce the release of paprica v0.4.0.  This release adds a number of new features to our pipeline for evaluating microbial community and metabolic structure.  These include:

  • NCBI taxonomy information for each point of placement on the reference tree, including internal nodes.
  • Inclusion of the domain Eukarya.  This was a bit tricky and requires some further explanation.

The distribution of metabolic pathways, predicted during the creation of the paprica Eukarya database, across transcriptomes in the MMETSP.

Eukaryotic genomes are a totally different beast than their archaeal and bacterial counterparts.  First and foremost they are massive.  Because of these there aren’t very many completed eukaryotic genomes out there, particularly for singled celled eukaryotes.  While a single investigator can now sequence, assemble, and annotate a bacterial or archaeal genome in very little time, eukaryotic genomes still require major efforts by consortia and lots of $$.

One way to get around this scale problem is to focus on eukaryotic transcriptomes instead of genomes.  Because much of the eukaryotic genome is noncoding this greatly reduces sequencing volume.  Since there is no such thing as a contiguous transcriptome, this approach also implies that no assembly (beyond open reading frames) will be attempted.  The Moore Foundation-funded Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) was an initial effort to use this approach to address the problem of unknown eukaryotic genetic diversity.  The MMETSP sequenced transcriptomes from several hundred different strains.  The taxonomic breadth of the strains sequenced is pretty good, even if (predictably) the taxonomic resolution is not.  Thus, as for archaea, the phylogenetic tree and metabolic inferences should be treated with caution.  For eukaryotes there are the additional caveats that 1) not all genes coded in a genome will be represented in the transcriptome 2) the database contains only strains from the marine environment and 3) eukaryotic 18S trees are kind of messy.  Considerable effort went into making a decent tree, but you’ve been warned.

Because the underlying data is in a different format, not all genome parameters are calculated for the eukaryotes.  18S gene copy number is not determined (and thus community and metabolic structure are not normalized), the phi parameter, GC content, etc. are also not calculated.  However, eukaryotic community structure is evaluated and metabolic structure inferred in the same way as for the domains bacteria and archaea:

./ test.eukarya eukarya

As always you can install paprica v0.4.0 by following the instructions here, or you can use the virtual box or Amazon Web Service machine instance.

paprica on the cloud

Fri, 11/25/2016 - 22:15

This is a quick post to announce that paprica, our pipeline to evaluate community structure and conduct metabolic inference, is now available as an Amazon Machine Instance (AMI) on the cloud.  The AMI comes with all dependencies required to execute the script pre-installed.  If you want to use it for you’ll have to install pathway-tools and a few additional dependencies.  I’m new to the Amazon EC2 environment, so please let me know if you have any issues using the AMI.

If you are new to Amazon Web Services (AWS) the basic way this works is:

  • Sign up for Amazon EC2 using your normal Amazon log-in
  • From the AWS console, make sure that your region is N. Virginia (community AMI’s are only available in the region they were created in)
  • From your EC2 dashboard, scroll down to “Create Instance” and click “Launch Instance”
  • Now select the “Community AMIs”
  • Search for paprica-ec2, then “Select”
  • Choose the type of instance you would like to run the AMI on.  This is the real power of AWS; you can tailor the instance to the analysis you would like to run.  For testing choose the free t2.micro instance.  This is sufficient to execute the test files or run a small analysis (hundreds of reads).  To use paprica’s parallel features select an instance with the desired number of cores and sufficient memory.
  • Click “Review and Launch”, and finish setting up the instance as appropriate.
  • Log onto the instance, navigate to the paprica directory, execute the test file(s) as described in the paprica tutorial.

Antarctic Long Term Ecological Research

Thu, 10/13/2016 - 12:01

The Palmer and McMurdo Long Term Ecological Research (LTER) projects; separate but equal… at least in terms of interesting ecosystem dynamics if not in terms of biomass!

I’m very excited that our manuscript “Microbial community dynamics in two polar extremes: The lakes of the McMurdo Dry Valleys and the West Antarctic Peninsula Marine Ecosystem” has been published as an overview article in the journal BioScience.  The article belongs to a special issue comparing different ecological aspects of the two NSF-funded Long Term Ecological Research (LTER) sites in Antarctica.  I’m actually writing this post on my return trip from the first ever meeting of the International Long Term Ecological Research (ILTER) network at Kruger National Park in South Africa (an excellent place to ponder ecological questions).

This article had an odd genesis; the special issue was conceived by John Priscu, a PI with the McMurdo LTER project.  I was ensnared in the project along with Trista Vick-Majors, a graduate student with John Priscu (now a postdoctoral scholar at McGill University), shortly after starting my postdoc with Hugh Ducklow, PI on the Palmer LTER project.  The guidance we received was more or less “compare the McMurdo and Palmer LTERs”.  How exactly we should compare perennially ice-covered lakes in a polar desert to one of the richest marine ecosystems on the planet was left up to us.  Fortunately, microbial ecology lends itself to highly reductionist thinking.   This isn’t always helpful, but we reasoned that on a basal level the two ecosystems must function more or less the same.  Despite dramatically different physical settings, both environments host communities of phytoplankton (sometimes even similar taxonomic groups).  These convert solar energy into chemical energy and CO2 into organic carbon, thereby supporting communities of heterotrophic bacteria and grazers.

To look at the details of this we stretched the bounds of what constitutes an “overview article” and aggregated nearly two decades of primary production and bacterial production data collected by the McMurdo LTER, and over a decade of the same from the Palmer LTER.  By looking at the ratio of bacterial production to primary production we assessed how much carbon the heterotrophic bacterial community takes up relative to how much the phytoplankton community produces.

Some stuff

Figure from Bowman et al., 2016, BioScience.  A) Depth-integrated bacterial (BP) and primary production (PP) for the Palmer LTER study area and several lakes in Taylor Valley.  B)  The region occupied by the mean and standard deviation for discrete points (too many points to show).  C) The distribution of BP:PP for each site.

Typical marine values for this ratio are 1:10.  At a value of around 1:5 the carbon demands of heterotrophic bacteria are probably not met by phytoplankton production (the majority of carbon taken up by bacteria is lost through respiration and is not accounted for in the bacterial production assay).  Most of the lakes hover around 1:5, with values above this fairly common.  Lake Fryxell however, an odd lake at the foot of Canada Glacier, has values that often exceed 1:1!  Consistent with previous work on the lakes such high rates of bacterial production (relative to primary production) can only be met by a large external carbon subsidy.

Where does this external carbon come from?  Each summer the McMurdo Dry Valleys warm up enough that the various glaciers at the valley peripheries begin to melt.  This meltwater fuels chemoautotrophic bacterial communities where the glacier meets rock (the subglacial environment), and microbial mats in various streams and melt ponds.  Like microbial communities everywhere these bleed a certain amount of dissolved carbon (and particulate; DOC and POC) into the surrounding water.  Some of this carbon ends up in the lakes where it enhances bacterial production.

But external carbon subsidies aren’t the only part of the story.  Nutrients, namely phosphate and nitrate, are washed into the lakes as well.  During big melt years (such as the summer of 2001-2002 when a major positive SAM coupled to an El Nino caused unusually high temperatures) the lakes receives big pulses of relatively labile carbon but also inorganic nutrients and silt.  This odd combination has the effect of suppressing primary production in the near term through lowered light levels (all that silt), enhancing it in the long term (all those nutrients), and giving heterotrophic bacteria some high quality external carbon to feed on during the period that primary production is suppressed.  Or at least that’s how we read it.

Not a lake person?  How do things work over in the Palmer LTER?  One of the biggest ecological differences between Palmer and McMurdo is that the former has grazers (e.g. copepods, salps, and krill) and the latter does not, or at least not so many to speak off.  Thus an argument can be made that carbon dynamics at Palmer are driven (at least partially) by top-down controls (i.e. grazers), while at McMurdo they are dependent almost exclusively on bottom-up (i.e. chemical and physical) controls.

At times the difference between bacterial production and primary production is pretty extreme at Palmer.  In the summer of 2006 for example, bacterial production was only 3 % of primary production (see Fig. 4 in the publication), and the rate of primary production that summer was pretty high.  The krill population was also pretty high that year; at the top of their 4-year abundance cycle (see Saba et al. 2014, Nature Communications).  This is speculative, but I posit that bacterial production was low in part because a large amount of carbon was being transferred via krill to the higher trophic levels and away from bacteria.  This is a complicated scenario because krill can be good for bacteria; sloppy feeding produces DOC and krill excrete large amounts of reduced nitrogen and DOC.  Krill also build biomass and respire however, and their large fecal pellets sink quickly, these could be significant losses of carbon from the photic zone.

Antarctica is changing fast and in ways that are difficult to predict.  Sea ice seems to be growing in the east Antarctic as it is lost from the west Antarctic, and anomalous years buck this trend in both regions.  A major motivation for this special issue was to explore how the changing environment might drive ecological change.  I have to say that after spending a good portion of the (boreal) summer and fall thinking about this, some of that time from the vantage point of Palmer Station, I have no idea.  All of the McMurdo Lakes react differently to anomalous years, and Palmer as a region seems to react differently to each of abnormal year.  I think the krill story is an important concept to keep in mind here; ecological responses are like superimposed waveforms.  Picture a regularly occurring phenomenon like the El-Nino Southern Oscillation imposing a periodicity on sea ice cover, which we know has a strong influence on biology.  Add a few more oscillating waves from other physical processes.  Now start to add biological oscillations like the four-year krill abundance cycle.  Can we deconvolute this mess to find a signal?  Can we forecast it forward?  Certainly not with 10 years of data at one site and 20 years at the other (and we’re so proud of these efforts!).  Check back next century… if NSF funds these sites that long…

Many thanks to my co-authors for going the distance on this paper, particularly the lake people for many stimulating arguments.  I think limnology and oceanography are, conceptually, much less similar than lakes and oceans.

Creating a landmask for the West Antarctic Peninsula in R

Sun, 09/04/2016 - 08:50

Silicate concentration in µM at 1m depth during the 2014 Palmer LTER cruise.  This plot is lying to you.  The interpolations extend past islets and into landmasses.

This is going to be a pretty niche topic, but probably useful for someone out there.  Lately I’ve been working with a lot of geospatial data for the West Antarctic Peninsula.  One of the things that I needed to do was krig the data (krigging is a form of 2D interpolation, I’m using the pracma library for this).  Krigging is a problem near coastlines because it assumes a contiguous space to work in.  If there happens to be an island or other landmass in the way there is no way to represent the resulting discontinuity in whatever parameter I’m looking at.  Because of this I needed to find a way to mask the output.  This doesn’t really solve the problem, but at least it allows me to identify areas of concern (for example interpolation that extends across an isthmus, if there are sample points only on one side.

I’m krigging and building maps entirely inside R, which has somewhat immature packages for dealing with geospatial data.  The easiest masking solution would be to use filled polygons from any polygon format shapefile that accurately represents the coastline.  Unfortunately I couldn’t find an R package that does this correctly with the shapefiles that I have access too.  In addition, because of my downstream analysis it was better to mask the data itself, and not just block out landmasses in the graphical output.

Sharon Stammerjohn at the NSIDC pointed me to the excellent Bathymetry and Global Relief dataset produced by NOAA.  This wasn’t a complete solution to the problem but it got me moving in the right direction.  From the custom grid extractor at I selected a ETOPO1 (bedrock) grid along the WAP, with xyz as the output format.  If you’re struggling with shapefiles the xyz format is like a cool drink of water, being a 3-column matrix of longitude, latitude, and height (or depth).  For the purpose of creating the mask I considered landmass as any lat-long combination with height > 0.

There is one more really, really big twist to what I was trying to do, however.  The Palmer LTER uses a custom 1 km pixel grid instead of latitude-longitude.  It’s a little easier to conceptualize than lat-long given the large longitude distortions at high latitude (and the inconvenient regional convergence of lat-long values on similar negative numbers).  It is also a little more ecologically relevant, being organized parallel to the coastline instead of north to south.  Unfortunately this makes the grid completely incompatible with other Euclidean reference systems such as UTM.  So before I could use my xyz file to construct a land mask I needed to convert it to the line-station grid system used by the Palmer LTER.  If you’re working in lat-long space you can skip over this part.


The Palmer LTER grid provides a convenient geospatial reference for the study area, but converting between line (y) and station (x) coordinates and latitude-longitude is non-trivial.

Many moons ago someone wrote a Matlab script to convert lat-long to line-station which you can find here.  Unfortunately I’m not a Matlab user, nor am I inclined to become one.  Fortunately it was pretty straightforward to copy-paste the code into R and fix the syntatic differences between the two languages.  Three functions in total are required:

## AUTHORS OF ORIGINAL MATLAB SCRIPT: #   Richard A. Iannuzzi #   Lamont-Doherty Earth Observatory # #   based on: LTERGRID program written by Kirk Waters (NOAA Coastal Services Center), February 1997 ## some functions that are used by the main function SetStation <- function(e, n, CENTEREAST, CENTERNORTH, ANGLE){   uu = e - CENTEREAST   vv = n - CENTERNORTH   z1 = cos(ANGLE)   z2 = sin(ANGLE)   NorthKm = (z1 * uu - z2 *vv) / 1000 + 600   EastKm = (z2 * uu + z1 * vv) / 1000 + 40     return(c(NorthKm, EastKm)) } CentralMeridian <- function(iz){   if(abs(iz) > 30){     iutz = abs(iz) - 30     cm = ((iutz * 6.0) -3.0) * -3600   }   else{     iutz = 30 - abs(iz)     cm = ((iutz * 6.0) +3.0) * +3600   }   return(cm) } GeoToUTM <- function(lat, lon, zone){   axis = c(6378206.4,6378249.145,6377397.155,           6378157.5,6378388.,6378135.,6377276.3452,           6378145.,6378137.,6377563.396,6377304.063,           6377341.89,6376896.0,6378155.0,6378160.,           6378245.,6378270.,6378166.,6378150.)     bxis = c(6356583.8,6356514.86955,6356078.96284,           6356772.2,6356911.94613,6356750.519915,6356075.4133,           6356759.769356,6356752.31414,6356256.91,6356103.039,           6356036.143,6355834.8467,6356773.3205,6356774.719,           6356863.0188,6356794.343479,6356784.283666,           6356768.337303)     ak0 = 0.9996   radsec = 206264.8062470964     sphere = 9     a = axis[sphere - 1]             # major axis size   b = bxis[sphere - 1]             # minior axis size   es = ((1-b^2/a^2)^(1/2))^2      # eccentricity squared   slat = lat * 3600                # latitude in seconds   slon = -lon * 3600               # longitude in seconds   cm = 0                           # central meridian in sec   iutz = 0     cm = CentralMeridian(zone)       # call the function     phi = slat/radsec   dlam = -(slon - cm)/radsec   epri = es/(1.0 - es)   en = a/sqrt(1.0 - es * sin(phi)^2)   t = tan(phi)^2   c = epri * cos(phi)^2   aa = dlam * cos(phi)   s2 = sin(2.0 * phi)   s4 = sin(4.0 * phi)   s6 = sin(6.0 * phi)   f1 = (1.0 - (es/4.)-(3.0*es*es/64.)-(5.0*es*es*es/256))   f2 = ((3*es/8)+(3.0*es*es/32)+(45*es*es*es/1024))   f3 = ((15*es*es/256)+(45*es*es*es/1024))   f4 = (35*es*es*es/3072)   em = a*(f1*phi - f2*s2 + f3*s4 - f4*s6)   xx = ak0 * en * (aa + (1.-t+c) * aa^3/6 + (5 - 18*t + t*t + 72*c-58*epri)* aa^5/120) + 5e5   yy = ak0 * (em + en * tan(phi) *((aa*aa/2) + (5-t+9*c+4*c*c)*aa^4/24 + (61-58*t +t*t +600*c - 330*epri)* aa^6/720))     if(zone < 0 | slat < 0){     yy = yy + 1e7   }     return(c(xx, yy)) } ## This function actually works with your data ll2gridLter <- function(inlat, inlon){   NorthKm = 0           # initialize   EastKm = 0            # initialize   zone = -20            # set zone (for LTER region, I think)   ANGLE = -50 * pi / 180   CENTEREAST = 433820.404        # eastings for station 600.040   CENTERNORTH = 2798242.817     # northings for station 600.040     # take latitude longitude and get station   x.y = GeoToUTM(inlat, inlon, zone)   NorthKm.EastKm = SetStation(x.y[1], x.y[2], CENTEREAST, CENTERNORTH, ANGLE)   return(NorthKm.EastKm) }

Once the functions are defined I used them to convert the lat/long coordinates in the xyz file to line-station.

## Read in xyz file. lat.long.depth <- read.table('', header = F, col.names = c('long', 'lat', 'depth')) ## Limit to points above sea level. <- lat.long.depth[which(lat.long.depth$depth >= 0),] ## Create a matrix to hold the output. <- matrix(ncol = 3, nrow = length($long)) colnames(line.station.depth) <- c('line', 'station', 'depth') ## Execute the ll2gridLter function on each point. Yes, I'm using a loop to do this. for(i in 1:length($long)){[i,] <- c(ll2gridLter($lat[i],$long[i]),$depth[i])   print(paste(c(i,[i,]))) } ## Write out the matrix. write.csv(, 'palmer_grid_landmask.csv', row.names = F, quote = F)

At this point I had a nice csv file with line, station, and elevation.  I was able to read this into my existing krigging script and convert into a mask.

## Read in csv file. landmask <- read.csv('palmer_grid_landmask.csv') ## Limit to the lines and stations that I'm interested in. landmask <- landmask[which(landmask[,1] <= 825 & landmask[,1] >= -125),] landmask <- landmask[which(landmask[,2] <= 285 & landmask[,2] >= -25),] ## Interpolated data is at 1 km resolution, need to round off ## to same resolution here. landmask.expand <- cbind(ceiling(landmask[,1]), ceiling(landmask[,2])) ## Unfortunately this doesn't adequately mask the land. Need to expand the size of each ## masked pixel 1 km in each direction. landmask.expand <- rbind(landmask.expand, cbind(floor(landmask[,1]), floor(landmask[,2]))) landmask.expand <- rbind(landmask.expand, cbind(ceiling(landmask[,1]), floor(landmask[,2]))) landmask.expand <- rbind(landmask.expand, cbind(floor(landmask[,1]), ceiling(landmask[,2]))) landmask.expand <- unique(landmask.expand)

I’m not going to cover how I did the krigging in this post.  My krigged data is in matrix called temp.pred.matrix with colnames given by ‘x’ followed by ‘station’, as in x20 for station 20, and row names ‘y’ followed by ‘line’, as in y100 for line 100.  To convert interpolated points that are actually land to NA values I simply added this line to my code:

temp.pred.matrix[cbind(paste0('y', landmask.expand[,1]), paste0('x', landmask.expand[,2] * -1))]

Here’s what the krigged silicate data looks like after masking.


Silicate concentration in µM at 1m depth during the 2014 Palmer LTER cruise after masking the underlying data.

Excellent.  The masked area corresponds with known landmasses; that’s Adelaide Island (home of Rothera Station) at the bottom of the Peninsula, and various islets and the western edge of the Antarctic Peninsula to the northeast.  At this point erroneous data has been eliminated from the matrix.  Annual inventories and such can be more accurately calculated form the data and our eyes are drawn to interesting features in the interpolation that have no chance of reflecting reality because they are over land.  The white doesn’t look that appealing to me in this plot however, so I masked the land with black by adding points to the plot.  Again, I’m not going to show the whole plotting routine because some variables called would require a lengthy explanation about how the larger data is structured.  The plot was created using imagep in the oce package.  This command automatically transposes the matrix.

## Show masked points as black squares. points(landmask.expand[,1] ~ {-1 * landmask.expand[,2]}, pch = 15, cex = 0.6)

And the final plot:


Silicate concentration in µM at 1m depth during the 2014 Palmer LTER cruise after masking the underlying data and adding points to indicate the masked area.



Astrobiology Primer v2

Sat, 08/20/2016 - 23:16

The long-awaited version 2 of the Astrobiology Primer was published (open access) yesterday in the journal Astrobiology.  I’m not sure who first conceived of the Astrobiology Primer, first published in 2006, but the v2 effort was headed by co-lead editors Shawn Domagal-Goldman at NASA and Katherine Wright  at the UK Space Agency.  The Primer v2 was a long time in coming; initial section text was submitted back in early 2011!  The longer these projects go on, the easy it is for them to die.  Many thanks to Shawn and Katherine for making sure that this didn’t happen!

The downside of course, is that the primer ran the risk of being outdated before it even went into review.  This was mitigated somewhat by the review process itself, and authors did have a chance to update their various sections.  Some sections are more stable than others; the section that I wrote with Shawn McGlynn (now at the Tokyo Institute of Technology) on how life uses energy for example, covers some fairly fundamental ground and is likely to stand the test of time.  Less so for sections that cover planetary processes in and outside of our solar system; paradigms are being broken in planetary science as fast as they form!

The Astrobiology Primer is a very valuable document because it takes a complicated and interdisciplinary field of study and attempts to summarize it for a broad audience.  Most of the Primer should be accessible to anyone with a basic grasp of science.  I wonder if it could even serve as a model for other disciplines.  What if the junior scientists in every discipline (perhaps roughly defined by individual NSF or NASA programs) got together once every five years to write an open-access summary of the major findings in their field?  This might provide a rich and colorful counterpoint to the valuable but often [dry, esoteric, top-down?  There’s an adjective that I’m searching for here but it escapes me] reports produced by the National Academies.

The co-lead editors were critical to the success of the Primer v2.  I haven’t explicitly asked Shawn and Kaitlin if they were compensated in any way for this activity – perhaps they rolled some of this work into various fellowships and such over the years.  More likely this was one more extracurricular activity carried out on the side.  Such is the way science works, and the lines are sufficiently blurred between curricular and extracurricular that most of us don’t even look for them anymore.  In recognition of this, and to speed the publication and heighten the quality of a future Primer v3, it would be nice to see NASA produce a specific funding call for a (small!) editorial team.  Three years of partial salary and funding for a series of writing workshops would make a huge difference in scope and quality.

How I learned to stop worrying and love subsampling (rarifying)

Tue, 07/12/2016 - 12:53

I have had the 2014 paper “Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissable” by McMurdie and Holmes sitting on my desk for a while now.  Yesterday I finally got around to reading it and was immediately a little skeptical as a result of the hyperbole with which they criticized the common practice of subsampling* libraries of 16S rRNA gene reads during microbial community analysis.  The logic of that practice proceeds like this:

To develop 16S rRNA gene data (or any other marker gene data) describing a microbial community we generally collect an environmental sample, extract the DNA, amplify it, and sequence the amplified material.  The extract might contain the DNA from 1 billion microbial cells present in the environment.  Only a tiny fraction (<< 1 %) of these DNA molecules actually get sequenced; a “good” sequence run might contain only tens of thousands of sequences per sample.  After quality control and normalizing for multiple copies of the 16S rRNA gene it will contain far fewer.  Thus the final dataset contains a very small random sample of the original population of DNA molecules.

Most sequence-based microbial ecology studies involve some kind of comparison between samples or experimental treatments.  It makes no sense to compare the abundance of taxa between two datasets of different sizes, as the dissimilarity between the datasets will appear much greater than it actually is.  One solution is to normalize by dividing the abundance of each taxa by the total reads in the dataset to get their relative abundance.  In theory this works great, but has the disadvantage that it does not take into account that a larger dataset has sampled the original population of DNA molecules deeper.  Thus more rare taxa might be represented.  A common practice is to reduce the amount of information present in the larger dataset by subsampling to the size of the smaller dataset.  This attempts to approximate the case where both datasets undertake the same level of random sampling.

McMurdie and Holmes argue that this approach is indefensible for two common types of analysis; identifying differences in community structure between multiple samples or treatments, and identifying differences in abundance for specific taxa between samples and treatments.  I think the authors do make a reasonable argument for the latter analysis; however, at worst the use of subsampling and/or normalization simply reduces the sensitivity of the analysis.  I suspect that dissimilarity calculations between treatments or samples using realistic datasets are much less sensitive to reasonable subsampling than the authors suggest.  I confess that I am (clearly) very far from being a statistician and there is a lot in McMurdie and Holmes, 2014 that I’m still trying to digest.  I hope that our colleagues in statistics and applied mathematics continue optimizing these (and other) methods so that microbial ecology can improve as a quantitative science.  There’s no need however, to get everyone spun up without a good reason.  To try and understand if there is a good reason, at least with respect to my data and analysis goals, I undertook some further exploration.  I would strongly welcome any suggestions, observations, and criticisms to this post!

My read of McMurdi and Homes is that the authors object to subsampling because it disregards data that is present that could be used by more sophisticated (i.e. parametric) methods to estimate the true abundance of taxa.  This is true; data discarded from larger datasets does have information that can be used to estimate the true abundance of taxa among samples.  The question is how much of a difference does it really make?  McMurdi and Holmes advocate using methods adopted from transcriptome analysis.  These methods are necessary for transcriptomes because 1) the primary objective of the study is not usually to see if one transcriptome is different from another, but which genes are differentially expressed and 2) I posit that the abundance distribution of transcript data is different than the taxa abundance distribution.  An example of this can be seen in the plots below.

Taxon abundance, taken from a sample that I’m currently working with.













Transcript abundance by PFAM, taken from a sample (selected at random) from the MMETSP.

In both the cases the data is log distributed, with a few very abundant taxa or PFAMs and many rare ones.  What constitutes “abundant” in the transcript dataset however, is very different than “abundant” in the community structure dataset.  The transcript dataset is roughly half the size (n = 9,000 vs. n = 21,000), nonetheless the most abundant transcript has an abundance of 209.  The most abundant OTU has an abundance of 6,589, and there are several very abundant OTUs.  Intuitively this suggests to me that the structure of the taxon dataset is much more reproducible via subsampling than the structure of the transcript dataset, as the most abundant OTUs have a high probability of being sampled.  The longer tail of the transcript data contributes to this as well, though of course this tail is controlled to a large extent by the classification scheme used (here PFAMs).

To get an idea of how reproducible the underlying structure was for the community structure and transcript data I repeatedly subsampled both (with replacement) to 3000 and 1300 observations, respectively.  For the community structure data this is about the lower end of the sample size I would use in an actual analysis – amplicon libraries this small are probably technically biased and should be excluded (McMurdi and Holmes avoid this point at several spots in the paper).  The results of subsampling are shown in these heatmaps, where each row is a new subsample.  For brevity only columns summing to an abundance > 100 are shown.

otu_abundance_heat trans_abundance_heatIn these figures the warmer colors indicate a higher number of observations for that OTU/PFAM.  The transcript data is a lot less reproducible than the community structure data; the Bray-Curtis dissimilarity across all iterations maxes out at 0.11 for community structure and 0.56 for the transcripts.  The extreme case would be if the data were normally distributed (i.e. few abundant and few rare observations, many intermediate observations).  Here’s what subsampling does to normally distributed data (n = 21,000, mean = 1000, sd = 200):


If you have normally distributed data don’t subsample!

For the rest of us it seems that for the test dataset used here community structure is at least somewhat reproducible via subsampling.  There are differences between iterations however, what does this mean in the context of the larger study?

The sample was drawn from a multiyear time series from a coastal marine site.  The next most similar sample (in terms of community composition and ecology) was, not surprisingly, a sample taken one week later.  By treating this sample in an identical fashion, then combining the two datasets, it was possible to evaluate how easy it is to tell the two samples apart after subsampling.  In this heatmap the row colors red and black indicate iterations belonging to the two samples:

Clustering of repeated subsamplings from two similar samples.  Sample identity is given by the red or black color along the y-axis.


As this heatmap shows, for these two samples there is perfect fidelity.  Presumably with very similar samples this would start to break down, determining how similar samples need to be before they cannot be distinguished at a given level of subsampling would be a useful exercise.  The authors attempt to do this in Simlation A/Figure 5 in the paper, but it isn’t clear to me why their results are so poor – particularly given very different sample types and a more sophisticated clustering method than I’ve applied here.

As a solution – necessary for transcript data, normally distributed data, or for analyses of differential abundance, probably less essential for comparisons of community structure – the authors propose a mixture model approach that takes in account variance across replicates to estimate “real” OTU abundance.  Three R packages that can do this are mentioned; edgeR, DESeq, and metagenomeSeq.  The problem with these methods – as I understand them – is that they require experimental replicates.  According to the DESeq authors, technical replicates should be summed, and samples should be assigned to treatment pools (e.g. control, treatment 1, treatment 2…).  Variance is calculated within each pool and this is used to to model differences between pools.  This is great for a factor-based analysis, as is common in transcriptome analysis or human microbial ecology studies.  If you want to find a rare, potentially disease-causing strain differently present between a healthy control group and a symptomatic experimental group for example, this is a great way to go about it.

There are many environmental studies for which these techniques are not useful however, as it may be impractical to collect experimental replicates.  For example it is both undesirable and impractical to conduct triplicate or even duplicate sampling in studies focused on high-resolution spatial or temporal sampling.  Sequencing might be cheap now, but time and reagents are not.  Some of the methods I’m working with are designed to aggregate samples into higher-level groups – at which point these methods could be applied by treating within-group samples as “replicates” – but this is only useful if we are interested in testing differential abundance between groups (and doesn’t solve the problem of needing to get samples grouped in the first place).

These methods can be used to explore differential abundance in non-replicated samples, however, they are grossly underpowered when used without replication.  Here’s an analysis of differential abundance between the sample in the first heatmap above and its least similar companion from the same (ongoing) study using DESeq.  You can follow along with any standard abundance table where the rows are samples and the columns are variables.

library(DESeq) ## DESeq wants to oriented opposite how community abundance ## data is normally presented (e.g. to vegan) data.dsq <- t(data) ## DESeq requires a set of conditions which are factors. Ideally ## this would be control and treatment groups, or experimental pools ## or some such, but we don't have that here. So the conditions are ## unique column names (which happen to be dates). conditions <- factor(as.character(colnames(data.dsq))) ## As a result of 16S rRNA gene copy number normalization abundance ## data is floating point numbers, convert to integers. data.dsq <- ceiling(data.dsq, 0) ## Now start working with DESeq. data.ct <- newCountDataSet(, conditions = conditions) data.size <- estimateSizeFactors(data.ct) ## This method and sharing mode is required for unreplicated samples. data.disp <- estimateDispersions(data.size, method = 'blind', sharingMode="fit-only") ## And now we can execute a test of differential abundance for the ## two samples used in the above example. test <- nbinomTest(data.disp, '2014-01-01', '2014-01-06') test <- na.omit(test) ## Plot the results. plot(test$baseMeanA, test$baseMeanB,      #log = 'xy',      pch = 19,      ylim = c(0, max(test$baseMeanB)),      xlim = c(0, max(test$baseMeanA)),      xlab = '2014-01-01',      ylab = '2009-12-14') abline(0, 1) ## If we had significant differences we could then plot them like this: points(test$baseMeanA[which(test$pval < 0.05)],        test$baseMeanB[which(test$pval < 0.05)],        pch = 19,        col = 'red')



As we would expect there are quite a few taxa present in high abundance in one sample and not the other, however, none of the associated p-values are anywhere near significant.  I’m tempted to try to use subsampling to create replicates, which would allow an estimate of variance across subsamples and access to greater statistical power.  This is clearly not as good as real biological replication, but we have to work within the constraints of our experiments, time, and funding…

*You might notice that I’ve deliberately avoided using the terms “microbiome” and “rarefying” here.  In one of his comics Randall Munroe asserted that the number of made up words in a book is inversely proportional to the quality of the book, similarly I strongly suspect that the validity of a sub-discipline is inversely proportional to the number of jargonistic words that community makes up to describe its work.  As a member of said community, what’s wrong with subsampling and microbial community??

Exploring genome content and genomic character with paprica and R

Thu, 07/07/2016 - 13:07

The paprica pipeline was designed to infer the genomic content and genomic characteristics of a set of 16S rRNA gene reads.  To enable this the paprica database organizes this information by phylogeny for many of the completed genomes in Genbank.  In addition to metabolic inference this provides an opportunity to explore how genome content and genomic characteristics are organized phylogenetically.  The following is a brief analysis of some genomic features using the paprica database and R.  If you aren’t familiar with the paprica database this exercise will also familiarize you with some of its content and its organization.

The paprica pipeline and database can be obtained from Github here.  In this post I’ll be using the database associated with version 0.3.1.  The necessary files from the bacteria database (one could also conduct this analysis on the much smaller archaeal database) can be read into R as such:

## Read in the pathways associated with the terminal nodes on the reference tree path <- read.csv('paprica/ref_genome_database/bacteria/terminal_paths.csv', row.names = 1) path[] <- 0 ## Read in the data associated with all completed genomes in Genbank data <- read.csv('paprica/ref_genome_database/bacteria/', row.names = 1) ## During database creation genomes with duplicate 16S rRNA genes were removed, ## so limit to those that were retained data <- data[row.names(data) %in% row.names(path),] ## "path" is ordered by clade, meaning it is in top to bottom order of the reference tree, ## however, "data" is not, so order it here data <- data[order(data$clade),]

One fun thing to do at this point is to look at the distribution of metabolic pathways across the database.  To develop a sensible view it is best to cluster the pathways according to which genomes they are found in.

## The pathway file in the database is binary, so we use Jaccard for distance library('vegan') path.dist <- vegdist(t(path), method = 'jaccard') # distance between pathways (not samples!) path.clust <- hclust(path.dist)

The heatmap function is a bit cumbersome for this large matrix, so the visualization can be made using the image function.

## Set a binary color scheme image.col <- colorRampPalette(c('white', 'blue'))(2) ## Image will order matrix in ascending order, which is not what we want here! image(t(data.matrix(path))[rev(path.clust$order),length(row.names(path)):1],       col = image.col,       ylab = 'Genome',       xlab = 'Pathway',       xaxt = 'n',       yaxt = 'n') box()
The distribution of metabolic pathways across all 3,036 genomes in the v0.3.1 paprica database.

The distribution of metabolic pathways across all 3,036 genomes in the v0.3.1 paprica database.

There are a couple of interesting things to note in this plot.  First, we can see the expected distribution of core pathways present in nearly all genomes, and the interesting clusters of pathways that are unique to a specific lineage.  For clarity row names have been omitted from the above plot, but from within R you can pull out the taxa or pathways that interest you easily enough.  Second, there are some genomes that have very few pathways.  There are a couple of possible reasons for this that can be explored with a little more effort.  One possibility is that these are poorly annotated genomes, or at least the annotation didn’t associate many or any coding sequences with either EC numbers or GO terms – the two pieces of information Pathway-Tools uses to predict pathways during database construction.  Another possibility is that these genomes belong to obligate symbionts (either parasites or beneficial symbionts).  Obligate symbionts often have highly streamlined genomes and few complete pathways.  We can compare the number of pathways in each genome to other genome characteristics for additional clues.

A reasonable assumption is that the number of pathways in each genome should scale with the size of the genome.  Large genomes with few predicted pathways might indicate places where the annotation isn’t compatible with the pathway prediction methods.

## Plot the number of pathways as a function of genome size plot(rowSums(path) ~ data$genome_size,      ylab = 'nPaths',      xlab = 'Genome size') ## Plot P. ubique as a reference point select <- grep('Pelagibacter ubique HTCC1062', data$organism_name) points(rowSums(path)[select] ~ data$genome_size[select],        pch = 19,        col = 'red')
The number of metabolic pathways predicted as a function of genome size for the genomes in the paprica database.

The number of metabolic pathways predicted as a function of genome size for the genomes in the paprica database.

That looks pretty good.  For the most part more metabolic pathways were predicted for larger genomes, however, there are some exceptions.  The red point gives the location of Pelagibacter ubique HTCC1062.  This marine bacterium is optimized for life under energy-limited conditions.  Among its adaptations are a highly efficient and streamlined genome.  In fact it has the smallest genome of any known free-living bacterium.  All the points below it on the x-axis are obligate symbionts; these are dependent on their host for some of their metabolic needs.  There are a few larger genomes that have very few (or even no) pathways predicted.  These are the genomes with bad, or at least incompatible annotations (or particularly peculiar biochemistry).

The other genome parameters in paprica are the number of coding sequences identified (nCDS), the number of genetic elements (nge), the number of 16S rRNA gene copies (n16S), GC content (GC), and phi; a measure of genomic plasticity.  We can make another plot to show the distribution of these parameters with respect to phylogeny.

## Grab only the data columns we want <- data[,c('n16S', 'nge', 'ncds', 'genome_size', 'phi', 'GC')] ## Make the units somewhat comparable on the same scale, a more ## careful approach would log-normalize some of the units first <- decostand(, method = 'standardize') <- decostand(, method = 'range') ## Plot with a heatmap heat.col <- colorRampPalette(c('blue', 'white', 'red'))(100) heatmap(data.matrix(,       margins = c(10, 20),       col = heat.col,       Rowv = NA,       Colv = NA,       scale = NULL,       labRow = 'n',       cexCol = 0.8)
Genomic parameters organized by phylogeny.

Genomic parameters organized by phylogeny.

Squinting at this plot it looks like GC content and phi are potentially negatively correlated, which could be quite interesting.  These two parameters can be plotted to get a better view:

plot($phi ~$GC,      xlab = 'GC',      ylab = 'phi')
The phi parameter of genomic plasticity as a function of GC content.

The phi parameter of genomic plasticity as a function of GC content.

Okay, not so much… but I think the pattern here is pretty interesting.  Above a GC content of 50 % there appears to be no relationship, but these parameters do seem correlated for low GC genomes.  This can be evaluated with linear models for genomes above and below 50 % GC.

gc.phi.above50 <- lm($phi[which($GC >= 50)] ~$GC[which($GC >= 50)]) gc.phi.below50 <- lm($phi[which($GC < 50)] ~$GC[which($GC < 50)]) summary(gc.phi.above50) summary(gc.phi.below50) plot($phi ~$GC,      xlab = 'GC',      ylab = 'phi',      type = 'n') points($phi[which($GC >= 50)] ~$GC[which($GC >= 50)],        col = 'blue') points($phi[which($GC < 50)] ~$GC[which($GC < 50)],        col = 'red') abline(gc.phi.above50,        col = 'blue') abline(gc.phi.below50,        col = 'red') legend('bottomleft',        bg = 'white',        legend = c('GC >= 50',                   'GC < 50'),        col = c('blue', 'red'),        pch = 1)

Genomic plasticity (phi) as a function of GC content for all bacterial genomes in the paprica database.

As expected there is no correlation between genomic plasticity and GC content for the high GC genomes (R2 = 0) and a highly significant correlation for the low GC genomes (albeit with weak predictive power; R2 = 0.106, p = 0).  So what’s going on here?  Low GC content is associated with particular microbial lineages but also with certain ecologies.  The free-living low-energy specialist P. ubique HTCC1062 has a low GC content genome for example, as do many obligate symbionts regardless of their taxonomy (I don’t recall if it is known why this is).  Both groups are associated with a high degree of genomic modification, including genome streamlining and horizontal gene transfer.

The 6 cent speeding ticket

Fri, 06/17/2016 - 11:51

I’m going to go way of the normal track here and do a bit of social commentary.  I heard a radio piece on my drive home yesterday about the challenge of paying court and legal fees for low income wage earners.  This can trap those guilty of minor offenses (like a traffic infraction) in a cycle of jail and debt that is difficult to break out of.  It never made sense to me that financial penalties – which by their nature are punitive – don’t scale with income, as is common in some European countries.  I decided to try and visualize how the weight of a penalty for say, a speeding ticket, scales with income.  It was tempting to try and scale up, i.e. what is the equivalent of $300 for someone earning $X?  I decided however, that it would be more informative to try and scale down.

Cost equivalent in minimum wage earner dollars of a $300 penalty for individuals at different income levels. Red text is the cost equivalent (y-axis value). X-axis (income) is on a log scale. See text for income data sources.

In this scenario the penalty is $300.  Someone earning the federal minimum wage and working 40 hours a week makes $15,080/year, so the $300 penalty is roughly 2 % of annual income.  So be it, perhaps that’s a fair penalty.  But what would be the equivalent if the same offender earned more?  A private/seaman/airman who has just joined the military earns roughly $18,561/year.  Paying the same ticket (and I know from experience that the military pays a lot of them) would equate to the minimum wage earner paying $243.72.  A graduate student fortunate enough to get a stipend (and own a car) might earn $25,000/year.  Paying the same ticket would be equivalent to the lowest wage earner paying $180.96, and down it goes along the income scale.  If LeBron James, who earned $77.2 million last year in salary and endorsements (according to Forbes), got the ticket, the penalty would be equivalent to the lowest income wage earner paying $0.06.  Salary data came from a variety of sources, including here and here.  Salaries marked with an asterisk in the plot above are medians from these sources.

The Database Dilemma

Wed, 05/25/2016 - 12:31

Microbial ecologists know they have a problem with data archiving, particularly when it comes to sequence data.  I’m not entirely sure why this is the case; in theory it would be pretty straightforward to build a database searchable by the parameters ecologists are interested in.  I suspect that part of the problem is a lack of interest on the part of NSF (the primary source of funding for ecology) in designing and maintaining databases, and the perception that this is duplicative of the (rather sad) efforts of the NIH-funded NCBI in this area.  I don’t want to be too disparaging of NCBI – they have a much harder job than the simple database I’ve outlined above – but the Sequence Read Archive (SRA) and other NCBI repositories are a near-total disaster.  We’re wasting a lot of collective time, money, and effort developing sequence data that can’t be used for future studies because it can’t be easily accessed.

In response to this a small number of alternate repositories have popped up.  In my experience however, they don’t offer much of an improvement.  This isn’t meant as a harsh criticism of various peoples’ good efforts, I just don’t think the community’s found a viable model yet.  One example is the VAMPS database operated by the Marine Biological Laboratory, which in addition to having an unfortunate name (I believe it was named before the onset of the angsty-teen vampire thing, try googling “VAMPS” and you’ll see what I mean), tries to do too much.  Rather than organizing data and making it searchable the designers opted to try and build an online analysis pipeline.  Instead of easy to find data you end up with an analytical black box and canned results.

The reason for all this soap-boxing is my experience this week trying to obtain some global marine 16S rRNA gene amplicon data for a project that I’m working on.  The data I had hoped to obtain was used in the 2014 PNAS paper Marine bacteria exhibit a bipolar distribution.  My difficulty in obtaining the data highlights failures at the author, reviewer, journal, and repository levels.  I have great respect for many of the authors on this paper, so again this isn’t meant as a harsh criticism (I’m sure I haven’t made data I’ve published sufficiently accessible either), but rather a lesson in something that we all need to do better at.  I’ll end with the Python-based solution I came up with for getting useful sequence data out of the VAMPS flat-file format.

The 2014 paper used 277 samples that were collected as part of the International Census of Marine Microbes (ICOMM).  There are quite a few things one can say about that particular endeavor but we’ll save it for another post.  The authors included the accession numbers for all these studies in a supplementary file.  This file is a pdf, so not the easiest thing to parse, but at least the text is renderable, so the (7-page) table can be copied into a text file and parsed further.  With a little bit of effort you can get a single-column text file containing the accession numbers for each of these studies.  Great.  Unfortunately a good number of these are wildly incorrect, and most of the remainder point to SRA objects that can’t be recognized by the woefully inadequately documented SRA toolkit.  Here are some examples:

SRX011052 – Enter this accession number into the SRA search bar and it will navigate you to a page for a sample with the helpful name “Generic sample from marine metagenome”.  Nevermind for a moment that “metagenome” is an inappropriate name for an amplicon dataset, where is the data?  Click the link for “Run” (with a different accession number: SRR027219) and navigate to a new page with lots of obscure information you don’t care about.  If you’re actually doing this you might, in near desperation, click the “Download” tab in the hope that this gets you somewhere.  Psych!  All that page has is a message that you need to use the SRA toolkit to download this data.  Let’s try the prefetch utility in the toolkit:

>prefetch SRR027219 2016-05-25T15:30:43 prefetch.2.4.2 err: path not found while resolving tree within virtual file system module - 'SRR027219.sra' cannot be found.

Whatever that means.  There’s a trick to using this utility, which I’ll post as soon as I remember what it is.  The documentation suggests that the above command should work.  The absurdity of needing to rely on this utility in place of FTP usually drives me to using the EMBL-EBI site instead.  This is the Euro equivalent of Genbank, anything submitted to one should be accessible through the other.  In general EMBL-EBI is much better organized and more user friendly than SRA.

Entering “SRX011052” into the ENBL-EBI search bar returns some results that also point us to SRR027219.  The page for SRR027219 has such helpful things as an FTP link for direct download as well as some basic info on the sample.  That’s great, but because the authors opted to provide “Experiment” accession numbers for some samples (e.g. SRX011052) there is no way that I can see to generate a script to automate downloads, as we have no way of knowing the run accession numbers.  The solution is to tediously follow the links to all 277 runs.  I actually started doing this which lead me to my second problem.

SRP001129 – At some point in the supplementary table the accession numbers switch from a “SRX” prefix to “SRP”.  Picking one of these at random and entering it into the EMBL-EBI search bar returns a result for a genome sequencing project related to the Human Microbiome Project.  Clearly this isn’t from the ICOMM project.  Roughly a third of the accession numbers reported in the paper aren’t correct!

Clearly I wasn’t going to get the data that I needed through either EMBL-EBI or the SRA.  Knowing that the study had been part of the ICOMM I turned to the MBL VAMPS database where the ICOMM data is also archived.  Thinking that I was on the home stretch I followed the link to the ICOMM portal from the VAMPS homepage.  The right side column gave me two options for download; tar-zipped sff files or a tab-delimited file.  Not wanting to deal with converting from sff (the organic 454 file format) I took a look at the tab-delim option.  The top lines of the file look like this:

project dataset sequence taxonomy knt rank total_knt frequency ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGACGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05 ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGA Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05 ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 13 family 36172 0.000359394006 ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCTAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05 ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGATGAAGATTCAGCGTAATGAGCTTATCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05 ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAGGGCCGACTGTTAGATGAAGACCAGTGTAACGAACTTGTCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05 ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAGGGCCGACTGTTATATGAAGACCAATGTGATGAACTTGTCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 2 family 36172 5.5291386e-05 ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTAATGAGCTTGTCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05 ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTGATGAACTTGTCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05

Okay, so this is a bizarre way to share sequence data but at this point I was willing to take it.  Basically the file gives us unique reads, their abundance (knt), and taxonomy according to VAMPS.  Want I needed to do was convert this into a separate, non-unique fasta file for each sample.  With a little help from Pandas this wasn’t too bad.  The vamps_icomm_surface.csv file specified below is just a csv version of the pdf table in the PNAS publication, after some tidying up.  The second row of the csv file is the sample name in VAMPS (I had to split the first column of the original table on ‘.’ to get this).

import pandas as pd ## get the metadata for surface samples, this was generated ## from the pdf table in the PNAS paper metadata = pd.read_csv('vamps_icomm_surface.csv', index_col = 1) ## get the seqs, downloaded from seqs = pd.read_csv('icomm_data.tsv', delimiter = '\t', index_col = 1) ## iterate across each study in metadata, making a temp dataframe ## of reads from that study for study in metadata.index: temp = seqs.loc[study] temp_bacteria = temp[temp['taxonomy'].str.contains('Bacteria')] ## now iterate across each row in the temp dataframe, exporting ## the sequence the specified number of times to a fasta file with open(study + '.bacteria.fasta', 'w') as output: for row in temp.iterrows(): sequence = row[1]['sequence'] for i in range(1, row[1]['knt']): name = row[0] + '.' + str(i) print name print >> output, '>' + name print >> output, sequence

To end I’d like to circle back to the original dual problems of poor database design and erroneous accession numbers.  Although these problems were not directly connected in this case they exacerbated each other, magnifying the overall problem of data (non)reuse.  Some suggestions:

  1.  The SRA and other databases needs to aggregate the run accession numbers for each new study.  A couple of clicks should lead me to a download link for every sequence associated with a study, whether that study generated the raw data or only used it.  I don’t believe it is currently possible to do this within SRA; initiating a new project assumes that you will contribute new raw sequences (correct me if you think otherwise).  This should be easy to implement, and the association of a publication with a project should be rigorously enforced by journals, departments, and funding agencies.
  2. Accession numbers should be automatically checked by journals the way references are.  If it points to something that isn’t downloadable (or at least has a download link on the page) the author should correct it.  There is no excuse for incorrect, incomplete, or non-run accession numbers.  That this was allowed to happen in PNAS – a top-tier journal – is double frustrating.  And frankly, if you reviewed this paper you owe me beer for not spot checking the refs.
  3. Validating this information is the job of the reviewers.  Almost nobody does this.  Usually when I review a paper containing sequence data I’m the only reviewer that insists the data is available at the time of publication (I can’t think of a single case where I wasn’t the only reviewer to insist).  Not only is this a necessary check, but part of the job of the reviewer is to spot check the analysis itself.  Clearly, without data that’s not being done either…

And that’s enough ranting about data availability for one day, back to work…

Roadmap to Ocean Worlds: Polar microbial ecology and the search for boring, normal life

Mon, 05/16/2016 - 20:11

Recently congress recommended that NASA create an Ocean Worlds Exploration Program whose primary goal is “to discover extant life on another world using a mix of Discovery, New Frontiers, and flagship class missions”.  Pretty awesome.  In February I was invited to participate on the science definition team for the Roadmap to Ocean Worlds (ROW) initiative.  The ROW is step one in NASA’s response to the congressional recommendation.  Over the last few months the science definition team has been meeting remotely to hash through some challenging questions around the central themes of identifying ocean worlds, defining their habitability, and understanding the life that might live in or on them.  This week we will have our first and only opportunity to meet in person as a team.  As one of very few biologists in the group, and possible the only ecologist, I’ll be giving a short talk on polar microbial ecology as it pertains to discovering and understanding life on other worlds.  As a way to organize my thoughts on the subject and prep for the talk I’m going to try an experiment and write out the main points of the presentation as an article.

I decided to title the talk (and this post) Polar microbial ecology and the search for boring, normal life because it reflects an idea that I and others have been advocating for some time: there isn’t anything that special about extreme environments (gasp!).  Life functions there in pretty much the same way that it functions everywhere else.  One way to think of this is that polar microbes (and other extremophiles) are uniquely adapted but not unique; they follow the standard rules of ecology and evolution, including how they acquire energy and deal with stress.  The nuances of how they interact with their environment reflect the unique characteristics of that environment, but the microbes are generally kind of conventional.

Act I: Challenges and opportunities for polar microbes

As with microbes in any environment, polar microbes are presented with a range of challenges and opportunities that define how they live.  From a microbial perspective challenges are stressors, or things that cause damage.  Opportunities are sources of energy.  Very often these things are the same.  Take sunlight, for example.  This is the single most important source of energy on the planet, but get a little too much of it (or any of it, to be more accurate), and you start doing damage to yourself.  As a result you have to invest some of the acquired energy into offsetting this stress.  We can illustrate this with a very simple conceptual model.

stress_energyThe above figure shows a hypothetical (but plausible) relationship between energy and stress.  Energy is shown both on the x-axis and as the blue 1:1 line.  Stress is the orange line.  In this scenario as energy increases, stress increases logarithmically.  The energy that is available for growth (that which is not dedicated to offsetting the damage caused by stress) is energy – stress, or the difference between the blue and orange lines.  That looks like this:

biomass_stressThe units for biomass are, of course, arbitrary, what matters is the shape of the curve.  Past a critical energy value the amount of biomass an ecosystem can sustain actually decreases as energy increases.  At the point at which stress = energy, no biomass is being accumulated.  It isn’t a coincidence that the shape of the biomass curve is the same as experimentally determined temperature curves for bacterial isolated belonging to different temperature classes (below).  In those curves the energy in the system is increasing as temperature rises.  This enhances enzymatic reactions (producing more biomass) until there is so much energy that the system becomes disordered (enzymes denature).  This process is directly analogous to the one presented here, but takes place at the population level rather than the ecosystem level.


Figure from  Temperature curves for a psychrophilic (optimized to low temperature) bacteria and a mesophilic (optimized to roughly human body temperature) relative.

One thing that I would like to make clear up front is that low temperature itself is not a stressor.  Low temperature alone doesn’t kill single-celled organisms, it only impedes their ability to deal with other stress.  Because of this energy and stress have a very complicated relationship with temperature in cold ecosystems.  For example:

  • As temperature decreases, energy decreases.
  • As energy decreases, stress decreases (remember that in our conceptual model stress is a function of energy).
  • As temperature decreases, access to substrate (and by proxy energy) increases.
  • As temperature decreases, inhibitory compounds (which are stress) increases.

So many of the effects of low temperature are in conflict.  Highly adapted psychrophiles play off these conflicts to establishes niches in cold environments.  We can illustrate this idea by focusing on the last two bullet points, which are specific to environments that contain ice.  Impure ice (which virtually all environmental ice is) does not form a solid structure when it freezes.  It forms a porous matrix, with the porosity determined by the concentration of solutes and the temperature.  You can actually view this in 3D using sophisticated X-ray tomography techniques, as shown in this figure from Pringle et al., 2009.

From Pringle et al., 2009

From Pringle et al., 2009.  X-ray tomography images of saline ice at different temperatures.  The gold represents pore spaces, which decrease in size as the temperature drops.

The gold spaces in this figure are pore spaces within the ice.  As the ice gets colder the pore spaces become smaller and the solutes contained in them become more concentrated.  That’s because whatever was dissolved in the starting material (e.g. salt, sugars, small particles) is excluded from the ice as it forms crystals, the excluded material ends up in the pore spaces.  The colder the ice, the smaller the spaces, the more concentrated the solutes.  Bacteria and ice algae that are also excluded into these spaces will experience a much higher concentration of nutrients, potential sources of carbon, etc.  This equates to more energy, which helps them offset the stress of being in a very salty environment.  In very cold, relatively pure ice, solute concentrations can be 1000x those of the starting material.  Neat, right?

Here’s what decreasing temperature actually means for a bacterial cell trying to make a go of it in a cold environment:

quick_blog_figOur earlier plot of biomass depicted cells in the “growth” phase shown here.  Only very early in the plot, when energy was very low, and at the very end, when stress was very high, did biomass approach zero.  That zero point is called the maintenance phase.  Maintenance phase happens when decreasing temperature suppresses the ability to deal with stress to the point that a cell cannot invest enough resources into creating biomass to reproduce.  The cell is investing all its energy in offsetting the damage caused by stress.

At increasingly low temperatures the synthesis of new biomass becomes increasingly less efficient, requiring a proportionally greater expenditure of energy (we say that bacterial growth efficiency decreases).  At the end of the maintenance phase the bacterial cell is respiring, that is it is creating energy, but this is resulting in virtually no synthesis or repair of cellular components.  This leads inevitably to cell death when enough cellular damage has accumulated.  It’s sad.

If we lift our heads up out of the thermodynamic weeds for a moment we can consider the implications of all this for finding life on another cold world (all the good ones are cold):

quick_blog_fig2On the assumption that everyone’s asleep by this point in the presentation I’m trying to be funny, but the point is serious.  You hear a lot of silly (my bias) talk in the astrobiology community about life at the extremes; how low can it go temp wise etc.  Who cares?  I’m not particularly interested in finding a maintenance-state microbial community, nor do we stand a very good chance of detecting one even if it was right under our (lander’s) nose.  The important contribution to make at this points is to determine where life is actively growing on the relevant ocean worlds.  Then we can try to figure out how to reach those places (no mean task!).  The point of this whole exercise it to find extant life, after all…

Act II: Where are polar microbes distributed and why?

I can think of 8 polar environments that are particularly relevant to ocean worlds:

  • Supraglacial environment (glacier surface)
  • Interstitial glacier environment (glacier interior)
  • Subglacial environment (where the glacier meets rock)
  • Sea ice surface (top’o the sea ice)
  • Interstitial sea ice (sea ice interior)
  • Sea ice-seawater interface (where the sea ice meets the seawater)
  • Water column below ice (the ocean! or a lake)
  • The sediment-water interface (where said lake/ocean meets mud)

Each of these environments has a range of stress, energy, and temperatures, and this to a large extent defines their ability to support biomass.  For the purpose of this discussion I’ll report biomass in units of bacteria ml-1.  To give some sense of what this means consider that what the waste-treatment profession politely calls “activated sludge” might have around 108 bacteria ml-1.  Bottled water might have 103 bacteria ml-1.  Standard ocean water ranges between 104 bacteria ml-1 and 106 bacteria ml-1 at the very highest end.  One further note on biomass… our conceptual model doesn’t account for grazing, or any other trophic transfer of biomass, because that biomass is still in the biological system.  Bacteria ml-1 doesn’t reflect this, because bacteria are consumed by other things.  So the values I give for the ecosystems below are at best a loose proxy for the capacity of each ecosystem to support biomass.

From left to right, top to bottom: supraglacial environment (, interstitial glacier environment (Price, 2000), subglacier environment – arguably a subglacial lake, but a very shallow one with interface characteristics (, sea ice surface (hey, that’s me!), sea ice interior (Krembs et al., 2002), sea ice-seawater interface (I.A. Melnikov), water below ice (from our last season at Palmer Station), sediment-water interface (

Let’s consider the sea ice-seawater interface and the sea ice surface in a little more detail.  The sea ice-seawater interface is interesting because it has, among all polar microbial environments, probably the greatest capacity to support biomass (we’re talking about summertime sea ice, of course).  It is located at the optimum balance point between energy and stress, as shown in the figure below.  Despite the fact that the sea ice surface has relatively high biomass, it is located at the extreme right side of the plot.  The trick is that biomass has accumulated at the sea ice surface as a result of abiotic transport, not in situ growth.  I explored this pretty extensively back in the very early days of my PhD (see here and here).


Despite the fact that the sea ice surface doesn’t function as a microbial habitat, the concentration of biomass there might be relevant to our goal of finding extant life on another world.  The accumulation of bacteria at the ice surface is largely the result of bacteria being passively transported with salt.  Recall that saline ice is a porous matrix containing a liquid brine.  All of the bacteria in the ice are also contained in the liquid brine; as the ice cools the pore spaces get smaller, forcing some of the brine and bacteria to the ice surface.  Porous ice in say, the Europan ice shell would act similarly.  Salt is easier to identify than life (in fact we know that there are large salt deposits on the surface of Europa), so we can target salty areas for deeper search.  In astrobiology we like to “follow things” (because we get lost easy?); “follow the water” and “follow the energy” are often cited and somewhat useful axioms.  So here we have a “follow the salt” situation.

Act III: Can we explore polar microbial ecology in life detection?

Act II ended on a life detection note, so let’s follow this through and consider other details of polar microbial ecology that might give us clues of what to look for if we want to identify life.  Getting ecology to be part of this discussion (life detection) is non-trivial.  Despite the fact that the whole field of astrobiology is really an elaborate re-visioning of classic ecology (who lives where and why), there isn’t a whole lot of interest in studying how organisms interact with their environment.  This simply reflects the field’s disciplinary bias; the most active communities within astrobiology may well be astronomers and planetary geologists.  If the most active community was ecologists we’d probably be ignoring all kinds of important stuff about how planets are formed, etc.  To illustrate the utility of ecology for life detection here are two examples of ecological principles that might lead to life detection techniques.

Case 1 – Biological alteration of the microstructure of ice

From Krembs et al., 2002.

From Krembs et al., 2002.  This sea ice diatom has filled the pore space with EPS as a buffer against high salinity and as a cryoprotectant.

Earlier I put forward that decreasing temperatures can lead to heightened stress because life in ice is exposed to higher concentrations of damaging substances at lower temperatures.  The most obvious example is salt.  In very cold sea ice the pore space salinity approaches 200 ppt, that’s roughly 6 times the salinity of seawater!  Maintaining adequate internal water is a huge challenge for a cell under these conditions.  One of the mechanisms for dealing with this is to produce copious amounts of exopolysaccharides (EPS), a hydrated gel (essentially mucous) containing polysaccharides, proteins, and short peptides.  EPS buffers the environment around a cell, raising the activity of water and, in some cases, interacting with ice.  This produces a highly modified internal structure, as shown in the images below.  This alteration could be a useful way of identifying ice on an ocean world that has been modified by a biological community.

From Krembs et al., 2011.

From Krembs et al., 2011.

Case 2: Motility

Motility has been put forward before as an unambiguous signature of life, but the idea hasn’t really gained a lot of traction.  Clearly one needs to be cautious with this – Brownian motion can look a lot like motility – but I can’t think of anything else that life does that is as easily distinguishable from abiotic processes.  One additional challenge however, is the not all life is motile.  Plants aren’t motile, at least most of them over the timescales that we care to stare at them for.  Not all microbes are motile either, but I would argue that those that aren’t aren’t only because others are.

From Stocker, 2012.

From Stocker, 2012.

Consider the figure at right, which is a cartoon of two modes of bacterial life in the ocean.  One of those modes is motile, and can be seen using flagella to follow chemical gradients (we call this chemotaxis) and optimize their location with respect to phytoplankton, their source of carbon.  The second mode is much smaller and in the background; small-bodied non-motile cells that live on the diffuse carbon that they opportunistically encounter.  This works because that would be an inefficient niche for motile bacteria to exploit.  In the absence of motility however, chemical gradients constitute a very strong selective pressure to evolve motility.  We can see evidence of this in the convergent evolution of flagellar motility (and other forms of motility) in all three domains of life.  Although they may share a common chemotaxis sensory mechanism, the Bacteria, Archaea, and Eukarya all seem to have evolved flagellar motility independently.  This means that it’s probably a pretty good feature to have, and is likely to be shared by microbes on other ocean worlds.

That was quite a lot, so to summarize:

  • Microbial communities are oriented along gradients of energy and stress.  At some optimal point along that gradient a maximum amount of biomass can be supported by surplus energy that is not being used to deal with stress (How much energy do you spend on stress?  Think of how much more biomass you could have!).
  • The relationships between energy, stress, and temperature are complicated, but Earth life generally works at T > -12 °C.  This estimate is probably a little lower than the reality, because laboratory observations of growth at that temperature don’t accurately reflect environmental stress.
  • Life is strongly biased towards surfaces and interfaces, these may provide enhanced opportunities for life detection (follow the salt!).
  • The specific ecology of cold organisms can provide some further insights into life detection strategies.  For example, motility might be an under-appreciated signature of life.


Phylogenetic placement re-re-visited

Fri, 05/13/2016 - 15:48

Disclaimer: I banged this out fast from existing scripts to help some folks, but haven’t tested it yet.  Will do that shortly, in the meantime, be careful!


I use phylogenetic placement, namely the program pplacer, in a lot of my publications.  It is also a core part of of the paprica metabolic inference pipeline.  As a result I field a lot questions from people trying to integrate pplacer into their own workflows.  Although the Matsen group has done an excellent job with documentation for pplacer, guppy, and taxtastic, the three programs you need to work with to do phylogenetic placement from start to finish (see also EPA), there is still a steep learning curve for new users.  In the hope of bringing the angle of that curve down a notch or two, and updating my previous posts on the subject (here and here), here is a complete, start to finish example of phylogenetic placement, using 16S rRNA gene sequences corresponding to the new tree of life recently published by Hug et al.  To follow along with the tutorial start by downloading the sequences here.

You can use any number of alignment and tree building programs to create a reference tree for phylogenetic placement.  I strongly recommend using RAxML and Infernal.  After a lot of experimentation this combination seems to be produce the most correct topologies and best supported trees.  You should be aware that no 16S rRNA gene tree (or any other tree) is absolutely “correct” for domain-level let alone life-level analyses, but nothing in life is perfect.  While you’re installing software I also recommend the excellent utility Seqmagick.  Finally, you will need a covariance model of the 16S rRNA gene to feed into Infernal.  You can find that at the Rfam database here.

The workflow will follow these steps:

  1. Create an alignment of the reference sequences with Infernal
  2. Create a phylogenetic tree of the alignment
  3. Create a reference package from the alignment, tree, and stats file
  4. Proceed with the phylogenetic placement of your query reads

Create an alignment of the reference sequences

The very first thing that you need to do is clean your sequence names of any wonky punctuation.  This is something that trips up almost everyone.  You should really have no punctuation in the names beyond “_”, and no spaces!

tr " -" "_" < hug_tol.fasta | tr -d "%\,;():=.\\*[]\"\'" > hug_tol.clean.fasta

Next create an alignment from the cleaned file.  I always like to degap first, although it shouldn’t matter.

## Degap seqmagick mogrify --ungap hug_tol.clean.fasta ## Align cmalign --dna -o hug_tol.clean.align.sto --outformat Pfam hug_tol.clean.fasta ## Convert to fasta format seqmagick convert hug_tol.clean.align.sto hug_tol.clean.align.fasta

Build the reference tree

At this point you should have a nice clean DNA alignment in the fasta format.  Now feed it to RAxML to build a tree.  Depending on the size of the alignment this can take a little bit.  I know you’ve read the entire RAxML manual so of course you are already aware that adding additional cpus won’t help…

raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -s hug_tol.clean.align.fasta -n ref.tre -f d -p 12345

I like having a sensibly rooted tree; it’s just more pleasing to look at.  You can do this manually, or you can have RAxML try to root the tree for you.

raxmlHPC-PTHREADS-AVX2 -T 2 -m GTRGAMMA -f I -t RAxML_bestTree.ref.tre -n root.ref.tre

Okay, now comes the tricky bit.  Clearly you’d like to have some support values on your reference tree, but the Taxtastic program that we will use to build the reference tree won’t be able to read the RAxML stats file if it includes confidence values.  The work around is to build a second tree with confidence values.  You will feed this tree to Taxtastic with the stats file from the tree we already generated.

## Generate confidence scores for tree raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -f J -p 12345 -t RAxML_rootedTree.root.ref.tre -n conf.root.ref.tre -s hug_tol.clean.align.fasta

Now we can use the alignment, the rooted tree with confidence scores, and the stats file without confidence scores to create our reference package.

taxit create -l 16S_rRNA -P hug_tol.refpkg --aln-fasta hug_tol.clean.align.fasta --tree-stats RAxML_info.ref.tre --tree-file RAxML_fastTreeSH_Support.conf.root.ref.tre

Align the query reads

At this point you have the reference package and you can proceed with analyzing some query reads!  The first step is to align the query reads in exactly the same fashion as the reference sequences.  This is important as the alignments will be merged later.

## Clean the names tr " -" "_" < query.fasta | tr -d "%\,;():=.\\*[]\"\'" > query.clean.fasta ## Remove any gaps seqmagick mogrify --ungap ## Align cmalign --dna -o query.clean.align.sto --outformat Pfam query.clean.fasta

Now we use the esl-alimerge command, included with Infernal, to merge the query and reference alignments.

## Merge alignments esl-alimerge --outformat pfam --dna -o query.hug_tol.clean.align.sto query.clean.align.sto hug_tol.refpkg/hug_tol.clean.align.sto ## Convert to fasta seqmagick convert query.hug_tol.clean.align.sto

Phylogenetic placement

Now we’re on the home stretch, we can execute the phylogenetic placement itself!  The flags are important here, so it’s worth checking the pplacer documentation to insure that your goals are consistent with mine (get a job, publish some papers?).  You can probably accept most of the flags for the previous commands as is.

pplacer -o query.hug_tol.clean.align.jplace -p --keep-at-most 20 -c hug_tol.refpkg query.hug_tol.clean.align.fasta

At this point you have a file named query.hug_tol.clean.align.jplace.  You will need to use guppy to convert this json-format file to information that is readable by human.  The two most useful guppy commands (in my experience) for a basic look at your data are:

## Generate an easily parsed csv file of placements, with only a single placement reported for each ## query read. guppy to_csv --point-mass --pp -o query.hug_tol.clean.align.csv query.hug_tol.clean.align.jplace ## Generate a phyloxml tree with edges fattened according to the number of placements. guppy fat --node-numbers --point-mass --pp -o query.hug_tol.clean.align.phyloxml query.hug_tol.clean.align.jplace

Another victim of science funding

Tue, 04/12/2016 - 15:27

Or rather the lack thereof.  I was very disappointed to receive an email yesterday that BioCyc, a popular database of enzymes and metabolic pathways in model organisms, is moving to a subscription model.  The email is posted below in its entirety (to assist in the request to spread the word).  With this move BioCyc cements the pattern established several years ago by the Koyoto Encyclopedia of Genes and Genomes (KEGG).  I don’t begrudge the architects of these databases for moving behind a paywall; as the BioCyc principal investigator Peter Karp notes below, developing and maintaining a high-quality biological database is time and resource intensive, and skilled curators are expensive.  I do think however, that moving these resources to a subscription-based model does a disservice to the research community and is not in the public interest.  While it was not the responsibility of US funding agencies to ensure the long-term viability and accessibility of KEGG, BioCyc is in their court.  In my opinion the failure of NSF and NIH to adequately support community resources amounts to a dereliction of duty.

As noted in the letter below one of the challenges faced by database designers and curators is the tradition of peer review.  At its best the peer review process is an effective arbitrator of research spending.  I think that the peer review process is at its best only under certain funding regimes however, and suspect that the current low rates of federal funding for science do not allow for fair and unbiased peer review.  This is particularly the case for databases and projects whose return on investment is not evident in the short term in one or two high-impact publications, but in the supporting role played in dozens or hundreds or thousands of studies across the community over many years.  Program officers, managers, and directors need to be cognizant of the limitations of the peer review process, and not shy away from some strategic thinking every now and again.

My first experience with BioCyc came in my first year of graduate school, when I was tentatively grappling with the relationship between gene and genome functions and the ecology of cold-adapted microbes.  Like many academic labs around the country the lab I was in was chronically short on funds, and an academic license for expensive software (e.g. CLC Workbench) or a database subscription (á la KEGG or BioCyc – both were thankfully free back in those days) would have been out of the question.  Without these tools I simply would have had nowhere to start my exploration.  I fear that the subscription model creates intellectual barriers, potentially partitioning good ideas (which might arise anywhere) from the tools required to develop them (which will only be found in well-funded or specialist labs).

Viva community science!


Dear Colleague,

I am writing to request your support as we begin a new chapter in the development of BioCyc.

In short, we plan to upgrade the curation level and quality of many BioCyc databases to provide you with higher quality information resources for many important microbes, and for Homo sapiens.  Such an effort requires large financial resources that — despite numerous attempts over numerous years — have not been forthcoming from government funding agencies.  Thus, we plan to transition BioCyc to a community-supported non-profit subscription model in the coming months.

Our Goal

Our goal at BioCyc is to provide you with the highest quality microbial genome and metabolic pathway web portal in the world by coupling unique and high-quality database content with powerful and user-friendly bioinformatics tools.

Our work on EcoCyc has demonstrated the way forward.  EcoCyc is an incredibly rich and detailed information resource whose contents have been derived from 30,000 E. coli publications.  EcoCyc is an online electronic encyclopedia, a highly structured queryable database, a bioinformatics platform for omics data analysis, and an executable metabolic model.  EcoCyc is highly used by the life-sciences community, demonstrating the need and value of such a resource.

Our goal is to develop similar high-quality databases for other organisms.  BioCyc now contains 7,600 databases, but only 42 of them have undergone any literature-based curation, and that occurs irregularly.  Although bioinformatics algorithms have undergone amazing advances in the past two decades, their accuracy is still limited, and no bioinformatics inference algorithms exist for many types of biological information.  The experimental literature contains vast troves of valuable information, and despite advances in text mining algorithms, curation by experienced biologists is the only way to accurately extract that information.

EcoCyc curators extract a wide range of information on protein function; on metabolic pathways; and on regulation at the transcriptional, translational, and post-translational levels.

In the past year SRI has performed significant curation on the BioCyc databases for Saccharomyces cerevisiae, Bacillus subtilis, Mycobacterium tuberculosis, Clostridium difficile, and (to be released shortly) Corynebacterium glutamicum. All told, BioCyc databases have been curated from 66,000 publications, and constitute a unique resource in the microbial informatics landscape.  Yet much more information remains untapped in the biomedical literature, and new information is published at a rapid pace.  That information can be extracted only by professional curators who understand both the biology, and the methods for encoding that biology in structured databases.  Without adequate financial resources, we cannot hire these curators, whose efforts are needed on an ongoing basis.

Why Do We Seek Financial Support from the Scientific Community?

The EcoCyc project has been fortunate to receive government funding for its development since 1992.  Similar government-supported databases exist for a handful of biomedical model organisms, such as fly, yeast, worm, and zebrafish.

Peter Karp has been advocating that the government fund similar efforts for other important microbes for the past twenty years, such as for pathogens, biotechnology workhorses, model organisms, and synthetic-biology chassis for biofuels development.  He has developed the Pathway Tools software as a software platform to enable the development of curated EcoCyc-like databases for other organisms, and the software has been used by many groups. However, not only has government support for databases not kept pace with the relentless increases in experimental data generation, but the government is funding few new databases, is cutting funding for some existing databases (such as for EcoCyc, for BioCyc, and for TAIR), and is encouraging the development of other funding models for supporting databases [1].  Funding for BioCyc was cut by 27% at our last renewal whereas we are managing five times the number of genomes as five years ago. We also find that even when government agencies want to support databases, review panels score database proposals with low enthusiasm and misunderstanding, despite the obvious demand for high-quality databases by the scientific community.

Put another way: the Haemophilus influenzae genome was sequenced in 1995.  Now, twenty years later, no curated database that is updated on an ongoing basis exists for this important human pathogen.  Mycobacterium tuberculosis was sequenced in 1998, and now, eighteen years later, no comprehensive curated database exists for the genes, metabolism, and regulatory network of this killer of 1.5 million human beings per year.  No curated database exists for the important gram-positive model organism Bacillus subtilis.  How much longer shall we wait for modern resources that integrate the titanic amounts of information available about critical microbes with powerful bioinformatics tools to turbocharge life-science research?

How it Will Work and How You Can Support BioCyc

The tradition whereby scientific journals receive financial support from scientists in the form of subscriptions is a long one.  We are now turning to a similar model to support the curation and operation of BioCyc.  We seek individual and institutional subscriptions from those who receive the most value from BioCyc, and who are best positioned to direct its future evolution.  We have developed a subscription-pricing model that is on par with journal pricing, although we find that many of our users consult BioCyc on a daily basis — more frequently than they consult most journals.

We hope that this subscription model will allow us to raise more funds, more sustainably, than is possible through government grants, through our wide user base in academic, corporate, and government institutions around the world.  We will also be exploring other possible revenue sources, and additional ways of partnering with the scientific community.

BioCyc is collaborating with Phoenix Bioinformatics to develop our community-supported subscription model.  Phoenix is a nonprofit that already manages community financial support for the TAIR Arabidopsis database, which was previously funded by the NSF and is now fully supported [2] by users.

Phoenix Bioinformatics  will collect BioCyc subscriptions on behalf of SRI International, which like Phoenix is a non-profit institution. Subscription revenues will be invested into curation, operation, and marketing of the BioCyc resource.

We plan to go slow with this transition to give our users time to adapt.  WeÕll begin requiring subscriptions for access to BioCyc databases other than EcoCyc and MetaCyc starting in July 2016.

Access to the EcoCyc and MetaCyc databases will remain free for now.

Subscriptions to the other 7,600 BioCyc databases will be available to institutions (e.g., libraries), and to individuals.  One subscription will grant access to all of BioCyc.  To encourage your institutional library to sign up, please contact your science librarian and let him or her know that continued access to BioCyc is important for your research and/or teaching.

Subscription prices will be based on website usage levels and we hope to keep them affordable so that everyone who needs these databases will still be able to access them.  We are finalizing the academic library and individual prices and will follow up soon with more information including details on how to sign up. We will make provisions to ensure that underprivileged scientists and students in third-world countries arenÕt locked out.

Please spread the word to your colleagues — the more groups who subscribe, the better quality resource we can build for the scientific community.

Peter Karp

Director, SRI Bioinformatics Research Group



Tutorial: Annotating metagenomes with paprica-mg

Sat, 03/26/2016 - 03:00

This tutorial is both a work in progress and a living document.  If you see an error, or want something added, please let me know by leaving a comment.

Starting with version 3.0.0 paprica contains a metagenomic annotation module.  This module takes as input a fasta or fasta.gz file containing the QC’d reads from a shotgun metagenome and uses DIAMOND Blastx to classify these reads against a database derived from the paprica database.  The module produces as output:

  1.  Classification for each read in the form of an EC number (obviously this applies only to reads associated with genes coding for enzymes).
  2. A tally of the occurrence of each EC number in the sample, with some useful supporting information.
  3. Optionally, the metabolic pathways likely to be present within the sample.

In addition to the normal dependencies paprica-mg requires DIAMOND Blastx.  Follow the instructions in the DIAMOND manual, and be sure to add the location of the DIAMOND executables to your PATH.  If you want to predict metabolic pathways on your metagenome you will need to also download the pathway-tools software.  See the notes here.

There are two ways to obtain the paprica-mg database.  You can obtain a pre-made version of the database by downloading the files paprica-mg.dmnd and (large!) to the ref_genome_database directory.  Be sure to gunzip before continuing.  If you wish to build the paprica-mg database from scratch, perhaps because you’ve customized that database or are building it more frequently than the release cycle, you will need to first build the regular paprica database.  Then build the paprica-mg database as such: -ref_dir ref_genome_database

If you’ve set paprica up in the standard way you can be execute this command from anywhere on your system; the paprica directory is already in your PATH, and the script will look for the directory “ref_genome_database” relative to itself.  Likewise you don’t need to be in the paprica directory to execute the script.

Once you’ve downloaded or  built the database you can run your analysis.  It is worth spending a little time with the DIAMOND manual and considering the parameters of your system.  To try things out you can download a “smallish” QC’d metagenome from the Tara Oceans Expedition (selected randomly for convenient size):

## download a test metagenome wget ## execute paprica-mg for EC annotation only -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways F

This will produce the following output:

test.annotation.csv: The number of hits in the metagenome, by EC number.  See the paprica manual for a complete explanation of columns. The DIAMOND format results file.  Only one hit per read is reported. A text file of the DIAMOND results.  Only one hit per read is reported.

Predicting pathways on a metagenome is very time intensive and it isn’t clear what the “correct” way is to do this.  I’ve tried to balance speed with accuracy in paprica-mg.  If you execute with -pathways T, DIAMOND is executed twice; once for the EC number annotation as above (reporting only a single hit for each read), and once to collect data for pathway prediction.  On that search DIAMOND reports as many hits for each read as there are genomes in the paprica database.  Of course most reads will have far fewer (if any) hits.  The reason for this approach is to try and reconstruct as best as possible the actual genomes that are present.  For example, let’s say that a given read has a significant hit to an enzyme in genome A and genome B.  When aggregating information for pathway-tools the enzyme in genome A and genome B will be presented to pathway-tools in separate Genbank files representing separate genetic elements.  Because a missing enzyme in either genome A or genome B could cause a negative prediction for the pathway, we want the best chance of capturing the whole pathway.  So a second enzyme, critical to the prediction of that pathway, might get predicted for only genome A or genome B.  The idea is that the incomplete pathways will get washed out at the end of the analysis, and since pathway prediction is by its nature non-redundant (each pathway can only be predicted once) over-prediction is minimized.  To predict pathways during annotation:

## execute paprica-mg for EC annotation and pathway prediction -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways T -pgdb_dir /location/of/ptools-local

In addition to the files already mentioned, you will see:

test_mg.pathologic: a directory containing all the information that pathway-tools needs for pathway prediction.

test.pathways.txt: A simple text file of all the pathways that were predicted.

test.paprica-mg.txt: A very large text file of all hits for each read.  You probably want to delete this right away to save space.

test.paprica-mg.daa: A very large DIAMOND results file of all hits for each read.  You probably want to delete this right away to save space.

testcyc: A directory in ptools-local/pgdbs/local containing the PGDB and prediction reports.  It is worth spending some time here, and interacting with the PGDB using the pathway-tools GUI.


Antarctic wind-driven bacterial dispersal paper published

Thu, 03/24/2016 - 12:53
Frost flowers over young sea ice in the central Arctic Ocean. Photo Mattias Wietz.

Frost flowers over young sea ice in the central Arctic Ocean. Photo Mattias Wietz.

I’m happy to report that one of the appendices in my dissertation was just published in the journal Polar Biology.  The paper, titled Wind-driven distribution of bacteria in coastal Antarctica: evidence from the Ross Sea region, was a long time in coming.  I conceived of the idea back in 2010 when it looked like my dissertation would focus on the microbial ecology of frost flowers; delicate, highly saline, and microbially enriched structures found on the surface of newly formed sea ice.  Because marine bacteria are concentrated in frost flowers we wondered whether they might serve as source points for microbial dispersal.  This isn’t as far-fetched as it might seem; bacteria are injected into the atmosphere through a variety of physical processes, from wind lofting to bubble-bursting, and frost flowers have been implicated as the source of wind deposited sea salts in glaciers far from the coast.

At the time we’d been struggling to reliably sample frost flowers at our field site near Barrow, Alaska.  Frost flowers form readily there throughout the winter, but extremely difficult sea ice conditions make it hard to access the formation sites.  We knew that there were more accessible formation sites in the coastal Antarctic, so we initiated a one year pilot project to sample frost flowers from McMurdo Sound.  By comparing the bacterial communities in frost flowers, seawater, sea ice, terrestrial snow, and glaciers, we hoped to show that frost flowers were a plausible source of marine bacteria and marine genetic material to the terrestrial environment.  Because the coastal Antarctic contains many relic marine environments, such as the lakes of the Dry Valleys, the wind-driven transport of bacteria from frost flowers and other marine sources could be important for continued connectivity between relic and extant marine environments.

Frost flowers are readily accessible in McMurdo Sound throughout the winter, however, this does not mean that one can simply head out and sample them.  While the ice conditions are far more permissible than at Barrow, Alaska, the bureaucracy is also far more formidable.  The can-do attitude of our Inupiat guides in Barrow (who perceive every far-out field plan as a personal challenge) was replaced with the inevitable can’t-do attitude at McMurdo (this was 2011, under the Raytheon Antarctic Support Contract, and does not reflect on the current Lockheed Antarctic Support Contract, not to suggest that this attitude doesn’t persist).  Arriving in late August we were initially informed that our plan was much to risky without helicopter support, and that nothing could be done until mid-October when the helicopters began flying (we were scheduled to depart late October).  Pushing for a field plan that relied on ground transport ensnared us in various catch-22’s, such as (paraphrased from an actual conversation):

ASC representative: You can’t take a tracked vehicle to the ice edge, they’re too slow.

Me: Can we take a snowmobile to the ice edge?  That would be faster.  We do long mid-winter trips in the Arctic and it works out fine.

ASC representative: No, because you have to wear a helmet, and the helmets give you frostbite.  So you can only use a snowmobile when it’s warm out.

Ultimately we did access the ice edge by vehicle several times before the helicopters started flying, but the samples reported in this publication all came from a furious two week period in late October.  What we found really surprised us.

Sampling frost flowers is as easy as scraping them from the ice surface with a clean shovel into a sterile plastic bin.

Sampling frost flowers on a warm day in September, 2011, in McMurdo Sound.

Pull very short ice cores from young sea ice after sampling frost flowers. You can see one of the short cores in the lower right of the image.

Sampling some of the frost flowers used in this study over much thinner ice north of Ross Island in October, 2011.

 Shelly Carpenter.

Sampling “blue ice” on Taylor Glacier a few days earlier.  Comparing various terrestrial ice environments with the marine samples gave us some interesting insights on how bacteria are dispersed by strong winter winds in the coastal Antarctic.

There is ample evidence for the wind-driven transport of bacteria in this region but, contrary to our hypothesis, most of that material is coming from the terrestrial environment.  The major transportees were a freshwater cyanobacterium from the genus Pseudanabaena and a set of sulfur-oxidizing Gammaproteobacteria (GSO).  The cyanobacterium was pretty easy to understand; it forms mats in a number of freshwater lakes and meltponds in the region.  In the winter these freeze, and since snow cover is low, ice and microbial mats are ablated by strong winter winds.  Little pieces of mat are efficiently scattered all over, including onto the sea ice surface.


Easily dispersed: desiccated microbial mat in Lake Chad, 2005
(photo by A. Chiuchiolo), taken from

The GSO threw us for more of a loop; the most parsimonious explanation for their occurrence in frost flowers is that they came from hydrothermal features on nearby Mt. Erebus.  We did some nice analysis with wind vectors in the region and while you don’t get a lot of wind (integrated over time) to move material from Mt. Erebus to our sample sites, you do get some occasional very strong storms.

Wind velocity and magnitude for the study region in October, 2011. Taken from Bowman and Deming, 2016.

Wind velocity and magnitude for the study region in October, 2011. Taken from Bowman and Deming, 2016.

What all this means is that, consistent with other recent findings, there is high regional dispersal of microbes around the coastal Antarctic.  While I’m sure there are some endemic microbes occupying particularly unique niches, in general I expect microbes found in one part of the coastal Antarctic to be present in a similar environment in a different part of the coastal Antarctica.  There are however, quite a few ways to interpret this.  Bacteria and Archaea can evolve very fast, so the genome of a clonal population of (for example) wind deposited Pseudanabaena newly colonizing a melt pond can diverge pretty fast from the genome of the parent population.  This has a couple of implications.  First it means that the coastal Antarctic, with all it’s complex topography yet high degree of microbial connectivity, is an excellent place to explore the dynamics of microbial adaptation and evolution, particularly if we can put constraints on the colonization timeline for a given site (non trivial).  Second, it raises some questions about the propriety of commercially relevant microbes obtained from the continent.  The commercialization of the continent is probably inevitable (I hope it is not), perhaps the potential ubiquity of Antarctic microbes will provide some defense against the monopolization of useful strains, enzyme, and genes.

Building the paprica database

Fri, 03/11/2016 - 11:04

This tutorial is both a work in progress and a living document.  If you see an error, or want something added, please let me know by leaving a comment.

Build the paprica database provides maximum flexibility with paprica but involves more moving parts and resources than conducting analysis with paprica against the provided database.  Basic instructions for using the script are provided in the manual, this tutorial is intended to provide an even more detailed step-by-step guide.


While a laptop running Linux, VirtualBox, or OSX is perfectly adequate for analysis with paprica, you’ll need something a little beefier for building the database (unless you’re really patient).  A high performance cluster is overkill, I build the provided database on a basic 12 core Linux workstation with 32 Gb RAM (< $5k).  Something in this ballpark should work fine, of course more cores will get the job done faster (but keep an eye on memory useage).

Once you’ve got the hardware requirements sorted out you need to download the dependencies.  I recommend first following all the instructions for the script, then installing RAxML and pathway-tools.  The rest of this tutorial assumes you’ve done just that, including running the test file.

Install remaining dependencies

In addition to all the dependencies required by you need pathway-tools and RAxML.  These are very mainstream programs, but that doesn’t necessarily mean installation is easy.  In particular pathway-tools requires that you request a license (free for academic users).  This takes about 24 hours after which you’ll receive a link to download the installer.  Regardless of whether you’re sitting at the workstation or accessing via SSH a GUI will pop up and guide you through the installation.  In general you can accept the defaults, however, the GUI will ask you where pathway-tools short create to directory ptools-local.  This is where the program will create the pathway-genome databases that describe (among other things) the metabolic pathways in each genome.  By the time you are done creating the database this directory will be > 100 Gb, so pick a location with plenty of space!  This may not be your home directory (the default location).  For example on my system my home directory is housed on a small SSD.  To keep the home directory from becoming bloated I opted to locate ptools-local on a separate SATA drive.

You will receive a number of download options from the pathway-tools development team.  I recommend that you conduct only the basic installation of pathway-tools, and do not download and install additional PGDBs.  Nothing wrong with installing these additional, well-curated PGDBs other than increased space and time, but they become ponderous.  You can always add them later if you want to become a metabolic modeling rock star.

To be continued…

Correctly evaluating metabolic inference methods

Fri, 03/04/2016 - 11:47

Last week I gave a talk at the biennial Ocean Sciences Meeting that included some results from analysis with paprica.  Since paprica is a relatively new method I showed the below figure which is intended to validate the method.  The figure shows a strong correlation for four metagenomes between observed enzyme abundance and enzyme abundance predicted with paprica (from 16S rRNA gene reads extracted from the metagenome).  This is similar to the approach used to validate PICRUSt and Tax4Fun.

Spearman's correlation between predicted and observed enzyme abundance in four marine metagenomes.

Spearman’s correlation between predicted and observed enzyme abundance in four marine metagenomes.

The correlation looks decent, right?  It’s not perfect, but most enzymes are being predicted at close to their observed abundance (excepting the green points where enzyme abundance is over-predicted because metagenome coverage is lower).

After the talk I was approached by a well known microbial ecologist who suggested that I compare these correlations to correlations with a random collection of enzymes.  His concern was that because many enzymes (or genes, or metabolic pathways) are widely shared across genomes any random collection of genomes looks sort of like a metagenome.  I gave this a shot and here are the results for one of the metagenomes used in the figure above.

Correlation between predicted and observed (red) and random and observed (black) enzyme abundances.

Correlation between predicted and observed (red) and random and observed (black) enzyme abundances.

Uh oh.  The correlation is better for predicted than random enzyme abundance, but rho = 0.7 is a really good correlation for the random dataset!  If you think about it however, this makes sense.  For this test I generated the random dataset by randomly selecting genomes from the paprica database until the total number of enzymes equaled the number predicted for the metagenome.  Because there are only 2,468 genomes in the current paprica database (fewer than the total number of completed genomes because only one genome is used for each unique 16S rRNA gene sequence) the database gets pretty well sampled during random selection.  As a result rare enzymes (which are also usually rare in the metagenome) are rare in the random sample, and common enzymes (also typically common in the metagenome) are common.  So random ends up looking a lot like observed.

It was further suggested that I try and remove core enzymes for this kind of test.  Here are the results for different definitions of “core”, ranging from enzymes that appear in less than 100 % of genomes (i.e. all enzymes, since no EC numbers appeared in all genomes) to those that appear in less than 1 % of genomes.

The difference between the random and predicted correlations does change as the definition of the core group of enzymes changes.  Here’s the data aggregated for all four metagenomes in the form of a sad little Excel plot (error bars give standard deviation).

delta_correlationThis suggests to me a couple of things.  First, although I was initially surprised at the high correlation between a random and observed set of enzymes, I’m heartened that paprica consistently does better.  There’s plenty of room for improvement (and each new build of the database does improve as additional genomes are completed – the last build added 78 new genomes, see the current development version) but the method does work.  Second, that we obtain maximum “sensitivity”, defined as improvement over the random correlation, for enzymes that are present in fewer than 10 % of the genomes in that database.  Above that and the correlation is inflated (but not invalidated) by common enzymes, below that we start to lose predictive power.  This can be seen in the sharp drop in the predicted-random rho (Δrho: is it bad form to mix greek letters with the English version of same?) for enzymes present in less than 1 % of genomes.  Because lots of interesting enzymes are not very common this is where we have to focus our future efforts.  As I mentioned earlier some improvement in this area is automatic; each newly completed genome improves our resolution.

Some additional thoughts on this.  There are parameters in paprica that might improve Δrho.  The contents of closest estimated genomes are determined by a cutoff value – the fraction of descendant genomes a pathway or enzyme appears in.  I redid the Δrho calculations for different cutoff values, ranging from 0.9 to 0.1.  Surprisingly this had only a minor impact on Δrho.  The reason for this is that most of the 16S reads extracted from the metagenomes placed to closest completed genomes (for which cutoff is meaningless) rather than closest estimated genomes.  An additional consideration is that I did all of these calculations for enzyme predictions/observations instead of metabolic pathways.  The reason for this is that predicting metabolic pathways on metagenomes is rather complicated (but doable).  Pathways have the advantage of being more conserved than enzymes however, so I expect to see an improved Δrho when I get around to redoing these calculations with pathways.

Something else that’s bugging me a bit… metagenomes aren’t sets of randomly distributed genomes.  Bacterial community structure is usually logarithmic, with a few dominant taxa and a long tail of rare taxa.  The metabolic inference methods by their nature capture this distribution.  A more interesting test might be to create a logarithmically distributed random population of genomes, but this adds all kinds of additional complexities.  Chief among them being the need to create many random datasets with different (randomly selected) dominant taxa.  That seems entirely too cumbersome for this purpose…

So to summarize…

  1.  Metabolic inference definitively outperforms random selection.  This is good, but I’d like the difference (Δrho) to be larger than it is.
  2. It is not adequate to validate a metabolic inference technique using correlation with a metagenome alone.  The improvement over a randomly generated dataset should be used instead.
  3. paprica, and probably other metabolic inference techniques, have poor predictive power for rare (i.e. very taxonomically constrained) enzymes/pathways.  This shouldn’t surprise anyone.
  4. Alternate validation techniques might be more useful than correlating with the abundance of enzymes/pathways in metagenomes.  Alternatives include correlating the distance in metabolic structure between samples with distance in community structure, as we did in this paper, or correlating predictions for draft genomes.  In that case it would be necessary to generate a distribution of correlation values for the draft genome against the paprica (or other method’s) database, and see where the correlation for the inferred metabolism falls in that distribution.  Because the contents of a draft genome are a little more constrained than the contents of a metagenome I think I’m going to spend some time working on this approach…

Analysis with paprica

Mon, 01/25/2016 - 12:42

paprikaThis tutorial is both a work in progress and a living document.  If you see an error, or want something added, please let me know by leaving a comment.

I’ve been making a lot of improvements to paprica, our program for conducting metabolic inference on 16S rRNA gene sequence libraries.  The following is a complete analysis example with paprica to clarify the steps described in the manual, and to highlight some of the recent improvements to the method.  I’ll continue to update this tutorial as the method evolves.  This tutorial assumes that you have all the dependencies for installed and in your PATH.  If you’re a Mac user you can follow the instructions here.  If you’re a Linux user (including Windows users running Linux in a VirtualBox) installation is a bit simpler, just follow the instructions in the manual.  The current list of dependencies are:

  1.  Infernal (including Easel)
  2.  pplacer (including Guppy)
  3. Python version 2.7 (I strongly recommend using Anacondas)
  4. Seqmagick

It also assumes that you have the following Python modules installed:

  1.  Pandas
  2.  Joblib
  3.  Biopython

Finally, it assumes that you are using the provided database of metabolic pathways and genome data included in the ref_genome_database directory.  At some point in the future I’ll publish a tutorial describing how to use to build a custom database.  All the dependencies are installed and tested?  Before we start let’s get familiar with some terminology.

community structure: The taxonomic structure of a bacterial assemblage.

edge: An edge is a point of placement on a reference tree.  Think of it as a branch of the reference tree.  Edges take the place of OTUs in this workflow, and are ultimately far more interesting and informative than OTUs.  Refer to the pplacer documentation for more.

metabolic structure: The abundance of different metabolic pathways within a bacterial assemblage.

reference tree: This is the tree of representative 16S rRNA gene sequences from all completed Bacterial genomes in Genbank.  The topology of the tree defines what pathways are predicted for internal branch points.

Now let’s get paprica.  There are two options here.  Option 1 is to use git to download the development version.  The development version has the most recent bug fixes and features, but has not been fully tested.  That means that it’s passed a cursory test of and on my development system (a Linux workstation), but I haven’t yet validated on an independent machine (a Linux VirtualBox running on my Windows laptop).  The development version can be downloaded and made executable with:

git clone cd paprica chmod a+x *sh

Option 2 is to download the last stable version.  In this case stable means that and successfully built on the development system, successfully ran on a virtual box, and that and both successfully ran a second time from the home directory of the development machine.  The database produced by this run is the one that can be found in the latest stable version.  Downloading and making the latest stable version executable is the recommended way of getting paprica, but requires a couple of additional steps.  For the current stable release (v0.22):

wget tar -xzvf paprica_v0.22.tar.gz mv paprica-paprica_v0.22 paprica cd paprica chmod a+x *sh

Now using your text editor of choice (I recommend nano) you should open the file titled paprica_profile.txt.  This file will look something like:

## Variables necessary for the scripts associated with and # This is the location of the reference directory. ref_dir=~/paprica/ref_genome_database/ # This is the location of the covariance model used by Infernal. cm=~/paprica/ ## Variables associated with only. # This is the number of cpus RAxML will use. See RAxML manual for guidance. cpus=8 # This is the location where you want your pgdbs placed. This should match # what you told pathway-tools, or set to pathway-tools default location if # you didn't specify anything. pgdb_dir=~/ptools-local/pgdbs/user/ ## Variables associated with only. # The fraction of terminal daughters that need to have a pathway for it # to be included in an internal node. cutoff=0.5

For the purposes of this tutorial, and a basic analysis with paprica, we are only concerned with the ref_dir, cm, and cutoff variables.  The ref_dir variable is the location of the reference database.  If you downloaded paprica to your home directory, and you only intend to use, you shouldn’t need to change it.  Ditto for cm, which is the location of the covariance model used by Infernal.  The cutoff variable specifies in what fraction of genomes in a given clade a metabolic pathway needs to appear in to be assigned to a read placed to that clade.  In practice 50 % works well, but you may wish to be more or less conservative depending on your objectives.  If you want to change it simply edit that value to be whatever you want.

Now go ahead and make sure that things are working right by executing using the provided file test.fasta.  From the paprica directory:

./ test

This will produce a variety of output files in the paprica directory:

ls test* test.clean.align.sto test.clean.fasta test.edge_data.csv test.fasta test.pathways.csv test.sample_data.txt test.sum_pathways.csv

Each sample fasta file that you run will produce similar output, with the following being particularly useful to you: This is a file produced by pplacer that contains the placement information for your sample.  You can do all kinds of interesting things with sets of jplace files using Guppy.  Refer to the Guppy documentation for my details. This is a “fat” style tree showing your the placements of your query on the reference tree.  You can view this tree using Archaeopteryx.

test.edge_data.csv: This is a csv format file containing data on edge location in the reference tree that received a placement, such as the number of reads that placed, predicted 16S rRNA gene copies, number of reads placed normalized to 16S rRNA gene copies, GC content, etc.  This file describes the taxonomic structure of your sample.

test.pathways.csv: This is a csv file of all the metabolic pathways inferred for test.fasta, by placement.  All possible metabolic pathways are listed, the number attributed to each edge is given in the column for that edge.

test.sample_data.txt: This file described some basic information for the sample, such as the database version that was used to make the metabolic inference, the confidence score, total reads used, etc.

test.sum_pathways.csv: This csv format file describes the metabolic structure of the sample, i.e. pathway abundance across all edges.

Okay, that was all well and good for the test.fasta file, which has placements only for a single edge and is not particularly exciting.  Let’s try something a bit more advanced.  Create a directory for some new analysis on your home directory and migrate the necessary paprica files to it:

cd ~ mkdir my_analysis cp paprica/ my_analysis cp paprica/paprica_profile.txt my_analysis cp paprica/ my_analysis cp paprica/ my_analysis

Now add some fasta files for this tutorial.  The two fasta files are from Luria et al. 2014 and Bowman and Ducklow, 2015.  They are a summer and winter surface sample from the same location along the West Antarctic Peninsula.  I recommend wget for downloads of this sort, if you don’t have it, and don’t want to install it for some reason, use curl.

cd my_analysis wget wget

We’re cheating a bit here, because these samples have already been QC’d.  That means I’ve trimmed for quality and removed low quality reads, removed chimeras, and identified and removed mitochondria, chloroplasts, and anything that didn’t look like it belonged to the domain Bacteria.  I used Mothur for all of these tasks, but you may wish to use other tools.

Run time may be a concern for you if you have many query files to run, and/or they are particularly large.  The rate limiting step in paprica is pplacer.  We can speed pplacer up by telling to split the query fasta into several pieces that pplacer will run in parallel.  Be careful of memory useage!  pplacer creates two threads automatically when it runs, and each thread uses about 4 Gb of memory.  So if you’re system has only 2 cpus and 8 Gb of memory don’t use this option!  If you’re system has 32 Gb of RAM I’d recommend 3 splits, so that you don’t max things out.

While we’re modifying run parameters let’s make one additional change.  The two provided files have already been subsampled so that they have equal numbers of reads (1,977).  We can check this with:

grep -c '>' *fasta summer.fasta:1977 winter.fasta:1977

But suppose this wasn’t the case?  It’s generally a good idea to subsample your reads to the size of the smallest library so that you are viewing diversity evenly across samples.  You can get paprica to do this for you by specifying the number of reads should use.

To specify the number of splits and the number of reads edit the flags in

## default line #python -query $query -ref -splits 1 ## new line python -query $query -ref -splits 3 -n 1000

This will cause paprica to subsample the query file (by random selection) to 1000 reads, and split the subsampled file into three query files that will be run in parallel. The parallel run is blind to you, the output should be identical to a run with no splits (-splits 1). If you use subsampling you’ll also need to change the line, as the input file name will be slightly different.

## default line python -i $ -o $query ## new line python -i $ -o $query

Here we are only analyzing two samples, so running them manually isn’t too much of a pain. But you might have tens or hundreds of samples, and need a way to automate that. We do this with a simple loop. I recommend generating a file with the prefixes of all your query files and using that in the loop. For example the file samples.txt might have:

summer winter

This file can be inserted into a loop as:

while read f;do ./ $f done < samples.txt

Note that we don’t run them in parallel using say, gnu parallel, because Infernal is intrinsically parallelized, and we already forced pplacer to run in parallel using -splits.

Once you’ve executed the loop you’ll see all the normal paprica output, for both samples.  It’s useful to concatenate some of this information for downstream analysis.   The provided utility can do this for you.  Copy it to your working directory:

cp ~/paprica/utilities/

This script will automatically aggregate everything with the suffix .edge_data.csv.  You need to specify a prefix for the output files.

python my_analysis

This produces two files:

my_analysis.edge_data.csv: This is the mean genome parameters for each sample.  Lots of good stuff in here, see the column labels.

my_analysis.edge_tally.csv: Edge abundance for each sample (corrected for 16S rRNA gene copy).  This is your community structure, and is equivalent to an OTU table (but much better!).

To be continued…

Installing paprica on Mac OSX

Wed, 01/20/2016 - 09:55

The following is a paprica installation tutorial for novice users on Mac OSX (installation is Linux is quite a bit simpler). If you’re comfortable editing your PATH and installing things using bash you probably don’t need to follow this tutorial, just follow the instructions in the manual. If command line operations stress you out, and you haven’t dealt with a lot of weird bioinformatics program installs, use this tutorial.

Please note that this tutorial is a work in progress.  If you notice errors, inconsistencies, or omissions please leave a comment and I’ll be sure to correct them.

paprica is 90 % an elaborate wrapper script (or set of scripts) for several core programs written by other groups. The scripts that execute the pipeline are bash scripts, the scripts that do that actual work are Python. Therefor you need to get Python up and running on your system. The version that came with your system won’t suffice without heavy modification. Best to use a free third-party distro like Anaconda (preferred) or Canopy.  If you already have a mainstream v2.7 Python distro going just make sure that the biopython, joblib, and pandas modules are installed and you’re good to go.

If not please download the Anaconda distro and install it following the developer’s instructions. Allow the installer to modify your PATH variable. Once the installation is complete update it by executing:

conda update conda conda update --all

Then you’ll need to install biopython, joblib, and pandas:

conda install biopython conda install joblib conda install pandas

In case you have conflicts with other Python installations, or some other mysterious problems, it’s a good idea to test things out at this point. Open a shell, type “Python”, and you should get a welcome message that specifies Anaconda as your distro. Type:

import Bio import joblib import pandas

If you get any error messages something somewhere is wrong. Burn some incense and try again. If that doesn’t work try holy water.

One challenge with paprica on OSX has to do with the excellent program pplacer. The pplacer binary for Darwin needs the Gnu Scientific Library (GSL), specifically v1.6 (at the time of writing). You can try to compile this from source, but I’ve had trouble getting this to work on OSX. The easier option is to use a package manager, preferably Homebrew. This means however, that you have to marry one of the OSX package managers and never look back. Fink, Macports, and Homebrew will all get you a working version of GSL. I recommend using Homebrew.

To download Homebrew (assuming you don’t already have it) type:

ruby -e "$(curl -fsSL"

Follow the on-screen instructions. Once it is downloaded type:

brew install GSL

This should install the Gnu Scientific Library v1.6.

Assuming all that went okay go ahead and download the software you need to execute just the portion of paprica. First, the excellent aligner Infernal. From your home directory:

curl -O tar -xzvf infernal-1.1.1-macosx-intel.tar.gz mv infernal-1.1.1-macosx-intel infernal

Then pplacer, which also includes Guppy:

curl -O unzip mv pplacer-Darwin-v1.1.alpha17 pplacer

Now comes the tricky bit, you need to add the locations of the executables for these programs to your PATH variable. Don’t screw this up. It isn’t hard to undo screw-ups, but it will freak you out. Before you continue please read the excellent summary of shell startup scripts as they pertain to OSX here:

Assuming that you are new to the command line, and did not have a .bash_profile or .profile file already, the Anaconda install would have created .profile and added it’s executables to your path. From your home directory type:

nano .profile

Navigate to the end of the file and type:

export PATH=/Users/your-user-name/infernal/binaries:/Users/you-user-name/pplacer:${PATH}

Don’t be the guy or gal who types your-user-name. Replace with your actual user name. Hit ctrl-o to write out the file, and ctrl-x to exit nano. Re-source .profile by typing:

source .profile

Confirm that you can execute the following programs by navigating to your home directory and executing each of the following commands:

cmalign esl-alimerge pplacer guppy

You should get an error message that is clearly from the program, not a bash error like “command not found”.

Now you need to install the final dependency, Seqmagick. Confirm the most current stable release by going to Github, then download it:

curl -O tar -xzvf 0.6.1 cd 0.6.1 python install

Check the installation by typing:

seqmagick mogrify

You should get a sensible error that is clearly seqmagick yelling at you.

Okay, now you are ready to download paprica and do some analysis! Download the latest stable version of paprica (don’t just blindly download, please check Github for the latest stable release):

curl -O tar -xzvf mv paprica-paprica_v0.23 paprica

Now you need to make executable

cd paprica chmod a+x

At this point you should be ready to rock. Take a deep breath and type:

./ test

You should see a lot of output flash by on the screen, and you should find that the files test.pathways.csv, test.edge_data.csv, test.sample_data.txt, and test.sum_pathways.txt in your directory. These are the primary output files from paprica. The other files of interest are the Guppy output files and Check out the Guppy documentation for the many things you can do with jplace files. The phyloxml file is an edge fattened tree of the query placements on the reference tree. It can be viewed using Archaeopteryx or another phyloxml capable tree viewer.

To run your own analysis, say on amazing_sample.fasta, simply type:

./ amazing_sample

Please, please, please, read the manual (included in the paprica download) for further details, such as how to greatly decrease the run time on large fasta files, and how to sub-sample your input fasta. Remember that the fasta file you input should contain only reads you are reasonably sure come from bacteria (an archaeal version is a long term goal), and they should be properly QC’d (i.e. low quality ends and adapters and barcodes and such trimmed away).