Chasing Microbes in Antarctica

Bowman Lab: Marine microbial ecology

Microbial session at POLAR 2018 in Davos

Tue, 10/03/2017 - 10:38

Davos, Switzerland, site of the POLAR2018 conference.

With colleagues Maria Corsaro, Eric Collins, Maria Tutino, Jody Deming, and Julie Dinasquet I’m convening a session on polar microbial ecology and evolution at the upcoming POLAR2018 conference in Davos, Switzerland.  POLAR2018 is shaping up to be a unique and excellent conference; for the first time (I think) the major international Arctic science organization (IASC) is joining forces with the major international Antarctic science organization (SCAR) for a joint meeting.  Because many polar scientists specialize in one region, and thus have few opportunities to learn from the other, a joint meeting will be a great opportunity to share ideas.

Here’s the full-text description for our session.  If your work is related to microbial evolution, adaptation, or ecological function in the polar regions I hope you’ll consider applying!


Denitrifying bacteria and oysters

Thu, 09/21/2017 - 16:03

The eastern oyster.  Denitrification never tasted so good!

I’m happy to be co-author on a study that was just published by Ann Arfken, a PhD student at the Virginia Institute of Marine Science (VIMS).  The study evaluated the composition of the microbial community associated with the eastern oyster Crassostrea virginica to determine if oysters play a role in denitrification.  Denitrification is an ecologically significant anaerobic microbial metabolism.  In the absence of oxygen certain microbes use other oxidized compounds as electron acceptors.  Nitrate (NO3–) is a good alternate electron acceptor, and the process of reducing nitrate to nitrite (NO2–), and ultimately to elemental nitrogen (N2), is called denitrification.  Unfortunately, nitrate is needed by photosynthetic organisms like plants and phytoplankton, so the removal of nitrate can be detrimental to ecosystem health.  Oxygen is easily depleted in the guts of higher organisms by high rates of metabolic activity, creating a niche for denitrification and other anaerobic processes.
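Nitrite isn’t reduced all the way to N2 in a single step; the canonical denitrification sequence passes through two gaseous intermediates, with each step catalyzed by a different reductase (encoded by the nar, nir, nor, and nos genes, respectively):

```
NO3–  →  NO2–  →  NO  →  N2O  →  N2
    nar      nir     nor     nos
```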

Predicted relative abundance (paprica) as a function of measured (qPCR) relative abundance of nosZI genes.  From Arfken et al. 2017.

To evaluate denitrification in C. virginica, Arfken et al. coupled actual measurements of denitrification in sediments and oysters with an analysis of microbial community structure in oyster shells and digestive glands.  We then used paprica with a customized database to predict the presence of denitrification genes, and validated the predictions with qPCR.

I was particularly happy to see that the qPCR results agreed well with the paprica predictions for the nosZ gene, which codes for the enzyme responsible for reducing nitrous oxide (N2O) to N2.  I believe this is the first example of qPCR being used to validate metabolic inference – which currently lacks a good method for validation.  Surprisingly, however (at least to me), denitrification in C. virginica was largely associated with the oyster shell rather than the digestive gland.  We don’t really know why this is.  Arfken et al. suggest rapid colonization of the oyster shell by denitrifying bacteria that also produce antibiotic compounds to exclude predators, but further work is needed to demonstrate this!

Analyzing flow cytometry data with R

Fri, 08/11/2017 - 11:33

We recently got our CyFlow Space flow cytometer in the lab and have been working out the kinks.  From a flow cytometry perspective the California coastal environment is pretty different from the western Antarctic Peninsula where I’ve done most of my flow cytometry work.  Getting my eyes calibrated to a new flow cytometer and the coastal California environment has been an experience.  Helping me on this task is Tia Rabsatt, a SURF REU student from the US Virgin Islands.  Tia will be heading home in a couple of weeks, which presents a challenge: once she leaves she won’t have access to the proprietary software that came with the flow cytometer.  To continue analyzing the data she collected over the summer as part of her project she’ll need a different solution.

To give her a way to work with the FCS files I put together a quick R script that reads in the file, sets some event limits, and produces a nice plot.  With a little modification one could “gate” and count different regions.  The script uses the flowCore package to read in the FCS format files, and the hist2d command in gplots to make a reasonably informative plot.

library('flowCore')
library('gplots')

#### parameters ####

f.name <- ''        # name of the file you want to analyze, file must have extension ".FCS"
sample.size <- 1e5  # number of events to plot, use "max" for all points
fsc.ll <- 1         # FSC lower limit
ssc.ll <- 1         # SSC lower limit
fl1.ll <- 1         # FL1 lower limit (ex488/em536)

#### functions ####

## plotting function
plot.fcm <- function(fcm, x.param, y.param){
  hist2d(log10(fcm[,x.param]),
         log10(fcm[,y.param]),
         col = c('grey', colorRampPalette(c('white', 'lightgoldenrod1', 'darkgreen'))(100)),
         nbins = 200,
         bg = 'grey',
         ylab = paste0('log10(', y.param, ')'),
         xlab = paste0('log10(', x.param, ')'))
  box()
}

#### read in file ####

fcm <- read.FCS(paste0(f.name, '.FCS'))
fcm <- as.data.frame(exprs(fcm))

#### analyze file and make plot ####

## eliminate values that are below or equal to the thresholds you
## defined above
fcm$SSC[fcm$SSC <= ssc.ll|fcm$FSC <= fsc.ll|fcm$FL1 <= fl1.ll] <- NA
fcm <- na.omit(fcm)

fcm.sample <- fcm

if(sample.size != 'max'){
  try({fcm.sample <- fcm[sample(length(fcm$SSC), sample.size),]}, silent = T)
}

## plot events in a couple of different ways
plot.fcm(fcm.sample, 'FSC', 'SSC')
plot.fcm(fcm.sample, 'FSC', 'FL1')

## make a presentation quality figure
png(paste0(f.name, '_FSC', '_FL1', '.png'),
    width = 2000,
    height = 2000,
    pointsize = 50)
plot.fcm(fcm.sample, 'FSC', 'FL1')
dev.off()

And here’s the final plot:
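To illustrate the gating idea mentioned above, here’s a minimal sketch that counts events inside a rectangular region in log10 space.  The simulated events and the gate limits are stand-ins for illustration, not values from a real FCS file:

```r
## simulated events standing in for the FSC and FL1 columns of a real FCS file
set.seed(1)
fcm.sample <- data.frame(FSC = 10 ^ runif(1000, 1, 4),
                         FL1 = 10 ^ runif(1000, 1, 4))

## rectangular gate in log10 space (arbitrary example limits)
gate.fsc <- c(2, 3)
gate.fl1 <- c(2.5, 4)

## flag events falling inside the gate, then count them
in.gate <- log10(fcm.sample$FSC) >= gate.fsc[1] &
           log10(fcm.sample$FSC) <= gate.fsc[2] &
           log10(fcm.sample$FL1) >= gate.fl1[1] &
           log10(fcm.sample$FL1) <= gate.fl1[2]
sum(in.gate)
```

The same logic extends to polygonal gates with a point-in-polygon test (e.g. sp::point.in.polygon).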

OPAG comes to Scripps Institution of Oceanography

Thu, 08/03/2017 - 19:54

Artist’s rendition of the Dawn spacecraft approaching Ceres.  Original image can be found here.

I’m excited to be hosting the fall meeting of NASA’s Outer Planets Assessment Group (OPAG) here at Scripps Institution of Oceanography in September.  For planetary scientists at UCSD, SDSU, USD, and other institutions in the greater San Diego area, if you’ve never attended the OPAG meeting here’s your chance!  The meeting will be September 6 and 7 at the Samuel H. Scripps Auditorium.  More details can be found here.

What are assessment groups, and what is OPAG specifically?  The assessment groups are an excellent NASA innovation to encourage dialogue within the scientific community, and between the scientific community and NASA HQ.  There’s usually a little tense dialogue – in a good way – between these two ends of the scientific spectrum.  I often wish NSF had a similar open-format dialogue with its user community!  The OPAG meeting takes the form of 15- or 30-minute presentations on a range of topics relevant to the community.  These are often mission updates, planning updates for future missions, and preliminary results from the analysis of mission data.  NASA has quite a few assessment groups, ranging from the Small Body Assessment Group (SBAG – probably the AG with the catchiest acronym) to the Mars Exploration Program Analysis Group (MEPAG).  OPAG covers everything in the solar system farther from the sun than Mars.  If that covers your favorite planetary body, come and check it out!

It’s traditional to have a public evening lecture with the OPAG meeting.  For the upcoming meeting the lecture will be given at 7 pm on September 6 at the Samuel Scripps Auditorium by my friend and colleague Britney Schmidt, a planetary scientist from Georgia Tech and an expert on Europa and on Antarctic ice sheets.  Why and how one can develop that dual expertise will certainly be made clear in her talk.  There is no cost to attend, but an RSVP is recommended.  You can find more details and RSVP here.

Microbial community segmentation with R

Thu, 02/02/2017 - 18:39

In my previous post I discussed our recent paper in ISME J, in which we used community structure and flow cytometry data to predict bacterial production.  The insinuation is that if you know community structure, and have the right measure of physiology, you can make a decent prediction of any microbial ecosystem function.  The challenge is that community structure data, which often has hundreds or thousands of dimensions (taxa, OTUs, etc.), is not easily used in straightforward statistical models.   Our workaround is to reduce the community structure data from many dimensions to a single categorical variable represented by a number.  We call this process segmentation.

You could carry out this dimension reduction with pretty much any clustering algorithm; you’re simply grouping samples with like community structure characteristics on the assumption that like communities will have similar ecosystem functions.  We use the emergent self-organizing map (ESOM), a neural network algorithm, because it allows new data to be classified into an existing ESOM.  For example, imagine that you are collecting a continuous time series of microbial community structure data.  You build an ESOM to segment your first few years of data; subsequent samples can then be quickly classified into the existing model.  Thus the taxonomic, physiological, and ecological characteristics of the segments are stable over time.  There are other benefits to using an ESOM.  One is that with many samples (far more than we had in our study), the ESOM is capable of resolving patterns that many other clustering techniques cannot.

There are many ways to construct an ESOM.  I haven’t tried a Python-based approach, although I’m keen to explore those methods.  For the ISME J paper I used the Kohonen package in R, which has a nice publication that describes some applications and is otherwise reasonably well documented.  To follow this tutorial you can download our abundance table here.  Much of the inspiration, and some of the code for this analysis, follows the (retail) customer segmentation example given here.

For this tutorial you can download a table of the closest estimated genomes and closest completed genomes (analogous to an abundance table) here.  Assuming you’ve downloaded the data into your working directory, fire up Kohonen and build the ESOM.

## Kohonen needs a numeric matrix
edge.norm <- as.matrix(read.csv('community_structure.csv', row.names = 1))

## Load the library
library('kohonen')

## Define a grid.  The bigger the better, but you want many fewer units in the grid
## than samples.  1:5 is a good ballpark, here we are minimal.
som.grid <- somgrid(xdim = 5, ydim = 5, topo = "hexagonal")

## Now build the ESOM!  It is worth playing with the parameters, though in
## most cases you will want the circular neighborhood and toroidal map structure.
som.model.edges <- som(edge.norm,
                       grid = som.grid,
                       rlen = 100,
                       alpha = c(0.05, 0.01),
                       keep.data = TRUE,
                       n.hood = "circular",
                       toroidal = T)

Congratulations!  You’ve just constructed your first ESOM.  Pretty easy.  You’ve effectively clustered the samples into the 25 units that define the ESOM.  You can visualize this as such:

plot(som.model.edges, type = 'mapping', pch = 19)

There are the 25 map units, with the toroid split and flattened into 2D.  Each point is a sample (row in the abundance table), positioned in the unit that best reflects its community structure.  I’m not going to go into any depth on the ESOM algorithm, which is quite elegant, but the version implemented in the Kohonen package is based on Euclidean distance.  How well each map unit represents the samples positioned within it is represented by the distance between the map unit and each sample.  This can be visualized with:

plot(som.model.edges, type = 'quality', pch = 19, palette.name = topo.colors)

Units with shorter distances in the plot above are better defined by the samples in those units than units with long distances.  What distance is good enough depends on your data and objectives.

The next piece is trickier because there’s a bit of an art to it.  At this point each sample has been assigned to one of the 25 units in the map.  In theory we could call each map unit a “segment” and stop here.  It’s beneficial however, to do an additional round of clustering on the map units themselves.  Particularly on large maps (which clearly this is not) this will highlight major structural features in the data.  Both k-means and hierarchical clustering work fairly well, anecdotally k-means seems to work better with smaller maps and hierarchical with larger maps, but you should evaluate for your data.  Here we’ll use k-means.  K-means requires that you specify the number of clusters in advance, which is always a fun chicken and egg problem.  To solve it we use the within-clusters sum of squares method:

wss.edges <- (nrow(som.model.edges$codes) - 1) * sum(apply(som.model.edges$codes, 2, var))
for (i in 2:15) {
  wss.edges[i] <- sum(kmeans(som.model.edges$codes, centers = i)$withinss)
}

plot(wss.edges,
     pch = 19,
     ylab = 'Within-clusters sum of squares',
     xlab = 'K')

Here’s where the art comes in.  Squint at the plot and try to decide the inflection point.  I’d call it 8, but you should experiment with whatever number you pick to see if it makes sense downstream.

We can make another plot of the map showing which map units belong to which clusters:

k <- 8
som.cluster.edges <- kmeans(som.model.edges$codes, centers = k)

plot(som.model.edges,
     main = '',
     type = "property",
     property = som.cluster.edges$cluster,
     palette.name = topo.colors)
add.cluster.boundaries(som.model.edges, som.cluster.edges$cluster)

Remember that the real shape of this map is a toroid and not a square.  The colors represent the final “community segmentation”; the samples belong to map units, and the units belong to clusters.  In our paper we termed these clusters “modes” to highlight the fact that there are real ecological properties associated with them, and that (unlike clusters) they support classification.  To get the mode of each sample we need to index the sample-unit assignments against the unit-cluster assignments.  It’s a little weird until you get your head wrapped around it:

som.cluster.edges$cluster[som.model.edges$unit.classif]

 [1] 5 7 7 5 2 7 5 3 7 5 2 6 1 1 1 7 5 4 7 7 5 7 7 7 7 7 7 1 4 4 4 4 7 7 7 6 6 6 6 1 1 1 7 5 5 5 1 1 1 5 5 7 7 4 8 7 7 4 7 8
[61] 7 7 7 7 6 5 6 7 7 7 6 4 6 5 4 4 6 2 1 1 1 1 1 4 1 4 4 4

A really important thing to appreciate about these modes is that they are not ordered or continuous.  Mode 4 doesn’t necessarily have more in common with mode 5, say, than with mode 1.  For this reason it is important to treat the modes as factors in any downstream analysis (e.g. in linear modeling).  For our analysis I had a dataframe with bacterial production, chlorophyll concentration, bacterial abundance, and predicted genomic parameters from paprica.  By saving the mode data as a new variable in the dataframe, and converting the dataframe to a zoo time series, it was possible to visualize the occurrence of modes, model the data, and test the pattern of modes for evidence of succession.  Happy segmenting!
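One of the key benefits of the ESOM approach discussed earlier is that new samples can be classified into the existing map without rebuilding it.  Here’s a minimal, self-contained sketch with toy data standing in for an abundance table; the grid size, cluster count, and object names are arbitrary (the $codes check accommodates newer kohonen versions, which store codes as a list):

```r
library('kohonen')

## toy abundance table: 60 samples by 10 taxa
set.seed(1)
edge.norm <- matrix(runif(600), nrow = 60)

## train a small ESOM and cluster the map units, as above
som.grid <- somgrid(xdim = 3, ydim = 3, topo = "hexagonal")
som.model <- som(edge.norm, grid = som.grid, rlen = 100)

## $codes is a matrix in older kohonen versions and a list in newer ones
codes <- if(is.list(som.model$codes)) som.model$codes[[1]] else som.model$codes
som.cluster <- kmeans(codes, centers = 4)

## classify five "new" samples into the existing map, then look up their modes
new.data <- matrix(runif(50), nrow = 5)
new.units <- map(som.model, newdata = new.data)$unit.classif
new.modes <- som.cluster$cluster[new.units]
```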



New paper published in ISME Journal

Fri, 01/20/2017 - 19:36

I’m happy to report that a paper I wrote during my postdoc at the Lamont-Doherty Earth Observatory was published online today in the ISME Journal.  The paper, Bacterial community segmentation facilitates the prediction of ecosystem function along the coast of the western Antarctic Peninsula, uses a novel technique to “segment” the microbial community present in many different samples into a few groups (“modes”) that have specific functional, ecological, and genomic attributes.  The inspiration for this came when I stumbled across this blog entry on an approach used in marketing analytics.  Imagine that a retailer has a large pool of customers that it would like to pester with ads tailored to purchasing habits.  It’s too cumbersome to develop an individualized ad based on each customer’s habits, and it isn’t clear what combination of purchasing-habit parameters accurately describes meaningful customer groups.  Machine learning techniques, in this case emergent self-organizing maps (ESOMs), can be used to sort the customers in a way that optimizes their similarity and limits the risk of overtraining the model (including parameters that don’t improve the model).

In a 2D representation of an ESOM, the customers most like one another will be organized in geographically coherent regions of the map.  Hierarchical or k-means clustering can be superimposed on the map to clarify the boundaries between these regions, which in this case might represent customers that will respond similarly to a targeted ad.  But what’s really cool about this whole approach is that, unlike with NMDS or PCA or other multivariate techniques based on ordination, new customers can be efficiently classified into the existing groups.  There’s no need to rebuild the model unless a new type of customer comes along, and it is easy to identify when this occurs.

Back to microbial ecology.  Imagine that you have a lot of samples (in our case a five year time series), and that you’ve described community structure for these samples with 16S rRNA gene amplicon sequencing.  For each sample you have a table of OTUs, or in our case closest completed genomes and closest estimated genomes (CCGs and CEGs) determined with paprica.  You know that variations in community structure have a big impact on an ecosystem function (e.g. respiration, or nitrogen fixation), but how to test the correlation?  There are statistical methods in ecology that get at this, but they are often difficult to interpret.  What if community structure could be represented as a simple value suitable for regression models?

Enter microbial community segmentation.  Following the customer segmentation approach described above, the samples can be segmented into modes based on community structure with an Emergent Self Organizing Map and k-means clustering.  Here’s what this looks like in practice:

From Bowman et al. 2016.  Segmentation of samples based on bacterial community structure.  C-I show the relative abundance of CEGs and CCGs in each map unit.  This value was determined iteratively while the map was trained, and reflects the values for samples located in each unit (B).

This segmentation reduces the data for each sample from many dimensions (the number of CCGs and CEGs present in each sample) to one.  This remaining dimension is a categorical variable with real ecological meaning that can be used in linear models.  For example, each mode has certain genomic characteristics:

From Bowman et al. 2016.  Genomic characteristics of modes (a and b), and metabolic pathways associated with taxa that account for most of the variations in composition between modes (d).

In panel a above we see that samples belonging to modes 5 and 7 (dominated by the CEG Rhodobacteraceae and CCG Dokdonia MED134, see Fig. 2 above) have the greatest average number of 16S rRNA gene copies.  Because this is a characteristic of fast growing, copiotrophic bacteria, we might also associate these modes with high levels of bacterial production.

Because the modes are categorical variables we can insert them right into linear models to predict ecosystem functions, such as bacterial production.  Combined with bacterial abundance and a measure of high vs. low nucleic acid bacteria, mode accounted for 76 % of the variance in bacterial production for our samples.  That’s a strong correlation for environmental data.  What this means in practice is: if you know the mode, and you have some flow cytometry data, you can make a pretty good estimate of carbon assimilation by the bacterial community.
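In R terms such a model is just a multiple regression with mode coerced to a factor.  This sketch uses simulated stand-in variables (bp, ba, hna), not the actual Palmer dataset:

```r
## toy stand-ins for bacterial production (bp), bacterial abundance (ba),
## and the fraction of high nucleic acid cells (hna)
set.seed(1)
mode <- sample(1:8, 88, replace = TRUE)
ba <- rlnorm(88)
hna <- runif(88)
bp <- 0.5 * ba + 0.1 * mode + rnorm(88, sd = 0.1)

## mode must enter the model as a factor, not a number
model <- lm(bp ~ ba + hna + factor(mode))
summary(model)$r.squared
```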

For more on what you can do with modes (such as testing for community succession) check out the article!  I’ll post a tutorial on how to segment microbial community structure data into modes using R in a separate post.  It’s easier than you think…

paprica v0.4.0

Sun, 01/08/2017 - 15:04

I’m happy to announce the release of paprica v0.4.0.  This release adds a number of new features to our pipeline for evaluating microbial community and metabolic structure.  These include:

  • NCBI taxonomy information for each point of placement on the reference tree, including internal nodes.
  • Inclusion of the domain Eukarya.  This was a bit tricky and requires some further explanation.

The distribution of metabolic pathways, predicted during the creation of the paprica Eukarya database, across transcriptomes in the MMETSP.

Eukaryotic genomes are a totally different beast than their archaeal and bacterial counterparts.  First and foremost they are massive.  Because of this there aren’t very many completed eukaryotic genomes out there, particularly for single-celled eukaryotes.  While a single investigator can now sequence, assemble, and annotate a bacterial or archaeal genome in very little time, eukaryotic genomes still require major efforts by consortia and lots of $$.

One way to get around this scale problem is to focus on eukaryotic transcriptomes instead of genomes.  Because much of the eukaryotic genome is noncoding this greatly reduces sequencing volume.  Since there is no such thing as a contiguous transcriptome, this approach also implies that no assembly (beyond open reading frames) will be attempted.  The Moore Foundation-funded Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP) was an initial effort to use this approach to address the problem of unknown eukaryotic genetic diversity.  The MMETSP sequenced transcriptomes from several hundred different strains.  The taxonomic breadth of the strains sequenced is pretty good, even if (predictably) the taxonomic resolution is not.  Thus, as for archaea, the phylogenetic tree and metabolic inferences should be treated with caution.  For eukaryotes there are the additional caveats that 1) not all genes coded in a genome will be represented in the transcriptome 2) the database contains only strains from the marine environment and 3) eukaryotic 18S trees are kind of messy.  Considerable effort went into making a decent tree, but you’ve been warned.

Because the underlying data is in a different format, not all genome parameters are calculated for the eukaryotes.  18S gene copy number is not determined (and thus community and metabolic structure are not normalized), the phi parameter, GC content, etc. are also not calculated.  However, eukaryotic community structure is evaluated and metabolic structure inferred in the same way as for the domains bacteria and archaea:

./paprica-run.sh test.eukarya eukarya

As always you can install paprica v0.4.0 by following the instructions here, or you can use the virtual box or Amazon Web Service machine instance.

paprica on the cloud

Fri, 11/25/2016 - 22:15

This is a quick post to announce that paprica, our pipeline to evaluate community structure and conduct metabolic inference, is now available as an Amazon Machine Image (AMI) on the cloud.  The AMI comes with all dependencies required to execute the paprica-run.sh script pre-installed.  If you want to use it to build custom databases you’ll have to install pathway-tools and a few additional dependencies.  I’m new to the Amazon EC2 environment, so please let me know if you have any issues using the AMI.

If you are new to Amazon Web Services (AWS) the basic way this works is:

  • Sign up for Amazon EC2 using your normal Amazon log-in
  • From the AWS console, make sure that your region is N. Virginia (community AMIs are only available in the region they were created in)
  • From your EC2 dashboard, scroll down to “Create Instance” and click “Launch Instance”
  • Now select the “Community AMIs”
  • Search for paprica-ec2, then “Select”
  • Choose the type of instance you would like to run the AMI on.  This is the real power of AWS; you can tailor the instance to the analysis you would like to run.  For testing choose the free t2.micro instance.  This is sufficient to execute the test files or run a small analysis (hundreds of reads).  To use paprica’s parallel features select an instance with the desired number of cores and sufficient memory.
  • Click “Review and Launch”, and finish setting up the instance as appropriate.
  • Log onto the instance, navigate to the paprica directory, execute the test file(s) as described in the paprica tutorial.

Antarctic Long Term Ecological Research

Thu, 10/13/2016 - 12:01

The Palmer and McMurdo Long Term Ecological Research (LTER) projects; separate but equal… at least in terms of interesting ecosystem dynamics if not in terms of biomass!

I’m very excited that our manuscript “Microbial community dynamics in two polar extremes: The lakes of the McMurdo Dry Valleys and the West Antarctic Peninsula Marine Ecosystem” has been published as an overview article in the journal BioScience.  The article belongs to a special issue comparing different ecological aspects of the two NSF-funded Long Term Ecological Research (LTER) sites in Antarctica.  I’m actually writing this post on my return trip from the first ever meeting of the International Long Term Ecological Research (ILTER) network at Kruger National Park in South Africa (an excellent place to ponder ecological questions).

This article had an odd genesis; the special issue was conceived by John Priscu, a PI with the McMurdo LTER project.  I was ensnared in the project along with Trista Vick-Majors, a graduate student with John Priscu (now a postdoctoral scholar at McGill University), shortly after starting my postdoc with Hugh Ducklow, PI on the Palmer LTER project.  The guidance we received was more or less “compare the McMurdo and Palmer LTERs”.  How exactly we should compare perennially ice-covered lakes in a polar desert to one of the richest marine ecosystems on the planet was left up to us.  Fortunately, microbial ecology lends itself to highly reductionist thinking.   This isn’t always helpful, but we reasoned that on a basal level the two ecosystems must function more or less the same.  Despite dramatically different physical settings, both environments host communities of phytoplankton (sometimes even similar taxonomic groups).  These convert solar energy into chemical energy and CO2 into organic carbon, thereby supporting communities of heterotrophic bacteria and grazers.

To look at the details of this we stretched the bounds of what constitutes an “overview article” and aggregated nearly two decades of primary production and bacterial production data collected by the McMurdo LTER, and over a decade of the same from the Palmer LTER.  By looking at the ratio of bacterial production to primary production we assessed how much carbon the heterotrophic bacterial community takes up relative to how much the phytoplankton community produces.


Figure from Bowman et al., 2016, BioScience.  A) Depth-integrated bacterial (BP) and primary production (PP) for the Palmer LTER study area and several lakes in Taylor Valley.  B)  The region occupied by the mean and standard deviation for discrete points (too many points to show).  C) The distribution of BP:PP for each site.

Typical marine values for this ratio are 1:10.  At a value of around 1:5 the carbon demands of heterotrophic bacteria are probably not met by phytoplankton production (the majority of carbon taken up by bacteria is lost through respiration and is not accounted for in the bacterial production assay).  Most of the lakes hover around 1:5, with values above this fairly common.  Lake Fryxell, however, an odd lake at the foot of Canada Glacier, has values that often exceed 1:1!  Consistent with previous work on the lakes, such high rates of bacterial production (relative to primary production) can only be met by a large external carbon subsidy.

Where does this external carbon come from?  Each summer the McMurdo Dry Valleys warm up enough that the various glaciers at the valley peripheries begin to melt.  This meltwater fuels chemoautotrophic bacterial communities where the glacier meets rock (the subglacial environment), and microbial mats in various streams and melt ponds.  Like microbial communities everywhere these bleed a certain amount of dissolved carbon (and particulate; DOC and POC) into the surrounding water.  Some of this carbon ends up in the lakes where it enhances bacterial production.

But external carbon subsidies aren’t the only part of the story.  Nutrients, namely phosphate and nitrate, are washed into the lakes as well.  During big melt years (such as the summer of 2001-2002, when a major positive SAM coupled to an El Niño caused unusually high temperatures) the lakes receive big pulses of relatively labile carbon but also inorganic nutrients and silt.  This odd combination has the effect of suppressing primary production in the near term through lowered light levels (all that silt), enhancing it in the long term (all those nutrients), and giving heterotrophic bacteria some high quality external carbon to feed on during the period that primary production is suppressed.  Or at least that’s how we read it.

Not a lake person?  How do things work over in the Palmer LTER?  One of the biggest ecological differences between Palmer and McMurdo is that the former has grazers (e.g. copepods, salps, and krill) and the latter does not, or at least not so many to speak of.  Thus an argument can be made that carbon dynamics at Palmer are driven (at least partially) by top-down controls (i.e. grazers), while at McMurdo they depend almost exclusively on bottom-up (i.e. chemical and physical) controls.

At times the difference between bacterial production and primary production is pretty extreme at Palmer.  In the summer of 2006 for example, bacterial production was only 3 % of primary production (see Fig. 4 in the publication), and the rate of primary production that summer was pretty high.  The krill population was also pretty high that year; at the top of their 4-year abundance cycle (see Saba et al. 2014, Nature Communications).  This is speculative, but I posit that bacterial production was low in part because a large amount of carbon was being transferred via krill to the higher trophic levels and away from bacteria.  This is a complicated scenario because krill can be good for bacteria; sloppy feeding produces DOC, and krill excrete large amounts of reduced nitrogen and DOC.  Krill also build biomass and respire, however, and their large fecal pellets sink quickly; these could be significant losses of carbon from the photic zone.

Antarctica is changing fast and in ways that are difficult to predict.  Sea ice seems to be growing in the east Antarctic as it is lost from the west Antarctic, and anomalous years buck this trend in both regions.  A major motivation for this special issue was to explore how the changing environment might drive ecological change.  I have to say that after spending a good portion of the (boreal) summer and fall thinking about this, some of that time from the vantage point of Palmer Station, I have no idea.  All of the McMurdo lakes react differently to anomalous years, and Palmer as a region seems to react differently to each abnormal year.  I think the krill story is an important concept to keep in mind here; ecological responses are like superimposed waveforms.  Picture a regularly occurring phenomenon like the El Niño-Southern Oscillation imposing a periodicity on sea ice cover, which we know has a strong influence on biology.  Add a few more oscillating waves from other physical processes.  Now start to add biological oscillations like the four-year krill abundance cycle.  Can we deconvolute this mess to find a signal?  Can we forecast it forward?  Certainly not with 10 years of data at one site and 20 years at the other (and we’re so proud of these efforts!).  Check back next century… if NSF funds these sites that long…

Many thanks to my co-authors for going the distance on this paper, particularly the lake people for many stimulating arguments.  I think limnology and oceanography are, conceptually, much less similar than lakes and oceans.

Creating a landmask for the West Antarctic Peninsula in R

Sun, 09/04/2016 - 08:50

Silicate concentration in µM at 1m depth during the 2014 Palmer LTER cruise.  This plot is lying to you.  The interpolations extend past islets and into landmasses.

This is going to be a pretty niche topic, but probably useful for someone out there.  Lately I’ve been working with a lot of geospatial data for the West Antarctic Peninsula.  One of the things that I needed to do was krige the data (kriging is a form of 2D interpolation; I’m using the pracma library for this).  Kriging is a problem near coastlines because it assumes a contiguous space to work in.  If there happens to be an island or other landmass in the way there is no way to represent the resulting discontinuity in whatever parameter I’m looking at.  Because of this I needed to find a way to mask the output.  This doesn’t really solve the problem, but at least it allows me to identify areas of concern (for example interpolation that extends across an isthmus, if there are sample points on only one side).
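For readers unfamiliar with the technique, here’s a minimal ordinary kriging sketch in base R.  This is an illustration of the idea only, not the pracma routine I actually use; the exponential covariance model and its sill and range parameters are assumptions picked for the example.

```r
## Minimal ordinary kriging: predict z at (x0, y0) from sampled points,
## assuming an exponential covariance model (sill and range are made up).
krig.simple <- function(x, y, z, x0, y0, sill = 1, range. = 50){
  n <- length(x)
  d <- as.matrix(dist(cbind(x, y)))     # distances between sample points
  C <- sill * exp(-d / range.)          # covariance between sample points
  d0 <- sqrt((x - x0)^2 + (y - y0)^2)   # distances to the prediction point
  c0 <- sill * exp(-d0 / range.)        # covariance to the prediction point
  ## ordinary kriging system, bordered with a Lagrange multiplier so that
  ## the weights sum to 1 (unbiasedness constraint)
  A <- rbind(cbind(C, 1), c(rep(1, n), 0))
  b <- c(c0, 1)
  w <- solve(A, b)[1:n]                 # kriging weights
  sum(w * z)                            # weighted prediction
}

## Predict a value at an unsampled point from three stations; by symmetry
## of this toy configuration the answer is exactly 4.
krig.simple(x = c(0, 10, 20), y = c(0, 5, 0), z = c(2, 4, 6), x0 = 10, y0 = 0)
```

In practice the hard part is estimating the covariance parameters from the data (via a variogram) rather than assuming them, which is what a real kriging routine does for you.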

I’m kriging and building maps entirely inside R, which has somewhat immature packages for dealing with geospatial data.  The easiest masking solution would be to use filled polygons from any polygon-format shapefile that accurately represents the coastline.  Unfortunately I couldn’t find an R package that does this correctly with the shapefiles that I have access to.  In addition, because of my downstream analysis it was better to mask the data itself, and not just block out landmasses in the graphical output.

Sharon Stammerjohn at the NSIDC pointed me to the excellent Bathymetry and Global Relief dataset produced by NOAA.  This wasn’t a complete solution to the problem but it got me moving in the right direction.  Using NOAA’s custom grid extractor I selected an ETOPO1 (bedrock) grid along the WAP, with xyz as the output format.  If you’re struggling with shapefiles the xyz format is like a cool drink of water: a 3-column matrix of longitude, latitude, and height (or depth).  For the purpose of creating the mask I considered landmass to be any lat-long combination with height ≥ 0.

There is one more really, really big twist to what I was trying to do, however.  The Palmer LTER uses a custom 1 km pixel grid instead of latitude-longitude.  It’s a little easier to conceptualize than lat-long given the large longitude distortions at high latitude (and the inconvenient regional convergence of lat-long values on similar negative numbers).  It is also a little more ecologically relevant, being organized parallel to the coastline instead of north to south.  Unfortunately this makes the grid completely incompatible with other Euclidean reference systems such as UTM.  So before I could use my xyz file to construct a land mask I needed to convert it to the line-station grid system used by the Palmer LTER.  If you’re working in lat-long space you can skip over this part.


The Palmer LTER grid provides a convenient geospatial reference for the study area, but converting between line (y) and station (x) coordinates and latitude-longitude is non-trivial.

Many moons ago someone wrote a Matlab script to convert lat-long to line-station, which you can find here.  Unfortunately I’m not a Matlab user, nor am I inclined to become one.  Fortunately it was pretty straightforward to copy-paste the code into R and fix the syntactic differences between the two languages.  Three functions in total are required:

## AUTHORS OF ORIGINAL MATLAB SCRIPT:
#   Richard A. Iannuzzi
#   Lamont-Doherty Earth Observatory
#
#   based on: LTERGRID program written by Kirk Waters (NOAA Coastal Services Center), February 1997

## some functions that are used by the main function

SetStation <- function(e, n, CENTEREAST, CENTERNORTH, ANGLE){
  uu = e - CENTEREAST
  vv = n - CENTERNORTH
  z1 = cos(ANGLE)
  z2 = sin(ANGLE)
  NorthKm = (z1 * uu - z2 * vv) / 1000 + 600
  EastKm = (z2 * uu + z1 * vv) / 1000 + 40
  return(c(NorthKm, EastKm))
}

CentralMeridian <- function(iz){
  if(abs(iz) > 30){
    iutz = abs(iz) - 30
    cm = ((iutz * 6.0) - 3.0) * -3600
  }
  else{
    iutz = 30 - abs(iz)
    cm = ((iutz * 6.0) + 3.0) * +3600
  }
  return(cm)
}

GeoToUTM <- function(lat, lon, zone){
  axis = c(6378206.4,6378249.145,6377397.155,
           6378157.5,6378388.,6378135.,6377276.3452,
           6378145.,6378137.,6377563.396,6377304.063,
           6377341.89,6376896.0,6378155.0,6378160.,
           6378245.,6378270.,6378166.,6378150.)
  bxis = c(6356583.8,6356514.86955,6356078.96284,
           6356772.2,6356911.94613,6356750.519915,6356075.4133,
           6356759.769356,6356752.31414,6356256.91,6356103.039,
           6356036.143,6355834.8467,6356773.3205,6356774.719,
           6356863.0188,6356794.343479,6356784.283666,
           6356768.337303)
  ak0 = 0.9996
  radsec = 206264.8062470964
  sphere = 9
  a = axis[sphere - 1]             # major axis size
  b = bxis[sphere - 1]             # minor axis size
  es = ((1 - b^2/a^2)^(1/2))^2     # eccentricity squared
  slat = lat * 3600                # latitude in seconds
  slon = -lon * 3600               # longitude in seconds
  cm = 0                           # central meridian in sec
  iutz = 0
  cm = CentralMeridian(zone)       # call the function
  phi = slat/radsec
  dlam = -(slon - cm)/radsec
  epri = es/(1.0 - es)
  en = a/sqrt(1.0 - es * sin(phi)^2)
  t = tan(phi)^2
  c = epri * cos(phi)^2
  aa = dlam * cos(phi)
  s2 = sin(2.0 * phi)
  s4 = sin(4.0 * phi)
  s6 = sin(6.0 * phi)
  f1 = (1.0 - (es/4.) - (3.0*es*es/64.) - (5.0*es*es*es/256))
  f2 = ((3*es/8) + (3.0*es*es/32) + (45*es*es*es/1024))
  f3 = ((15*es*es/256) + (45*es*es*es/1024))
  f4 = (35*es*es*es/3072)
  em = a*(f1*phi - f2*s2 + f3*s4 - f4*s6)
  xx = ak0 * en * (aa + (1.-t+c) * aa^3/6 + (5 - 18*t + t*t + 72*c - 58*epri) * aa^5/120) + 5e5
  yy = ak0 * (em + en * tan(phi) * ((aa*aa/2) + (5-t+9*c+4*c*c)*aa^4/24 + (61-58*t+t*t+600*c - 330*epri)*aa^6/720))
  if(zone < 0 | slat < 0){
    yy = yy + 1e7
  }
  return(c(xx, yy))
}

## This function actually works with your data

ll2gridLter <- function(inlat, inlon){
  NorthKm = 0                   # initialize
  EastKm = 0                    # initialize
  zone = -20                    # set zone (for LTER region, I think)
  ANGLE = -50 * pi / 180
  CENTEREAST = 433820.404       # eastings for station 600.040
  CENTERNORTH = 2798242.817     # northings for station 600.040
  ## take latitude longitude and get station
  x.y = GeoToUTM(inlat, inlon, zone)
  NorthKm.EastKm = SetStation(x.y[1], x.y[2], CENTEREAST, CENTERNORTH, ANGLE)
  return(NorthKm.EastKm)
}

Once the functions are defined I used them to convert the lat/long coordinates in the xyz file to line-station.

## Read in xyz file (insert the path to your ETOPO1 xyz download).
lat.long.depth <- read.table('', header = F, col.names = c('long', 'lat', 'depth'))

## Limit to points at or above sea level.
lat.long.land <- lat.long.depth[which(lat.long.depth$depth >= 0),]

## Create a matrix to hold the output.
line.station.depth <- matrix(ncol = 3, nrow = length(lat.long.land$long))
colnames(line.station.depth) <- c('line', 'station', 'depth')

## Execute the ll2gridLter function on each point. Yes, I'm using a loop to do this.
for(i in 1:length(lat.long.land$long)){
  line.station.depth[i,] <- c(ll2gridLter(lat.long.land$lat[i], lat.long.land$long[i]), lat.long.land$depth[i])
  print(paste(c(i, line.station.depth[i,])))
}

## Write out the matrix.
write.csv(line.station.depth, 'palmer_grid_landmask.csv', row.names = F, quote = F)

At this point I had a nice csv file with line, station, and elevation.  I was able to read this into my existing kriging script and convert it into a mask.

## Read in csv file.
landmask <- read.csv('palmer_grid_landmask.csv')

## Limit to the lines and stations that I'm interested in.
landmask <- landmask[which(landmask[,1] <= 825 & landmask[,1] >= -125),]
landmask <- landmask[which(landmask[,2] <= 285 & landmask[,2] >= -25),]

## Interpolated data is at 1 km resolution, need to round off
## to same resolution here.
landmask.expand <- cbind(ceiling(landmask[,1]), ceiling(landmask[,2]))

## Unfortunately this doesn't adequately mask the land. Need to expand the size of each
## masked pixel 1 km in each direction.
landmask.expand <- rbind(landmask.expand, cbind(floor(landmask[,1]), floor(landmask[,2])))
landmask.expand <- rbind(landmask.expand, cbind(ceiling(landmask[,1]), floor(landmask[,2])))
landmask.expand <- rbind(landmask.expand, cbind(floor(landmask[,1]), ceiling(landmask[,2])))
landmask.expand <- unique(landmask.expand)

I’m not going to cover how I did the kriging in this post.  My kriged data is in a matrix called temp.pred.matrix with colnames given by ‘x’ followed by ‘station’, as in x20 for station 20, and row names ‘y’ followed by ‘line’, as in y100 for line 100.  To convert interpolated points that are actually land to NA values I simply added this line to my code:

temp.pred.matrix[cbind(paste0('y', landmask.expand[,1]), paste0('x', landmask.expand[,2] * -1))] <- NA

Here’s what the kriged silicate data looks like after masking.


Silicate concentration in µM at 1m depth during the 2014 Palmer LTER cruise after masking the underlying data.

Excellent.  The masked area corresponds with known landmasses; that’s Adelaide Island (home of Rothera Station) at the bottom of the Peninsula, and various islets and the western edge of the Antarctic Peninsula to the northeast.  At this point the erroneous data has been eliminated from the matrix.  Annual inventories and the like can be calculated more accurately from the data, and our eyes are no longer drawn to interesting features in the interpolation that have no chance of reflecting reality because they are over land.  The white doesn’t look that appealing to me in this plot, however, so I masked the land with black by adding points to the plot.  Again, I’m not going to show the whole plotting routine because some of the variables called would require a lengthy explanation of how the larger dataset is structured.  The plot was created using imagep in the oce package.  This command automatically transposes the matrix.

## Show masked points as black squares.
points(landmask.expand[,1] ~ {-1 * landmask.expand[,2]},
       pch = 15,
       cex = 0.6)

And the final plot:


Silicate concentration in µM at 1m depth during the 2014 Palmer LTER cruise after masking the underlying data and adding points to indicate the masked area.



Astrobiology Primer v2

Sat, 08/20/2016 - 23:16

The long-awaited version 2 of the Astrobiology Primer was published (open access) yesterday in the journal Astrobiology.  I’m not sure who first conceived of the Astrobiology Primer, first published in 2006, but the v2 effort was headed by co-lead editors Shawn Domagal-Goldman at NASA and Katherine Wright at the UK Space Agency.  The Primer v2 was a long time in coming; the initial section text was submitted back in early 2011!  The longer these projects go on, the easier it is for them to die.  Many thanks to Shawn and Katherine for making sure that this didn’t happen!

The downside of course, is that the primer ran the risk of being outdated before it even went into review.  This was mitigated somewhat by the review process itself, and authors did have a chance to update their various sections.  Some sections are more stable than others; the section that I wrote with Shawn McGlynn (now at the Tokyo Institute of Technology) on how life uses energy for example, covers some fairly fundamental ground and is likely to stand the test of time.  Less so for sections that cover planetary processes in and outside of our solar system; paradigms are being broken in planetary science as fast as they form!

The Astrobiology Primer is a very valuable document because it takes a complicated and interdisciplinary field of study and attempts to summarize it for a broad audience.  Most of the Primer should be accessible to anyone with a basic grasp of science.  I wonder if it could even serve as a model for other disciplines.  What if the junior scientists in every discipline (perhaps roughly defined by individual NSF or NASA programs) got together once every five years to write an open-access summary of the major findings in their field?  This might provide a rich and colorful counterpoint to the valuable but often [dry, esoteric, top-down?  There’s an adjective that I’m searching for here but it escapes me] reports produced by the National Academies.

The co-lead editors were critical to the success of the Primer v2.  I haven’t explicitly asked Shawn and Katherine if they were compensated in any way for this activity – perhaps they rolled some of this work into various fellowships and such over the years.  More likely this was one more extracurricular activity carried out on the side.  Such is the way science works, and the lines are sufficiently blurred between curricular and extracurricular that most of us don’t even look for them anymore.  In recognition of this, and to speed the publication and heighten the quality of a future Primer v3, it would be nice to see NASA produce a specific funding call for a (small!) editorial team.  Three years of partial salary and funding for a series of writing workshops would make a huge difference in scope and quality.

How I learned to stop worrying and love subsampling (rarifying)

Tue, 07/12/2016 - 12:53

I have had the 2014 paper “Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible” by McMurdie and Holmes sitting on my desk for a while now.  Yesterday I finally got around to reading it and was immediately a little skeptical as a result of the hyperbole with which they criticized the common practice of subsampling* libraries of 16S rRNA gene reads during microbial community analysis.  The logic of that practice proceeds like this:

To develop 16S rRNA gene data (or any other marker gene data) describing a microbial community we generally collect an environmental sample, extract the DNA, amplify it, and sequence the amplified material.  The extract might contain the DNA from 1 billion microbial cells present in the environment.  Only a tiny fraction (<< 1 %) of these DNA molecules actually get sequenced; a “good” sequence run might contain only tens of thousands of sequences per sample.  After quality control and normalizing for multiple copies of the 16S rRNA gene it will contain far fewer.  Thus the final dataset contains a very small random sample of the original population of DNA molecules.

Most sequence-based microbial ecology studies involve some kind of comparison between samples or experimental treatments.  It makes no sense to compare the abundance of taxa between two datasets of different sizes, as the dissimilarity between the datasets will appear much greater than it actually is.  One solution is to normalize by dividing the abundance of each taxon by the total reads in the dataset to get relative abundance.  In theory this works great, but it has the disadvantage of not taking into account that a larger dataset samples the original population of DNA molecules more deeply; more rare taxa might therefore be represented.  A common practice is to reduce the amount of information present in the larger dataset by subsampling to the size of the smaller dataset.  This approximates the case where both datasets undertake the same level of random sampling.
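A toy example may make the distinction concrete.  The counts below are invented for illustration:

```r
## The same three taxa sequenced to very different depths.
set.seed(1)
big   <- c(taxon1 = 6000, taxon2 = 3000, taxon3 = 1000)   # 10,000 reads
small <- c(taxon1 = 600,  taxon2 = 350,  taxon3 = 50)     # 1,000 reads

## Option 1: normalize to relative abundance.
big / sum(big)
small / sum(small)

## Option 2: subsample the larger library down to the size of the smaller.
reads <- rep(names(big), times = big)          # one entry per sequence read
sub <- table(sample(reads, sum(small)))        # draw 1,000 reads at random
sub
```

After option 2 the subsampled library reflects the same sampling effort as the small one, at the cost of discarding most of the reads in the large one.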

McMurdie and Holmes argue that this approach is indefensible for two common types of analysis: identifying differences in community structure between multiple samples or treatments, and identifying differences in abundance for specific taxa between samples and treatments.  I think the authors do make a reasonable argument for the latter analysis; however, at worst the use of subsampling and/or normalization simply reduces the sensitivity of the analysis.  I suspect that dissimilarity calculations between treatments or samples using realistic datasets are much less sensitive to reasonable subsampling than the authors suggest.  I confess that I am (clearly) very far from being a statistician and there is a lot in McMurdie and Holmes, 2014 that I’m still trying to digest.  I hope that our colleagues in statistics and applied mathematics continue optimizing these (and other) methods so that microbial ecology can improve as a quantitative science.  There’s no need, however, to get everyone spun up without a good reason.  To try and understand if there is a good reason, at least with respect to my data and analysis goals, I undertook some further exploration.  I would strongly welcome any suggestions, observations, and criticisms of this post!

My read of McMurdie and Holmes is that the authors object to subsampling because it disregards data that could be used by more sophisticated (i.e. parametric) methods to estimate the true abundance of taxa.  This is true; data discarded from larger datasets does contain information that can be used to estimate the true abundance of taxa among samples.  The question is how much of a difference it really makes.  McMurdie and Holmes advocate using methods adopted from transcriptome analysis.  These methods are necessary for transcriptomes because 1) the primary objective of the study is not usually to see if one transcriptome is different from another, but which genes are differentially expressed, and 2) I posit that the abundance distribution of transcript data is different than the taxa abundance distribution.  An example of this can be seen in the plots below.

Taxon abundance, taken from a sample that I’m currently working with.

Transcript abundance by PFAM, taken from a sample (selected at random) from the MMETSP.

In both cases the data are log distributed, with a few very abundant taxa or PFAMs and many rare ones.  What constitutes “abundant” in the transcript dataset, however, is very different from “abundant” in the community structure dataset.  The transcript dataset is roughly half the size (n = 9,000 vs. n = 21,000); nonetheless the most abundant transcript has an abundance of only 209.  The most abundant OTU has an abundance of 6,589, and there are several very abundant OTUs.  Intuitively this suggests to me that the structure of the taxon dataset is much more reproducible via subsampling than the structure of the transcript dataset, as the most abundant OTUs have a high probability of being sampled.  The longer tail of the transcript data contributes to this as well, though of course this tail is controlled to a large extent by the classification scheme used (here PFAMs).

To get an idea of how reproducible the underlying structure was for the community structure and transcript data I repeatedly subsampled both (with replacement) to 3,000 and 1,300 observations, respectively.  For the community structure data this is about the lower end of the sample size I would use in an actual analysis – amplicon libraries this small are probably technically biased and should be excluded (McMurdie and Holmes avoid this point at several spots in the paper).  The results of subsampling are shown in these heatmaps, where each row is a new subsample.  For brevity only columns summing to an abundance > 100 are shown.
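The procedure can be sketched with a mock community in base R (the abundances are invented, and I’ve written out the Bray-Curtis calculation by hand rather than calling vegan):

```r
## Repeatedly subsample a mock log-distributed community and ask how
## reproducible the resulting structure is.
set.seed(123)
counts <- c(5000, 2000, 800, 150, 40, 10)   # a few abundant OTUs, many rare
names(counts) <- paste0('otu', 1:6)
reads <- rep(names(counts), times = counts)

subsample <- function(n){
  s <- table(factor(sample(reads, n, replace = T), levels = names(counts)))
  as.numeric(s)
}

## 100 subsamples of 3,000 observations each, one per row.
iters <- t(replicate(100, subsample(3000)))

## Bray-Curtis dissimilarity of each iteration against the first.
bray <- function(a, b){sum(abs(a - b)) / sum(a + b)}
d <- apply(iters[-1,], 1, bray, b = iters[1,])
max(d)   # small values = structure is reproducible under subsampling
```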

In these figures the warmer colors indicate a higher number of observations for that OTU/PFAM.  The transcript data is a lot less reproducible than the community structure data; the Bray-Curtis dissimilarity across all iterations maxes out at 0.11 for community structure and 0.56 for the transcripts.  The extreme case would be if the data were normally distributed (i.e. few abundant and few rare observations, many intermediate observations).  Here’s what subsampling does to normally distributed data (n = 21,000, mean = 1000, sd = 200):


If you have normally distributed data don’t subsample!

For the rest of us it seems that, for the test dataset used here, community structure is at least somewhat reproducible via subsampling.  There are differences between iterations, however; what do these mean in the context of the larger study?

The sample was drawn from a multiyear time series from a coastal marine site.  The next most similar sample (in terms of community composition and ecology) was, not surprisingly, a sample taken one week later.  By treating this sample in an identical fashion, then combining the two datasets, it was possible to evaluate how easy it is to tell the two samples apart after subsampling.  In this heatmap the row colors red and black indicate iterations belonging to the two samples:

Clustering of repeated subsamplings from two similar samples.  Sample identity is given by the red or black color along the y-axis.


As this heatmap shows, there is perfect fidelity for these two samples.  Presumably with very similar samples this would start to break down; determining how similar samples need to be before they cannot be distinguished at a given level of subsampling would be a useful exercise.  The authors attempt to do this in Simulation A/Figure 5 in the paper, but it isn’t clear to me why their results are so poor – particularly given very different sample types and a more sophisticated clustering method than I’ve applied here.
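A toy version of the two-sample test looks something like this (both communities are invented, and I’ve used base R’s Euclidean distance and complete-linkage clustering rather than the methods above):

```r
## Subsample two similar mock communities repeatedly and check whether
## clustering recovers the sample of origin.
set.seed(42)
a <- c(5000, 2000, 800, 150, 40, 10)
b <- c(4500, 2400, 700, 250, 30, 20)   # similar, but not identical

draw <- function(counts, n = 3000){
  reads <- rep(seq_along(counts), times = counts)
  tabulate(sample(reads, n), nbins = length(counts))
}

iters <- rbind(t(replicate(20, draw(a))), t(replicate(20, draw(b))))
group <- rep(c('a', 'b'), each = 20)

clust <- cutree(hclust(dist(iters)), k = 2)
table(group, clust)   # perfect fidelity: each group falls in one cluster
```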

As a solution – necessary for transcript data, normally distributed data, or for analyses of differential abundance, probably less essential for comparisons of community structure – the authors propose a mixture model approach that takes into account variance across replicates to estimate “real” OTU abundance.  Three R packages that can do this are mentioned: edgeR, DESeq, and metagenomeSeq.  The problem with these methods – as I understand them – is that they require experimental replicates.  According to the DESeq authors, technical replicates should be summed, and samples should be assigned to treatment pools (e.g. control, treatment 1, treatment 2…).  Variance is calculated within each pool and this is used to model differences between pools.  This is great for a factor-based analysis, as is common in transcriptome analysis or human microbial ecology studies.  If you want to find a rare, potentially disease-causing strain differently present between a healthy control group and a symptomatic experimental group for example, this is a great way to go about it.

There are many environmental studies for which these techniques are not useful however, as it may be impractical to collect experimental replicates.  For example it is both undesirable and impractical to conduct triplicate or even duplicate sampling in studies focused on high-resolution spatial or temporal sampling.  Sequencing might be cheap now, but time and reagents are not.  Some of the methods I’m working with are designed to aggregate samples into higher-level groups – at which point these methods could be applied by treating within-group samples as “replicates” – but this is only useful if we are interested in testing differential abundance between groups (and doesn’t solve the problem of needing to get samples grouped in the first place).

These methods can be used to explore differential abundance in non-replicated samples; however, they are grossly underpowered when used without replication.  Here’s an analysis of differential abundance, using DESeq, between the sample in the first heatmap above and its least similar companion from the same (ongoing) study.  You can follow along with any standard abundance table where the rows are samples and the columns are variables.

library(DESeq)

## DESeq wants data oriented opposite to how community abundance
## data is normally presented (e.g. to vegan).
data.dsq <- t(data)

## DESeq requires a set of conditions which are factors. Ideally
## this would be control and treatment groups, or experimental pools
## or some such, but we don't have that here. So the conditions are
## unique column names (which happen to be dates).
conditions <- factor(as.character(colnames(data.dsq)))

## As a result of 16S rRNA gene copy number normalization abundance
## data is floating point numbers, convert to integers.
data.dsq <- ceiling(data.dsq)

## Now start working with DESeq.
data.ct <- newCountDataSet(data.dsq, conditions = conditions)
data.size <- estimateSizeFactors(data.ct)

## This method and sharing mode is required for unreplicated samples.
data.disp <- estimateDispersions(data.size, method = 'blind', sharingMode = 'fit-only')

## And now we can execute a test of differential abundance for the
## two samples used in the above example.
test <- nbinomTest(data.disp, '2014-01-01', '2014-01-06')
test <- na.omit(test)

## Plot the results.
plot(test$baseMeanA, test$baseMeanB,
     #log = 'xy',
     pch = 19,
     ylim = c(0, max(test$baseMeanB)),
     xlim = c(0, max(test$baseMeanA)),
     xlab = '2014-01-01',
     ylab = '2014-01-06')
abline(0, 1)

## If we had significant differences we could then plot them like this:
points(test$baseMeanA[which(test$pval < 0.05)],
       test$baseMeanB[which(test$pval < 0.05)],
       pch = 19,
       col = 'red')



As we would expect there are quite a few taxa present in high abundance in one sample and not the other; however, none of the associated p-values are anywhere near significant.  I’m tempted to use subsampling to create replicates, which would allow an estimate of variance across subsamples and access to greater statistical power.  This is clearly not as good as real biological replication, but we have to work within the constraints of our experiments, time, and funding…
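For what it’s worth, the pseudo-replicate idea might look something like this (invented counts; the variance captured here reflects subsampling error only, not biological variation, which is exactly its weakness):

```r
## Subsample one library several times and treat the draws as "replicates".
set.seed(321)
counts <- c(otu1 = 4000, otu2 = 1500, otu3 = 300, otu4 = 20)
reads <- rep(names(counts), times = counts)

pseudo.reps <- t(replicate(5, table(factor(sample(reads, 2000), levels = names(counts)))))
pseudo.reps                  # 5 "replicates" of the same sample
apply(pseudo.reps, 2, var)   # per-OTU variance from subsampling alone
```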

*You might notice that I’ve deliberately avoided using the terms “microbiome” and “rarefying” here.  In one of his comics Randall Munroe asserted that the number of made-up words in a book is inversely proportional to the quality of the book; similarly, I strongly suspect that the validity of a sub-discipline is inversely proportional to the number of jargonistic words that community makes up to describe its work.  As a member of said community, what’s wrong with subsampling and microbial community??

Exploring genome content and genomic character with paprica and R

Thu, 07/07/2016 - 13:07

The paprica pipeline was designed to infer the genomic content and genomic characteristics of a set of 16S rRNA gene reads.  To enable this the paprica database organizes this information by phylogeny for many of the completed genomes in Genbank.  In addition to metabolic inference this provides an opportunity to explore how genome content and genomic characteristics are organized phylogenetically.  The following is a brief analysis of some genomic features using the paprica database and R.  If you aren’t familiar with the paprica database this exercise will also familiarize you with some of its content and its organization.

The paprica pipeline and database can be obtained from Github here.  In this post I’ll be using the database associated with version 0.3.1.  The necessary files from the bacteria database (one could also conduct this analysis on the much smaller archaeal database) can be read into R as such:

## Read in the pathways associated with the terminal nodes on the reference tree.
path <- read.csv('paprica/ref_genome_database/bacteria/terminal_paths.csv', row.names = 1)
path[] <- 0

## Read in the data associated with all completed genomes in Genbank
## (insert the genome data csv from the database).
data <- read.csv('paprica/ref_genome_database/bacteria/', row.names = 1)

## During database creation genomes with duplicate 16S rRNA genes were removed,
## so limit to those that were retained.
data <- data[row.names(data) %in% row.names(path),]

## "path" is ordered by clade, meaning it is in top to bottom order of the reference tree,
## however, "data" is not, so order it here.
data <- data[order(data$clade),]

One fun thing to do at this point is to look at the distribution of metabolic pathways across the database.  To develop a sensible view it is best to cluster the pathways according to which genomes they are found in.

## The pathway file in the database is binary, so we use Jaccard for distance.
library('vegan')
path.dist <- vegdist(t(path), method = 'jaccard') # distance between pathways (not samples!)
path.clust <- hclust(path.dist)

The heatmap function is a bit cumbersome for this large matrix, so the visualization can be made using the image function.

## Set a binary color scheme.
image.col <- colorRampPalette(c('white', 'blue'))(2)

## Image will order matrix in ascending order, which is not what we want here!
image(t(data.matrix(path))[rev(path.clust$order), length(row.names(path)):1],
      col = image.col,
      ylab = 'Genome',
      xlab = 'Pathway',
      xaxt = 'n',
      yaxt = 'n')
box()
The distribution of metabolic pathways across all 3,036 genomes in the v0.3.1 paprica database.


There are a couple of interesting things to note in this plot.  First, we can see the expected distribution of core pathways present in nearly all genomes, and the interesting clusters of pathways that are unique to a specific lineage.  For clarity row names have been omitted from the above plot, but from within R you can pull out the taxa or pathways that interest you easily enough.  Second, there are some genomes that have very few pathways.  There are a couple of possible reasons for this that can be explored with a little more effort.  One possibility is that these are poorly annotated genomes, or at least the annotation didn’t associate many or any coding sequences with either EC numbers or GO terms – the two pieces of information Pathway-Tools uses to predict pathways during database construction.  Another possibility is that these genomes belong to obligate symbionts (either parasites or beneficial symbionts).  Obligate symbionts often have highly streamlined genomes and few complete pathways.  We can compare the number of pathways in each genome to other genome characteristics for additional clues.

A reasonable assumption is that the number of pathways in each genome should scale with the size of the genome.  Large genomes with few predicted pathways might indicate places where the annotation isn’t compatible with the pathway prediction methods.

## Plot the number of pathways as a function of genome size.
plot(rowSums(path) ~ data$genome_size,
     ylab = 'nPaths',
     xlab = 'Genome size')

## Plot P. ubique as a reference point.
select <- grep('Pelagibacter ubique HTCC1062', data$organism_name)
points(rowSums(path)[select] ~ data$genome_size[select],
       pch = 19,
       col = 'red')
The number of metabolic pathways predicted as a function of genome size for the genomes in the paprica database.


That looks pretty good.  For the most part more metabolic pathways were predicted for larger genomes; however, there are some exceptions.  The red point gives the location of Pelagibacter ubique HTCC1062.  This marine bacterium is optimized for life under energy-limited conditions.  Among its adaptations is a highly efficient and streamlined genome; in fact it has the smallest genome of any known free-living bacterium.  All the points below it on the x-axis are obligate symbionts, which depend on their host for some of their metabolic needs.  There are a few larger genomes that have very few (or even no) pathways predicted.  These are the genomes with bad, or at least incompatible, annotations (or particularly peculiar biochemistry).

The other genome parameters in paprica are the number of coding sequences identified (nCDS), the number of genetic elements (nge), the number of 16S rRNA gene copies (n16S), GC content (GC), and phi, a measure of genomic plasticity.  We can make another plot to show the distribution of these parameters with respect to phylogeny.

## Grab only the data columns we want. <- data[,c('n16S', 'nge', 'ncds', 'genome_size', 'phi', 'GC')]

## Make the units somewhat comparable on the same scale, a more
## careful approach would log-normalize some of the units first. <- decostand(, method = 'standardize') <- decostand(, method = 'range')

## Plot with a heatmap.
heat.col <- colorRampPalette(c('blue', 'white', 'red'))(100)
heatmap(data.matrix(,
        margins = c(10, 20),
        col = heat.col,
        Rowv = NA,
        Colv = NA,
        scale = NULL,
        labRow = 'n',
        cexCol = 0.8)
Genomic parameters organized by phylogeny.


Squinting at this plot it looks like GC content and phi are potentially negatively correlated, which could be quite interesting.  These two parameters can be plotted to get a better view:

plot(params$phi ~ params$GC,
     xlab = 'GC',
     ylab = 'phi')
The phi parameter of genomic plasticity as a function of GC content.


Okay, not so much… but I think the pattern here is pretty interesting.  Above a GC content of 50 % there appears to be no relationship, but these parameters do seem correlated for low GC genomes.  This can be evaluated with linear models for genomes above and below 50 % GC.

gc.phi.above50 <- lm(params$phi[which(params$GC >= 50)] ~ params$GC[which(params$GC >= 50)])
gc.phi.below50 <- lm(params$phi[which(params$GC < 50)] ~ params$GC[which(params$GC < 50)])

summary(gc.phi.above50)
summary(gc.phi.below50)

plot(params$phi ~ params$GC,
     xlab = 'GC',
     ylab = 'phi',
     type = 'n')

points(params$phi[which(params$GC >= 50)] ~ params$GC[which(params$GC >= 50)],
       col = 'blue')

points(params$phi[which(params$GC < 50)] ~ params$GC[which(params$GC < 50)],
       col = 'red')

abline(gc.phi.above50,
       col = 'blue')

abline(gc.phi.below50,
       col = 'red')

legend('bottomleft',
       bg = 'white',
       legend = c('GC >= 50',
                  'GC < 50'),
       col = c('blue', 'red'),
       pch = 1)

Genomic plasticity (phi) as a function of GC content for all bacterial genomes in the paprica database.

As expected there is no correlation between genomic plasticity and GC content for the high GC genomes (R² = 0), and a highly significant correlation for the low GC genomes (albeit with weak predictive power; R² = 0.106, p ≈ 0).  So what's going on here?  Low GC content is associated with particular microbial lineages, but also with certain ecologies.  The free-living low-energy specialist P. ubique HTCC1062 has a low GC content genome, for example, as do many obligate symbionts regardless of their taxonomy (I don't recall if it is known why this is).  Both groups are associated with a high degree of genomic modification, including genome streamlining and horizontal gene transfer.

The 6 cent speeding ticket

Fri, 06/17/2016 - 11:51

I’m going to go way off the normal track here and do a bit of social commentary.  I heard a radio piece on my drive home yesterday about the challenge of paying court and legal fees for low income wage earners.  These fees can trap those guilty of minor offenses (like a traffic infraction) in a cycle of jail and debt that is difficult to break out of.  It never made sense to me that financial penalties – which by their nature are punitive – don’t scale with income, as is common in some European countries.  I decided to try and visualize how the weight of a penalty for, say, a speeding ticket scales with income.  It was tempting to try and scale up, i.e. what is the equivalent of $300 for someone earning $X?  I decided however, that it would be more informative to try and scale down.

Cost equivalent in minimum wage earner dollars of a $300 penalty for individuals at different income levels. Red text is the cost equivalent (y-axis value). X-axis (income) is on a log scale. See text for income data sources.

In this scenario the penalty is $300.  Someone earning the federal minimum wage and working 40 hours a week makes $15,080/year, so the $300 penalty is roughly 2 % of annual income.  So be it, perhaps that’s a fair penalty.  But what would be the equivalent if the same offender earned more?  A private/seaman/airman who has just joined the military earns roughly $18,561/year.  Paying the same ticket (and I know from experience that the military pays a lot of them) would equate to the minimum wage earner paying $243.72.  A graduate student fortunate enough to get a stipend (and own a car) might earn $25,000/year.  Paying the same ticket would be equivalent to the lowest wage earner paying $180.96, and down it goes along the income scale.  If LeBron James, who earned $77.2 million last year in salary and endorsements (according to Forbes), got the ticket, the penalty would be equivalent to the lowest income wage earner paying $0.06.  Salary data came from a variety of sources, including here and here.  Salaries marked with an asterisk in the plot above are medians from these sources.
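The scaling itself is simple arithmetic: take the fraction of annual income the fine represents, and apply that same fraction to a full-time federal minimum wage income.  A quick sketch using the income figures from the text:

```python
# Full-time annual income at the federal minimum wage:
# $7.25/hour * 40 hours/week * 52 weeks = $15,080
MIN_WAGE_ANNUAL = 7.25 * 40 * 52

def equivalent_penalty(penalty, income):
    # What the fine "feels like" when scaled down to a
    # minimum wage earner's income
    return penalty * MIN_WAGE_ANNUAL / income

print(round(equivalent_penalty(300, 25000), 2))    # 180.96 (graduate student)
print(round(equivalent_penalty(300, 77.2e6), 2))   # 0.06 (LeBron James)
```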

The Database Dilemma

Wed, 05/25/2016 - 12:31

Microbial ecologists know they have a problem with data archiving, particularly when it comes to sequence data.  I’m not entirely sure why this is the case; in theory it would be pretty straightforward to build a database searchable by the parameters ecologists are interested in.  I suspect that part of the problem is a lack of interest on the part of NSF (the primary source of funding for ecology) in designing and maintaining databases, and the perception that this is duplicative of the (rather sad) efforts of the NIH-funded NCBI in this area.  I don’t want to be too disparaging of NCBI – they have a much harder job than the simple database I’ve outlined above – but the Sequence Read Archive (SRA) and other NCBI repositories are a near-total disaster.  We’re wasting a lot of collective time, money, and effort developing sequence data that can’t be used for future studies because it can’t be easily accessed.

In response to this a small number of alternate repositories have popped up.  In my experience however, they don’t offer much of an improvement.  This isn’t meant as a harsh criticism of various peoples’ good efforts, I just don’t think the community’s found a viable model yet.  One example is the VAMPS database operated by the Marine Biological Laboratory, which in addition to having an unfortunate name (I believe it was named before the onset of the angsty-teen vampire thing, try googling “VAMPS” and you’ll see what I mean), tries to do too much.  Rather than organizing data and making it searchable the designers opted to try and build an online analysis pipeline.  Instead of easy to find data you end up with an analytical black box and canned results.

The reason for all this soap-boxing is my experience this week trying to obtain some global marine 16S rRNA gene amplicon data for a project that I’m working on.  The data I had hoped to obtain was used in the 2014 PNAS paper Marine bacteria exhibit a bipolar distribution.  My difficulty in obtaining the data highlights failures at the author, reviewer, journal, and repository levels.  I have great respect for many of the authors on this paper, so again this isn’t meant as a harsh criticism (I’m sure I haven’t made data I’ve published sufficiently accessible either), but rather a lesson in something that we all need to do better at.  I’ll end with the Python-based solution I came up with for getting useful sequence data out of the VAMPS flat-file format.

The 2014 paper used 277 samples that were collected as part of the International Census of Marine Microbes (ICOMM).  There are quite a few things one can say about that particular endeavor but we’ll save it for another post.  The authors included the accession numbers for all these studies in a supplementary file.  This file is a pdf, so not the easiest thing to parse, but at least the text is renderable, so the (7-page) table can be copied into a text file and parsed further.  With a little bit of effort you can get a single-column text file containing the accession numbers for each of these studies.  Great.  Unfortunately a good number of these are wildly incorrect, and most of the remainder point to SRA objects that can’t be recognized by the woefully inadequately documented SRA toolkit.  Here are some examples:

SRX011052 – Enter this accession number into the SRA search bar and it will navigate you to a page for a sample with the helpful name “Generic sample from marine metagenome”.  Nevermind for a moment that “metagenome” is an inappropriate name for an amplicon dataset, where is the data?  Click the link for “Run” (with a different accession number: SRR027219) and navigate to a new page with lots of obscure information you don’t care about.  If you’re actually doing this you might, in near desperation, click the “Download” tab in the hope that this gets you somewhere.  Psych!  All that page has is a message that you need to use the SRA toolkit to download this data.  Let’s try the prefetch utility in the toolkit:

>prefetch SRR027219

2016-05-25T15:30:43 prefetch.2.4.2 err: path not found while resolving tree within virtual file system module - 'SRR027219.sra' cannot be found.

Whatever that means.  There’s a trick to using this utility, which I’ll post as soon as I remember what it is.  The documentation suggests that the above command should work.  The absurdity of needing to rely on this utility in place of FTP usually drives me to using the EMBL-EBI site instead.  This is the Euro equivalent of Genbank, anything submitted to one should be accessible through the other.  In general EMBL-EBI is much better organized and more user friendly than SRA.
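One nice feature of ENA/EMBL-EBI is that the FTP paths for read data follow a predictable pattern, so downloads can be scripted once you know the run accessions.  The path scheme below is my understanding of the layout, not something pulled from official documentation, so treat it as an assumption:

```python
def ena_fastq_url(run_acc):
    """Construct a plausible ENA FTP path for a run accession.

    Assumes the common ENA layout: vol1/fastq/<first 6 chars>/<run>/
    for 9-character accessions, with an extra zero-padded subdirectory
    (built from the trailing digits) for longer accessions.
    """
    base = 'ftp://ftp.sra.ebi.ac.uk/vol1/fastq'
    prefix = run_acc[:6]
    if len(run_acc) == 9:
        return '{0}/{1}/{2}/{2}.fastq.gz'.format(base, prefix, run_acc)
    subdir = run_acc[9:].zfill(3)  # e.g. SRR1042865 -> '005'
    return '{0}/{1}/{2}/{3}/{3}.fastq.gz'.format(base, prefix, subdir, run_acc)
```

With a list of run accessions in hand, the URLs can be fed straight to wget or curl.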

Entering “SRX011052” into the EMBL-EBI search bar returns some results that also point us to SRR027219.  The page for SRR027219 has such helpful things as an FTP link for direct download as well as some basic info on the sample.  That’s great, but because the authors opted to provide “Experiment” accession numbers for some samples (e.g. SRX011052) there is no way that I can see to generate a script to automate downloads, as we have no way of knowing the run accession numbers.  The solution is to tediously follow the links to all 277 runs.  I actually started doing this, which led me to my second problem.

SRP001129 – At some point in the supplementary table the accession numbers switch from a “SRX” prefix to “SRP”.  Picking one of these at random and entering it into the EMBL-EBI search bar returns a result for a genome sequencing project related to the Human Microbiome Project.  Clearly this isn’t from the ICOMM project.  Roughly a third of the accession numbers reported in the paper aren’t correct!

Clearly I wasn’t going to get the data that I needed through either EMBL-EBI or the SRA.  Knowing that the study had been part of the ICOMM I turned to the MBL VAMPS database where the ICOMM data is also archived.  Thinking that I was on the home stretch I followed the link to the ICOMM portal from the VAMPS homepage.  The right side column gave me two options for download: tar-zipped sff files or a tab-delimited file.  Not wanting to deal with converting from sff (the native 454 file format) I took a look at the tab-delim option.  The top lines of the file look like this:

project dataset sequence taxonomy knt rank total_knt frequency
ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGACGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05
ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGA Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05
ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 13 family 36172 0.000359394006
ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGATGAAGATCAGCGTAATGAGCTTATCGGATTTTCTAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05
ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAAGGCCGACTGTTAGATGAAGATTCAGCGTAATGAGCTTATCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05
ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAGGGCCGACTGTTAGATGAAGACCAGTGTAACGAACTTGTCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05
ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAGGGCCGACTGTTATATGAAGACCAATGTGATGAACTTGTCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 2 family 36172 5.5291386e-05
ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTAATGAGCTTGTCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05
ICM_ABR_Av6 ABR_0006_2005_01_07 GAAACTCACCAGGGCCGACTGTTATATGAAGACCAGCGTGATGAACTTGTCGGATTTTCAGAGAG Archaea;Euryarchaeota;Thermoplasmata;Thermoplasmatales;Marine_Group_II 1 family 36172 2.7645693e-05

Okay, so this is a bizarre way to share sequence data, but at this point I was willing to take it.  Basically the file gives us unique reads, their abundance (knt), and taxonomy according to VAMPS.  What I needed to do was convert this into a separate, non-unique fasta file for each sample.  With a little help from Pandas this wasn’t too bad.  The vamps_icomm_surface.csv file specified below is just a csv version of the pdf table in the PNAS publication, after some tidying up.  The second column of the csv file is the sample name in VAMPS (I had to split the first column of the original table on ‘.’ to get this).

import pandas as pd

## Get the metadata for surface samples, this was generated
## from the pdf table in the PNAS paper.
metadata = pd.read_csv('vamps_icomm_surface.csv', index_col=1)

## Get the seqs, downloaded from
seqs = pd.read_csv('icomm_data.tsv', delimiter='\t', index_col=1)

## Iterate across each study in metadata, making a temp dataframe
## of reads from that study.
for study in metadata.index:
    temp = seqs.loc[study]
    temp_bacteria = temp[temp['taxonomy'].str.contains('Bacteria')]

    ## Now iterate across each row in the temp dataframe, exporting
    ## the sequence the specified number of times to a fasta file.
    ## A running counter keeps the read names unique within a sample.
    i = 0
    with open(study + '.bacteria.fasta', 'w') as output:
        for row in temp_bacteria.iterrows():
            sequence = row[1]['sequence']
            for j in range(row[1]['knt']):
                i += 1
                name = row[0] + '.' + str(i)
                print('>' + name, file=output)
                print(sequence, file=output)

To end I’d like to circle back to the original dual problems of poor database design and erroneous accession numbers.  Although these problems were not directly connected in this case they exacerbated each other, magnifying the overall problem of data (non)reuse.  Some suggestions:

  1.  The SRA and other databases need to aggregate the run accession numbers for each new study.  A couple of clicks should lead me to a download link for every sequence associated with a study, whether that study generated the raw data or only used it.  I don’t believe it is currently possible to do this within SRA; initiating a new project assumes that you will contribute new raw sequences (correct me if you think otherwise).  This should be easy to implement, and the association of a publication with a project should be rigorously enforced by journals, departments, and funding agencies.
  2. Accession numbers should be automatically checked by journals the way references are.  If an accession points to something that isn’t downloadable (or at least doesn’t have a download link on the page) the author should correct it.  There is no excuse for incorrect, incomplete, or non-run accession numbers.  That this was allowed to happen in PNAS – a top-tier journal – is doubly frustrating.  And frankly, if you reviewed this paper you owe me a beer for not spot checking the refs.
  3. Validating this information is the job of the reviewers.  Almost nobody does this.  Usually when I review a paper containing sequence data I’m the only reviewer that insists the data is available at the time of publication (I can’t think of a single case where I wasn’t the only reviewer to insist).  Not only is this a necessary check, but part of the job of the reviewer is to spot check the analysis itself.  Clearly, without data that’s not being done either…
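As a proof of concept, the accession-format check in point 2 is nearly trivial to implement.  The patterns below are my reading of the SRA/ENA/DDBJ prefix conventions, not an official specification:

```python
import re

# Rough patterns for common INSDC accession types (an approximation
# of the conventions, not an authoritative spec): runs, experiments,
# and projects, for SRA (S), ENA (E), and DDBJ (D) submissions.
ACC_PATTERNS = {
    'run':        re.compile(r'^[SED]RR\d{6,}$'),
    'experiment': re.compile(r'^[SED]RX\d{6,}$'),
    'project':    re.compile(r'^[SED]RP\d{6,}$'),
}

def classify_accession(acc):
    """Return the accession type, or None if nothing matches."""
    for acc_type, pattern in ACC_PATTERNS.items():
        if pattern.match(acc):
            return acc_type
    return None
```

A journal-side checker could flag any supplementary accession that isn't a run accession (or doesn't resolve to a download at all) before the paper ever reaches reviewers.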

And that’s enough ranting about data availability for one day, back to work…

Roadmap to Ocean Worlds: Polar microbial ecology and the search for boring, normal life

Mon, 05/16/2016 - 20:11

Recently congress recommended that NASA create an Ocean Worlds Exploration Program whose primary goal is “to discover extant life on another world using a mix of Discovery, New Frontiers, and flagship class missions”.  Pretty awesome.  In February I was invited to participate on the science definition team for the Roadmap to Ocean Worlds (ROW) initiative.  The ROW is step one in NASA’s response to the congressional recommendation.  Over the last few months the science definition team has been meeting remotely to hash through some challenging questions around the central themes of identifying ocean worlds, defining their habitability, and understanding the life that might live in or on them.  This week we will have our first and only opportunity to meet in person as a team.  As one of very few biologists in the group, and possibly the only ecologist, I’ll be giving a short talk on polar microbial ecology as it pertains to discovering and understanding life on other worlds.  As a way to organize my thoughts on the subject and prep for the talk I’m going to try an experiment and write out the main points of the presentation as an article.

I decided to title the talk (and this post) Polar microbial ecology and the search for boring, normal life because it reflects an idea that I and others have been advocating for some time: there isn’t anything that special about extreme environments (gasp!).  Life functions there in pretty much the same way that it functions everywhere else.  One way to think of this is that polar microbes (and other extremophiles) are uniquely adapted but not unique; they follow the standard rules of ecology and evolution, including how they acquire energy and deal with stress.  The nuances of how they interact with their environment reflect the unique characteristics of that environment, but the microbes are generally kind of conventional.

Act I: Challenges and opportunities for polar microbes

As with microbes in any environment, polar microbes are presented with a range of challenges and opportunities that define how they live.  From a microbial perspective challenges are stressors, or things that cause damage.  Opportunities are sources of energy.  Very often these things are the same.  Take sunlight, for example.  This is the single most important source of energy on the planet, but get a little too much of it (or any of it, to be more accurate), and you start doing damage to yourself.  As a result you have to invest some of the acquired energy into offsetting this stress.  We can illustrate this with a very simple conceptual model.

The above figure shows a hypothetical (but plausible) relationship between energy and stress.  Energy is shown both on the x-axis and as the blue 1:1 line.  Stress is the orange line.  In this scenario stress increases exponentially as energy increases, eventually overtaking the 1:1 line.  The energy that is available for growth (that which is not dedicated to offsetting the damage caused by stress) is energy – stress, or the difference between the blue and orange lines.  That looks like this:

The units for biomass are, of course, arbitrary; what matters is the shape of the curve.  Past a critical energy value the amount of biomass an ecosystem can sustain actually decreases as energy increases.  At the point at which stress = energy, no biomass is being accumulated.  It isn’t a coincidence that the shape of the biomass curve is the same as experimentally determined temperature curves for bacterial isolates belonging to different temperature classes (below).  In those curves the energy in the system is increasing as temperature rises.  This enhances enzymatic reactions (producing more biomass) until there is so much energy that the system becomes disordered (enzymes denature).  This process is directly analogous to the one presented here, but takes place at the population level rather than the ecosystem level.


Figure from  Temperature curves for a psychrophilic (optimized to low temperature) bacterium and a mesophilic (optimized to roughly human body temperature) relative.
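The conceptual model is easy to sketch numerically.  The functional form chosen for stress below is my own invention; any stress term that eventually overtakes the 1:1 energy line gives the same qualitative picture:

```python
import numpy as np

# Energy available to an ecosystem, in arbitrary units
energy = np.linspace(0, 10, 101)

# Hypothetical stress term that grows faster than energy at the
# high end, so it eventually overtakes the 1:1 energy line
stress = 0.05 * np.exp(0.55 * energy)

# Energy left over for building biomass
biomass = energy - stress

# The biomass curve peaks at an intermediate energy level and
# falls to zero where stress equals energy
peak_energy = energy[np.argmax(biomass)]
```

Plotting `biomass` against `energy` reproduces the humped curve described above: rising at low energy, peaking at `peak_energy`, and collapsing once stress exceeds energy.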

One thing that I would like to make clear up front is that low temperature itself is not a stressor.  Low temperature alone doesn’t kill single-celled organisms, it only impedes their ability to deal with other stress.  Because of this energy and stress have a very complicated relationship with temperature in cold ecosystems.  For example:

  • As temperature decreases, energy decreases.
  • As energy decreases, stress decreases (remember that in our conceptual model stress is a function of energy).
  • As temperature decreases, access to substrate (and by proxy energy) increases.
  • As temperature decreases, inhibitory compounds (which are stress) increase.

So many of the effects of low temperature are in conflict.  Highly adapted psychrophiles play off these conflicts to establish niches in cold environments.  We can illustrate this idea by focusing on the last two bullet points, which are specific to environments that contain ice.  Impure ice (and virtually all environmental ice is impure) does not form a solid structure when it freezes.  It forms a porous matrix, with the porosity determined by the concentration of solutes and the temperature.  You can actually view this in 3D using sophisticated X-ray tomography techniques, as shown in this figure from Pringle et al., 2009.


From Pringle et al., 2009.  X-ray tomography images of saline ice at different temperatures.  The gold represents pore spaces, which decrease in size as the temperature drops.

The gold spaces in this figure are pore spaces within the ice.  As the ice gets colder the pore spaces become smaller and the solutes contained in them become more concentrated.  That’s because whatever was dissolved in the starting material (e.g. salt, sugars, small particles) is excluded from the ice as it forms crystals; the excluded material ends up in the pore spaces.  The colder the ice, the smaller the spaces, the more concentrated the solutes.  Bacteria and ice algae that are also excluded into these spaces will experience a much higher concentration of nutrients, potential sources of carbon, etc.  This equates to more energy, which helps them offset the stress of being in a very salty environment.  In very cold, relatively pure ice, solute concentrations can be 1000x those of the starting material.  Neat, right?
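The arithmetic behind that concentration effect is just conservation of mass: the solutes rejected by the growing ice crystals all end up in whatever liquid remains.  A sketch (the liquid fractions here are illustrative numbers, not measurements):

```python
def brine_concentration_factor(liquid_fraction):
    """Concentration of the residual brine relative to the starting
    material, assuming all solutes stay in the liquid fraction."""
    return 1.0 / liquid_fraction

# Seawater-derived sea ice: a ~35 ppt starting salinity reaches
# ~200 ppt when roughly 17.5 % of the volume remains liquid
print(35.0 * brine_concentration_factor(0.175))   # ~200 ppt

# A 1000x concentration corresponds to only 0.1 % of the
# original volume remaining liquid
print(brine_concentration_factor(0.001))          # ~1000x
```

In real ice the brine eventually saturates and salts precipitate, so this linear scaling is an upper bound, but it captures why colder ice means saltier pore spaces.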

Here’s what decreasing temperature actually means for a bacterial cell trying to make a go of it in a cold environment:

Our earlier plot of biomass depicted cells in the “growth” phase shown here.  Only very early in the plot, when energy was very low, and at the very end, when stress was very high, did biomass approach zero.  That zero point is called the maintenance phase.  Maintenance phase happens when decreasing temperature suppresses the ability to deal with stress to the point that a cell cannot invest enough resources into creating biomass to reproduce.  The cell is investing all its energy in offsetting the damage caused by stress.

At increasingly low temperatures the synthesis of new biomass becomes increasingly less efficient, requiring a proportionally greater expenditure of energy (we say that bacterial growth efficiency decreases).  At the end of the maintenance phase the bacterial cell is still respiring, that is, still generating energy, but this results in virtually no synthesis or repair of cellular components.  This leads inevitably to cell death when enough cellular damage has accumulated.  It’s sad.

If we lift our heads up out of the thermodynamic weeds for a moment we can consider the implications of all this for finding life on another cold world (all the good ones are cold):

On the assumption that everyone’s asleep by this point in the presentation I’m trying to be funny, but the point is serious.  You hear a lot of silly (my bias) talk in the astrobiology community about life at the extremes; how low can it go temperature-wise, etc.  Who cares?  I’m not particularly interested in finding a maintenance-state microbial community, nor do we stand a very good chance of detecting one even if it was right under our (lander’s) nose.  The important contribution to make at this point is to determine where life is actively growing on the relevant ocean worlds.  Then we can try to figure out how to reach those places (no mean task!).  The point of this whole exercise is to find extant life, after all…

Act II: Where are polar microbes distributed and why?

I can think of 8 polar environments that are particularly relevant to ocean worlds:

  • Supraglacial environment (glacier surface)
  • Interstitial glacier environment (glacier interior)
  • Subglacial environment (where the glacier meets rock)
  • Sea ice surface (top’o the sea ice)
  • Interstitial sea ice (sea ice interior)
  • Sea ice-seawater interface (where the sea ice meets the seawater)
  • Water column below ice (the ocean! or a lake)
  • The sediment-water interface (where said lake/ocean meets mud)

Each of these environments has a range of stress, energy, and temperatures, and this to a large extent defines their ability to support biomass.  For the purpose of this discussion I’ll report biomass in units of bacteria ml⁻¹.  To give some sense of what this means consider that what the waste-treatment profession politely calls “activated sludge” might have around 10⁸ bacteria ml⁻¹.  Bottled water might have 10³ bacteria ml⁻¹.  Standard ocean water ranges between 10⁴ bacteria ml⁻¹ and 10⁶ bacteria ml⁻¹ at the very highest end.  One further note on biomass… our conceptual model doesn’t account for grazing, or any other trophic transfer of biomass, because that biomass is still in the biological system.  Bacteria ml⁻¹ doesn’t reflect this, because bacteria are consumed by other things.  So the values I give for the ecosystems below are at best a loose proxy for the capacity of each ecosystem to support biomass.

From left to right, top to bottom: supraglacial environment (, interstitial glacier environment (Price, 2000), subglacier environment – arguably a subglacial lake, but a very shallow one with interface characteristics (, sea ice surface (hey, that’s me!), sea ice interior (Krembs et al., 2002), sea ice-seawater interface (I.A. Melnikov), water below ice (from our last season at Palmer Station), sediment-water interface (

Let’s consider the sea ice-seawater interface and the sea ice surface in a little more detail.  The sea ice-seawater interface is interesting because it has, among all polar microbial environments, probably the greatest capacity to support biomass (we’re talking about summertime sea ice, of course).  It is located at the optimum balance point between energy and stress, as shown in the figure below.  Despite the fact that the sea ice surface has relatively high biomass, it is located at the extreme right side of the plot.  The trick is that biomass has accumulated at the sea ice surface as a result of abiotic transport, not in situ growth.  I explored this pretty extensively back in the very early days of my PhD (see here and here).


Despite the fact that the sea ice surface doesn’t function as a microbial habitat, the concentration of biomass there might be relevant to our goal of finding extant life on another world.  The accumulation of bacteria at the ice surface is largely the result of bacteria being passively transported with salt.  Recall that saline ice is a porous matrix containing a liquid brine.  All of the bacteria in the ice are also contained in the liquid brine; as the ice cools the pore spaces get smaller, forcing some of the brine and bacteria to the ice surface.  Porous ice in, say, the Europan ice shell would act similarly.  Salt is easier to identify than life (in fact we know that there are large salt deposits on the surface of Europa), so we can target salty areas for deeper search.  In astrobiology we like to “follow things” (because we get lost easy?); “follow the water” and “follow the energy” are often cited and somewhat useful axioms.  So here we have a “follow the salt” situation.

Act III: Can we explore polar microbial ecology in life detection?

Act II ended on a life detection note, so let’s follow this through and consider other details of polar microbial ecology that might give us clues of what to look for if we want to identify life.  Getting ecology to be part of this discussion (life detection) is non-trivial.  Despite the fact that the whole field of astrobiology is really an elaborate re-visioning of classic ecology (who lives where and why), there isn’t a whole lot of interest in studying how organisms interact with their environment.  This simply reflects the field’s disciplinary bias; the most active communities within astrobiology may well be astronomers and planetary geologists.  If the most active community was ecologists we’d probably be ignoring all kinds of important stuff about how planets are formed, etc.  To illustrate the utility of ecology for life detection here are two examples of ecological principles that might lead to life detection techniques.

Case 1 – Biological alteration of the microstructure of ice


From Krembs et al., 2002.  This sea ice diatom has filled the pore space with EPS as a buffer against high salinity and as a cryoprotectant.

Earlier I put forward that decreasing temperatures can lead to heightened stress because life in ice is exposed to higher concentrations of damaging substances at lower temperatures.  The most obvious example is salt.  In very cold sea ice the pore space salinity approaches 200 ppt; that’s roughly 6 times the salinity of seawater!  Maintaining adequate internal water is a huge challenge for a cell under these conditions.  One of the mechanisms for dealing with this is to produce copious amounts of exopolysaccharides (EPS), a hydrated gel (essentially mucus) containing polysaccharides, proteins, and short peptides.  EPS buffers the environment around a cell, raising the activity of water and, in some cases, interacting with ice.  This produces a highly modified internal structure, as shown in the images below.  This alteration could be a useful way of identifying ice on an ocean world that has been modified by a biological community.

From Krembs et al., 2011.


Case 2: Motility

Motility has been put forward before as an unambiguous signature of life, but the idea hasn’t really gained a lot of traction.  Clearly one needs to be cautious with this – Brownian motion can look a lot like motility – but I can’t think of anything else that life does that is as easily distinguishable from abiotic processes.  One additional challenge, however, is that not all life is motile.  Plants aren’t motile, at least most of them over the timescales that we care to stare at them for.  Not all microbes are motile either, but I would argue that the non-motile ones are non-motile only because others are.

From Stocker, 2012.


Consider the figure at right, which is a cartoon of two modes of bacterial life in the ocean.  One of those modes is motile, and can be seen using flagella to follow chemical gradients (we call this chemotaxis) and optimize their location with respect to phytoplankton, their source of carbon.  The second mode is much smaller and in the background; small-bodied non-motile cells that live on the diffuse carbon that they opportunistically encounter.  This works because that would be an inefficient niche for motile bacteria to exploit.  In the absence of motility however, chemical gradients constitute a very strong selective pressure to evolve motility.  We can see evidence of this in the convergent evolution of flagellar motility (and other forms of motility) in all three domains of life.  Although they may share a common chemotaxis sensory mechanism, the Bacteria, Archaea, and Eukarya all seem to have evolved flagellar motility independently.  This means that it’s probably a pretty good feature to have, and is likely to be shared by microbes on other ocean worlds.
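The selective advantage of chemotaxis is easy to demonstrate with a toy model.  Below is a minimal run-and-tumble sketch (the parameters are invented for illustration, not taken from Stocker, 2012): cells reverse direction less often when headed up-gradient, and that bias alone produces net movement toward the source:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(n_cells=200, n_steps=1000, p_up=0.1, p_down=0.5):
    """1D run-and-tumble: cells step left or right, reversing
    direction with a lower probability when moving up-gradient
    (the gradient is assumed to increase with x)."""
    pos = np.zeros(n_cells)
    direction = rng.choice([-1.0, 1.0], n_cells)
    for _ in range(n_steps):
        # tumble (reverse) more often when headed down-gradient
        p_tumble = np.where(direction > 0, p_up, p_down)
        tumble = rng.random(n_cells) < p_tumble
        direction[tumble] *= -1
        pos += direction
    return pos

final = simulate()
```

Even though each cell only modulates how often it tumbles, the population drifts steadily toward higher x, which is essentially how bacterial chemotaxis works.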

That was quite a lot, so to summarize:

  • Microbial communities are oriented along gradients of energy and stress.  At some optimal point along that gradient a maximum amount of biomass can be supported by surplus energy that is not being used to deal with stress (How much energy do you spend on stress?  Think of how much more biomass you could have!).
  • The relationships between energy, stress, and temperature are complicated, but Earth life generally works at T > -12 °C.  The true limit is probably a bit warmer than this estimate, because laboratory observations of growth at that temperature don’t accurately reflect environmental stress.
  • Life is strongly biased towards surfaces and interfaces; these may provide enhanced opportunities for life detection (follow the salt!).
  • The specific ecology of cold organisms can provide some further insights into life detection strategies.  For example, motility might be an under-appreciated signature of life.


Phylogenetic placement re-re-visited

Fri, 05/13/2016 - 15:48

Disclaimer: I banged this out fast from existing scripts to help some folks, but haven’t tested it yet.  Will do that shortly, in the meantime, be careful!


I use phylogenetic placement, namely the program pplacer, in a lot of my publications.  It is also a core part of the paprica metabolic inference pipeline.  As a result I field a lot of questions from people trying to integrate pplacer into their own workflows.  Although the Matsen group has done an excellent job with documentation for pplacer, guppy, and taxtastic, the three programs you need to work with to do phylogenetic placement from start to finish (see also EPA), there is still a steep learning curve for new users.  In the hope of bringing the angle of that curve down a notch or two, and updating my previous posts on the subject (here and here), here is a complete, start-to-finish example of phylogenetic placement, using 16S rRNA gene sequences corresponding to the new tree of life recently published by Hug et al.  To follow along with the tutorial start by downloading the sequences here.

You can use any number of alignment and tree-building programs to create a reference tree for phylogenetic placement.  I strongly recommend using RAxML and Infernal.  After a lot of experimentation this combination seems to produce the most correct topologies and best-supported trees.  You should be aware that no 16S rRNA gene tree (or any other tree) is absolutely “correct” for domain-level, let alone life-level, analyses, but nothing in life is perfect.  While you’re installing software I also recommend the excellent utility Seqmagick.  Finally, you will need a covariance model of the 16S rRNA gene to feed to Infernal.  You can find that at the Rfam database here.

The workflow will follow these steps:

  1. Create an alignment of the reference sequences with Infernal
  2. Create a phylogenetic tree of the alignment
  3. Create a reference package from the alignment, tree, and stats file
  4. Proceed with the phylogenetic placement of your query reads

Create an alignment of the reference sequences

The very first thing that you need to do is clean your sequence names of any wonky punctuation.  This is something that trips up almost everyone.  You should really have no punctuation in the names beyond “_”, and no spaces!

tr " -" "_" < hug_tol.fasta | tr -d "%\,;():=.\\*[]\"\'" > hug_tol.clean.fasta

Next create an alignment from the cleaned file.  I always like to degap first, although it shouldn’t matter.

## Degap
seqmagick mogrify --ungap hug_tol.clean.fasta

## Align
cmalign --dna -o hug_tol.clean.align.sto --outformat Pfam hug_tol.clean.fasta

## Convert to fasta format
seqmagick convert hug_tol.clean.align.sto hug_tol.clean.align.fasta

Build the reference tree

At this point you should have a nice clean DNA alignment in the fasta format.  Now feed it to RAxML to build a tree.  Depending on the size of the alignment this can take a little bit.  I know you’ve read the entire RAxML manual so of course you are already aware that adding additional cpus won’t help…

raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -s hug_tol.clean.align.fasta -n ref.tre -f d -p 12345

I like having a sensibly rooted tree; it’s just more pleasing to look at.  You can do this manually, or you can have RAxML try to root the tree for you.

raxmlHPC-PTHREADS-AVX2 -T 2 -m GTRGAMMA -f I -t RAxML_bestTree.ref.tre -n root.ref.tre

Okay, now comes the tricky bit.  Clearly you’d like to have some support values on your reference tree, but the Taxtastic program that we will use to build the reference package won’t be able to read the RAxML stats file if it includes confidence values.  The workaround is to build a second tree with confidence values.  You will feed this tree to Taxtastic along with the stats file from the tree we already generated.

## Generate confidence scores for tree
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -f J -p 12345 -t RAxML_rootedTree.root.ref.tre -n conf.root.ref.tre -s hug_tol.clean.align.fasta

Now we can use the alignment, the rooted tree with confidence scores, and the stats file without confidence scores to create our reference package.

taxit create -l 16S_rRNA -P hug_tol.refpkg --aln-fasta hug_tol.clean.align.fasta --tree-stats RAxML_info.ref.tre --tree-file RAxML_fastTreeSH_Support.conf.root.ref.tre

Align the query reads

At this point you have the reference package and you can proceed with analyzing some query reads!  The first step is to align the query reads in exactly the same fashion as the reference sequences.  This is important as the alignments will be merged later.

## Clean the names
tr " -" "_" < query.fasta | tr -d "%\,;():=.\\*[]\"\'" > query.clean.fasta

## Remove any gaps
seqmagick mogrify --ungap query.clean.fasta

## Align
cmalign --dna -o query.clean.align.sto --outformat Pfam query.clean.fasta

Now we use the esl-alimerge command, included with Infernal, to merge the query and reference alignments.

## Merge alignments
esl-alimerge --outformat pfam --dna -o query.hug_tol.clean.align.sto query.clean.align.sto hug_tol.refpkg/hug_tol.clean.align.sto

## Convert to fasta
seqmagick convert query.hug_tol.clean.align.sto query.hug_tol.clean.align.fasta

Phylogenetic placement

Now we’re on the home stretch: we can execute the phylogenetic placement itself!  The flags are important here, so it’s worth checking the pplacer documentation to ensure that your goals are consistent with mine (get a job, publish some papers?).  You can probably accept most of the flags for the previous commands as is.

pplacer -o query.hug_tol.clean.align.jplace -p --keep-at-most 20 -c hug_tol.refpkg query.hug_tol.clean.align.fasta

At this point you have a file named query.hug_tol.clean.align.jplace.  You will need to use guppy to convert this json-format file to something human-readable.  The two most useful guppy commands (in my experience) for a basic look at your data are:

## Generate an easily parsed csv file of placements, with only a single placement reported for each query read.
guppy to_csv --point-mass --pp -o query.hug_tol.clean.align.csv query.hug_tol.clean.align.jplace

## Generate a phyloxml tree with edges fattened according to the number of placements.
guppy fat --node-numbers --point-mass --pp -o query.hug_tol.clean.align.phyloxml query.hug_tol.clean.align.jplace
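Once you have the csv, standard command-line tools are enough for a first summary.  Here's a sketch using a fabricated miniature placement file; the real guppy to_csv output has more columns, and the column names and order shown here are assumptions, so check your own header before hard-coding field numbers.

```shell
# Fabricated stand-in for a guppy to_csv file (column names assumed)
cat > placements.csv <<EOF
origin,name,multiplicity,edge_num,like_weight_ratio,post_prob
run1,read1,1,12,1.0,0.98
run1,read2,1,12,1.0,0.95
run1,read3,1,7,1.0,0.99
EOF

# Tally the number of query reads placed on each reference tree edge
# (edge_num assumed to be field 4), most heavily used edges first
tail -n +2 placements.csv | cut -d, -f4 | sort | uniq -c | sort -rn > edge_tally.txt
cat edge_tally.txt
```

For this toy file the tally shows two reads on edge 12 and one on edge 7, which is the same information the fattened phyloxml tree conveys visually.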

Another victim of science funding

Tue, 04/12/2016 - 15:27

Or rather the lack thereof.  I was very disappointed to receive an email yesterday announcing that BioCyc, a popular database of enzymes and metabolic pathways in model organisms, is moving to a subscription model.  The email is posted below in its entirety (to assist with the request to spread the word).  With this move BioCyc cements the pattern established several years ago by the Kyoto Encyclopedia of Genes and Genomes (KEGG).  I don’t begrudge the architects of these databases for moving behind a paywall; as the BioCyc principal investigator Peter Karp notes below, developing and maintaining a high-quality biological database is time- and resource-intensive, and skilled curators are expensive.  I do think, however, that moving these resources to a subscription-based model does a disservice to the research community and is not in the public interest.  While it was not the responsibility of US funding agencies to ensure the long-term viability and accessibility of KEGG, BioCyc is in their court.  In my opinion the failure of NSF and NIH to adequately support community resources amounts to a dereliction of duty.

As noted in the letter below, one of the challenges faced by database designers and curators is the tradition of peer review.  At its best the peer review process is an effective arbiter of research spending.  I think the peer review process is at its best only under certain funding regimes, however, and I suspect that the current low rates of federal funding for science do not allow for fair and unbiased peer review.  This is particularly the case for databases and projects whose return on investment is not evident in the short term in one or two high-impact publications, but in the supporting role they play in dozens, hundreds, or thousands of studies across the community over many years.  Program officers, managers, and directors need to be cognizant of the limitations of the peer review process, and should not shy away from some strategic thinking every now and again.

My first experience with BioCyc came in my first year of graduate school, when I was tentatively grappling with the relationship between gene and genome functions and the ecology of cold-adapted microbes.  Like many academic labs around the country, the lab I was in was chronically short on funds, and an academic license for expensive software (e.g. CLC Workbench) or a database subscription (à la KEGG or BioCyc – both were thankfully free back in those days) would have been out of the question.  Without these tools I simply would have had nowhere to start my exploration.  I fear that the subscription model creates intellectual barriers, partitioning good ideas (which might arise anywhere) from the tools required to develop them (which will only be found in well-funded or specialist labs).

Viva community science!


Dear Colleague,

I am writing to request your support as we begin a new chapter in the development of BioCyc.

In short, we plan to upgrade the curation level and quality of many BioCyc databases to provide you with higher quality information resources for many important microbes, and for Homo sapiens.  Such an effort requires large financial resources that — despite numerous attempts over numerous years — have not been forthcoming from government funding agencies.  Thus, we plan to transition BioCyc to a community-supported non-profit subscription model in the coming months.

Our Goal

Our goal at BioCyc is to provide you with the highest quality microbial genome and metabolic pathway web portal in the world by coupling unique and high-quality database content with powerful and user-friendly bioinformatics tools.

Our work on EcoCyc has demonstrated the way forward.  EcoCyc is an incredibly rich and detailed information resource whose contents have been derived from 30,000 E. coli publications.  EcoCyc is an online electronic encyclopedia, a highly structured queryable database, a bioinformatics platform for omics data analysis, and an executable metabolic model.  EcoCyc is highly used by the life-sciences community, demonstrating the need and value of such a resource.

Our goal is to develop similar high-quality databases for other organisms.  BioCyc now contains 7,600 databases, but only 42 of them have undergone any literature-based curation, and that occurs irregularly.  Although bioinformatics algorithms have undergone amazing advances in the past two decades, their accuracy is still limited, and no bioinformatics inference algorithms exist for many types of biological information.  The experimental literature contains vast troves of valuable information, and despite advances in text mining algorithms, curation by experienced biologists is the only way to accurately extract that information.

EcoCyc curators extract a wide range of information on protein function; on metabolic pathways; and on regulation at the transcriptional, translational, and post-translational levels.

In the past year SRI has performed significant curation on the BioCyc databases for Saccharomyces cerevisiae, Bacillus subtilis, Mycobacterium tuberculosis, Clostridium difficile, and (to be released shortly) Corynebacterium glutamicum. All told, BioCyc databases have been curated from 66,000 publications, and constitute a unique resource in the microbial informatics landscape.  Yet much more information remains untapped in the biomedical literature, and new information is published at a rapid pace.  That information can be extracted only by professional curators who understand both the biology, and the methods for encoding that biology in structured databases.  Without adequate financial resources, we cannot hire these curators, whose efforts are needed on an ongoing basis.

Why Do We Seek Financial Support from the Scientific Community?

The EcoCyc project has been fortunate to receive government funding for its development since 1992.  Similar government-supported databases exist for a handful of biomedical model organisms, such as fly, yeast, worm, and zebrafish.

Peter Karp has been advocating that the government fund similar efforts for other important microbes for the past twenty years, such as for pathogens, biotechnology workhorses, model organisms, and synthetic-biology chassis for biofuels development.  He has developed the Pathway Tools software as a software platform to enable the development of curated EcoCyc-like databases for other organisms, and the software has been used by many groups. However, not only has government support for databases not kept pace with the relentless increases in experimental data generation, but the government is funding few new databases, is cutting funding for some existing databases (such as for EcoCyc, for BioCyc, and for TAIR), and is encouraging the development of other funding models for supporting databases [1].  Funding for BioCyc was cut by 27% at our last renewal whereas we are managing five times the number of genomes as five years ago. We also find that even when government agencies want to support databases, review panels score database proposals with low enthusiasm and misunderstanding, despite the obvious demand for high-quality databases by the scientific community.

Put another way: the Haemophilus influenzae genome was sequenced in 1995.  Now, twenty years later, no curated database that is updated on an ongoing basis exists for this important human pathogen.  Mycobacterium tuberculosis was sequenced in 1998, and now, eighteen years later, no comprehensive curated database exists for the genes, metabolism, and regulatory network of this killer of 1.5 million human beings per year.  No curated database exists for the important gram-positive model organism Bacillus subtilis.  How much longer shall we wait for modern resources that integrate the titanic amounts of information available about critical microbes with powerful bioinformatics tools to turbocharge life-science research?

How it Will Work and How You Can Support BioCyc

The tradition whereby scientific journals receive financial support from scientists in the form of subscriptions is a long one.  We are now turning to a similar model to support the curation and operation of BioCyc.  We seek individual and institutional subscriptions from those who receive the most value from BioCyc, and who are best positioned to direct its future evolution.  We have developed a subscription-pricing model that is on par with journal pricing, although we find that many of our users consult BioCyc on a daily basis — more frequently than they consult most journals.

We hope that this subscription model will allow us to raise more funds, more sustainably, than is possible through government grants, through our wide user base in academic, corporate, and government institutions around the world.  We will also be exploring other possible revenue sources, and additional ways of partnering with the scientific community.

BioCyc is collaborating with Phoenix Bioinformatics to develop our community-supported subscription model.  Phoenix is a nonprofit that already manages community financial support for the TAIR Arabidopsis database, which was previously funded by the NSF and is now fully supported [2] by users.

Phoenix Bioinformatics  will collect BioCyc subscriptions on behalf of SRI International, which like Phoenix is a non-profit institution. Subscription revenues will be invested into curation, operation, and marketing of the BioCyc resource.

We plan to go slow with this transition to give our users time to adapt.  We’ll begin requiring subscriptions for access to BioCyc databases other than EcoCyc and MetaCyc starting in July 2016.

Access to the EcoCyc and MetaCyc databases will remain free for now.

Subscriptions to the other 7,600 BioCyc databases will be available to institutions (e.g., libraries), and to individuals.  One subscription will grant access to all of BioCyc.  To encourage your institutional library to sign up, please contact your science librarian and let him or her know that continued access to BioCyc is important for your research and/or teaching.

Subscription prices will be based on website usage levels and we hope to keep them affordable so that everyone who needs these databases will still be able to access them.  We are finalizing the academic library and individual prices and will follow up soon with more information including details on how to sign up. We will make provisions to ensure that underprivileged scientists and students in third-world countries aren’t locked out.

Please spread the word to your colleagues — the more groups who subscribe, the better quality resource we can build for the scientific community.

Peter Karp

Director, SRI Bioinformatics Research Group



Tutorial: Annotating metagenomes with paprica-mg

Sat, 03/26/2016 - 03:00

This tutorial is both a work in progress and a living document.  If you see an error, or want something added, please let me know by leaving a comment.

Starting with version 3.0.0 paprica contains a metagenomic annotation module.  This module takes as input a fasta or fasta.gz file containing the QC’d reads from a shotgun metagenome and uses DIAMOND Blastx to classify these reads against a database derived from the paprica database.  The module produces as output:

  1.  Classification for each read in the form of an EC number (obviously this applies only to reads associated with genes coding for enzymes).
  2. A tally of the occurrence of each EC number in the sample, with some useful supporting information.
  3. Optionally, the metabolic pathways likely to be present within the sample.

In addition to the normal dependencies paprica-mg requires DIAMOND Blastx.  Follow the instructions in the DIAMOND manual, and be sure to add the location of the DIAMOND executables to your PATH.  If you want to predict metabolic pathways on your metagenome you will need to also download the pathway-tools software.  See the notes here.
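Adding an executable directory to your PATH trips up some users, so here is a minimal sketch; the $HOME/diamond location is purely hypothetical, so substitute wherever you actually unpacked the DIAMOND binaries.

```shell
# Append a hypothetical install directory to PATH (adjust to your system),
# then verify that it is actually there before trying to run DIAMOND
export PATH="$PATH:$HOME/diamond"
case ":$PATH:" in
  *":$HOME/diamond:"*) echo "diamond dir on PATH" ;;
  *) echo "diamond dir NOT on PATH" ;;
esac > path_check.txt
cat path_check.txt
```

To make the change permanent, put the export line in your shell startup file (e.g. ~/.bashrc) rather than typing it each session.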

There are two ways to obtain the paprica-mg database.  You can obtain a pre-made version by downloading the files paprica-mg.dmnd and (large!) to the ref_genome_database directory.  Be sure to gunzip before continuing.  If you wish to build the paprica-mg database from scratch, perhaps because you’ve customized the database or are rebuilding it more frequently than the release cycle, you will first need to build the regular paprica database.  Then build the paprica-mg database like so:

-ref_dir ref_genome_database

If you’ve set paprica up in the standard way you can execute this command from anywhere on your system; the paprica directory is already in your PATH, and the script will look for the directory “ref_genome_database” relative to itself.  Likewise you don’t need to be in the paprica directory to execute the script.

Once you’ve downloaded or built the database you can run your analysis.  It is worth spending a little time with the DIAMOND manual and considering the parameters of your system.  To try things out you can download a “smallish” QC’d metagenome from the Tara Oceans Expedition (selected randomly for convenient size):

## download a test metagenome
wget

## execute paprica-mg for EC annotation only
-i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways F

This will produce the following output:

test.annotation.csv: The number of hits in the metagenome, by EC number.  See the paprica manual for a complete explanation of the columns.  The run also produces a DIAMOND-format results file and a text file of the DIAMOND results; only one hit per read is reported in each.
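For a quick look at the most abundant EC numbers in test.annotation.csv, something like the following works.  The csv fragment here is fabricated and the real file has more columns, so check the header of your own output and adjust the field number before relying on this.

```shell
# Fabricated two-column stand-in for test.annotation.csv
cat > annotation_example.csv <<EOF
EC_number,n_hits
1.1.1.1,40
2.7.7.6,120
6.3.2.10,15
EOF

# Rank EC numbers by abundance (hit count assumed to be field 2)
tail -n +2 annotation_example.csv | sort -t, -k2,2 -rn > ec_ranked.txt
head -3 ec_ranked.txt
```

For this toy file the top-ranked line is 2.7.7.6 with 120 hits; on a real metagenome you would pipe the ranked list into whatever downstream comparison you have in mind.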

Predicting pathways on a metagenome is very time intensive, and it isn’t clear what the “correct” way to do this is.  I’ve tried to balance speed with accuracy in paprica-mg.  If you execute with -pathways T, DIAMOND is executed twice: once for the EC number annotation as above (reporting only a single hit for each read), and once to collect data for pathway prediction.  On the second search DIAMOND reports as many hits for each read as there are genomes in the paprica database (of course most reads will have far fewer hits, if any).  The reason for this approach is to reconstruct, as well as possible, the actual genomes that are present.  For example, suppose a given read has a significant hit to an enzyme in genome A and in genome B.  When aggregating information for pathway-tools, that enzyme will be presented to pathway-tools in separate Genbank files representing separate genetic elements for genome A and genome B.  A second enzyme, critical to the prediction of a given pathway, might get predicted for only genome A or genome B; because a missing enzyme in either genome could cause a negative prediction for the pathway, reporting all hits gives us the best chance of capturing the whole pathway in at least one genome.  The idea is that incomplete pathways will get washed out at the end of the analysis, and since pathway prediction is by its nature non-redundant (each pathway can only be predicted once) over-prediction is minimized.  To predict pathways during annotation:

## execute paprica-mg for EC annotation and pathway prediction
-i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways T -pgdb_dir /location/of/ptools-local

In addition to the files already mentioned, you will see:

test_mg.pathologic: a directory containing all the information that pathway-tools needs for pathway prediction.

test.pathways.txt: A simple text file of all the pathways that were predicted.

test.paprica-mg.txt: A very large text file of all hits for each read.  You probably want to delete this right away to save space.

test.paprica-mg.daa: A very large DIAMOND results file of all hits for each read.  You probably want to delete this right away to save space.

testcyc: A directory in ptools-local/pgdbs/local containing the PGDB and prediction reports.  It is worth spending some time here, and interacting with the PGDB using the pathway-tools GUI.


Antarctic wind-driven bacterial dispersal paper published

Thu, 03/24/2016 - 12:53
Frost flowers over young sea ice in the central Arctic Ocean. Photo Mattias Wietz.


I’m happy to report that one of the appendices in my dissertation was just published in the journal Polar Biology.  The paper, titled Wind-driven distribution of bacteria in coastal Antarctica: evidence from the Ross Sea region, was a long time in coming.  I conceived of the idea back in 2010 when it looked like my dissertation would focus on the microbial ecology of frost flowers; delicate, highly saline, and microbially enriched structures found on the surface of newly formed sea ice.  Because marine bacteria are concentrated in frost flowers we wondered whether they might serve as source points for microbial dispersal.  This isn’t as far-fetched as it might seem; bacteria are injected into the atmosphere through a variety of physical processes, from wind lofting to bubble-bursting, and frost flowers have been implicated as the source of wind deposited sea salts in glaciers far from the coast.

At the time we’d been struggling to reliably sample frost flowers at our field site near Barrow, Alaska.  Frost flowers form readily there throughout the winter, but extremely difficult sea ice conditions make it hard to access the formation sites.  We knew that there were more accessible formation sites in the coastal Antarctic, so we initiated a one year pilot project to sample frost flowers from McMurdo Sound.  By comparing the bacterial communities in frost flowers, seawater, sea ice, terrestrial snow, and glaciers, we hoped to show that frost flowers were a plausible source of marine bacteria and marine genetic material to the terrestrial environment.  Because the coastal Antarctic contains many relic marine environments, such as the lakes of the Dry Valleys, the wind-driven transport of bacteria from frost flowers and other marine sources could be important for continued connectivity between relic and extant marine environments.

Frost flowers are readily accessible in McMurdo Sound throughout the winter; however, this does not mean that one can simply head out and sample them.  While the ice conditions are far more permissive than at Barrow, Alaska, the bureaucracy is also far more formidable.  The can-do attitude of our Inupiat guides in Barrow (who perceive every far-out field plan as a personal challenge) was replaced with the inevitable can’t-do attitude at McMurdo (this was 2011, under the Raytheon Antarctic Support Contract, and does not reflect on the current Lockheed Antarctic Support Contract, though I don’t mean to suggest that this attitude doesn’t persist).  Arriving in late August, we were initially informed that our plan was much too risky without helicopter support, and that nothing could be done until mid-October when the helicopters began flying (we were scheduled to depart late October).  Pushing for a field plan that relied on ground transport ensnared us in various catch-22s, such as (paraphrased from an actual conversation):

ASC representative: You can’t take a tracked vehicle to the ice edge, they’re too slow.

Me: Can we take a snowmobile to the ice edge?  That would be faster.  We do long mid-winter trips in the Arctic and it works out fine.

ASC representative: No, because you have to wear a helmet, and the helmets give you frostbite.  So you can only use a snowmobile when it’s warm out.

Ultimately we did access the ice edge by vehicle several times before the helicopters started flying, but the samples reported in this publication all came from a furious two-week period in late October.  What we found really surprised us.

Sampling frost flowers is as easy as scraping them from the ice surface with a clean shovel into a sterile plastic bin.

Sampling frost flowers on a warm day in September, 2011, in McMurdo Sound.

Pulling very short ice cores from young sea ice after sampling frost flowers. You can see one of the short cores in the lower right of the image.

Sampling some of the frost flowers used in this study over much thinner ice north of Ross Island in October, 2011.

Photo: Shelly Carpenter.

Sampling “blue ice” on Taylor Glacier a few days earlier.  Comparing various terrestrial ice environments with the marine samples gave us some interesting insights on how bacteria are dispersed by strong winter winds in the coastal Antarctic.

There is ample evidence for the wind-driven transport of bacteria in this region but, contrary to our hypothesis, most of that material is coming from the terrestrial environment.  The major transportees were a freshwater cyanobacterium from the genus Pseudanabaena and a set of sulfur-oxidizing Gammaproteobacteria (GSO).  The cyanobacterium was pretty easy to understand; it forms mats in a number of freshwater lakes and meltponds in the region.  In the winter these freeze, and since snow cover is low, ice and microbial mats are ablated by strong winter winds.  Little pieces of mat are efficiently scattered all over, including onto the sea ice surface.


Easily dispersed: desiccated microbial mat in Lake Chad, 2005
(photo by A. Chiuchiolo), taken from

The GSO threw us for more of a loop; the most parsimonious explanation for their occurrence in frost flowers is that they came from hydrothermal features on nearby Mt. Erebus.  We did some nice analysis with wind vectors in the region and while you don’t get a lot of wind (integrated over time) to move material from Mt. Erebus to our sample sites, you do get some occasional very strong storms.

Wind velocity and magnitude for the study region in October, 2011. Taken from Bowman and Deming, 2016.


What all this means is that, consistent with other recent findings, there is a high degree of regional dispersal of microbes around the coastal Antarctic.  While I’m sure there are some endemic microbes occupying particularly unique niches, in general I expect microbes found in one part of the coastal Antarctic to be present in similar environments in other parts of the coastal Antarctic.  There are, however, quite a few ways to interpret this.  Bacteria and Archaea can evolve very fast, so the genome of a clonal population of (for example) wind-deposited Pseudanabaena newly colonizing a melt pond can diverge pretty quickly from the genome of the parent population.  This has a couple of implications.  First, it means that the coastal Antarctic, with all its complex topography yet high degree of microbial connectivity, is an excellent place to explore the dynamics of microbial adaptation and evolution, particularly if we can put constraints on the colonization timeline for a given site (non-trivial).  Second, it raises some questions about the propriety of commercially relevant microbes obtained from the continent.  The commercialization of the continent is probably inevitable (I hope it is not); perhaps the potential ubiquity of Antarctic microbes will provide some defense against the monopolization of useful strains, enzymes, and genes.