News aggregator

Roving the Abyss: It Takes a Team

The Future of Deep Science - Fri, 07/29/2016 - 16:13

The training cruise team’s first mission with the autonomous underwater vehicle (AUV) Sentry discovered an area of seafloor where methane is bubbling up, similar to this photo. The data will be used to plan the team’s next dive, this one with scientists inside a submersible. Photo: NOAA

By Bridgit Boulahanis 

Nothing about Sentry’s transition from ship to seafloor is simple or easy, but the group of engineers behind the autonomous underwater vehicle approaches the process like an Olympic synchronized swimming team. They dive in head first, understand their positions and roles, approach with unabashed enthusiasm, and know how to get the job done. Their coordination and skill made my belly flop into Sentry coordination look like a graceful swan dive.

At the center of this team is Carl Kaiser, program manager for the Sentry AUV. Carl became the program manager in 2011 and made a point to be a part of this training cruise because he believes that young scientists need to understand the power and versatility of AUVs. His expertise in autonomous underwater technology is invaluable to our diverse research group, and his passion is palpable.

Carl Kaiser stands in front of Sentry during an earlier mission in which the AUV became entangled in rope. Photo courtesy of Carl Kaiser.

“As early career scientists, you all want to make your mark, and to become world class researchers you will have to establish yourselves uniquely within your field,” he says, while checking over a proposed dive survey. “We have barely scratched the surface of what Sentry can do—she wasn’t available to previous generations—and in the coming years we will see what autonomous vehicles are truly capable of.”

Seeing Sentry in action makes it easy to see why Carl and his cohort are so excited about their jobs. AUVs can be incredibly customizable: While we are primarily using Sentry to map the seafloor and take high resolution photos of our research sites, it also is capable of oxygen measurements, current speed tracking, magnetic anomaly measurements, sub-bottom profiling and plankton collection, just to name a few. It is programmed from a command station aboard the ship, given a set of locations and sampling goals, and set free overboard to complete its directive before returning to the surface.

If diving in Alvin, a submersible that can carry two scientists to the seafloor, is like an astronaut’s trip into space, Sentry is similar to a planetary rover—nothing can replace the appeal of manned missions, but most of our real discoveries come from slightly less glamorous but incredibly important unmanned probes. Last night, while Sentry floated through the abyss gathering crucial data to help us understand the ocean, somewhere incredibly far away Curiosity roved across the Martian landscape, similarly transmitting information back to the scientists at NASA. I like to think that if Sentry and Curiosity could communicate across their vast and inhospitable separation they would end up close friends.

The autonomous underwater vehicle Sentry is controlled from this mobile command center. Photo: Bridgit Boulahanis

Our first Sentry mission returned this morning and was a rousing success. Right now, scientists aboard the ship and our colleagues on shore are excitedly processing the data. We will use the maps, photos and water column data that we extract from this to plan tomorrow’s Alvin dive.

Looking at the map the Sentry operations group has generated from last night’s dive, it is apparent that this powerful tool is going to play a key role in the scientific goals of many of us aboard this training cruise.

In fact, our first scientific meeting of the day started with chief scientist Adam Skarke holding up Sentry data showing that we have identified a spot where methane gas is currently seeping out of the ocean floor, leading to excited applause from everyone in the room. Those methane gas bubbles will be where we start our Alvin dive tomorrow, and they will be the research focus of many of the scientists here in the years to come.

Bridgit Boulahanis is a marine geophysics graduate student at Columbia University’s Lamont-Doherty Earth Observatory. Her research utilizes multichannel seismic reflection and refraction studies as well as multibeam mapping data to explore Mid-Ocean Ridge dynamics, submarine volcanic eruptions, and how oceanic crustal accretion changes through time. Read more about the training cruise in her first post.




Does the Disappearance of Sea Ice Matter? - New York Times

Featured News - Fri, 07/29/2016 - 07:52
Lamont's Marco Tedesco views the Arctic as a systems engineer would. He has been trying to “close the loop” and connect the exceedingly complex interactions that drive the northern climate system, which includes its sea ice, atmosphere and ocean circulations, and land ice.

When Doing Science at Sea, Prepare to Adapt

The Future of Deep Science - Fri, 07/29/2016 - 05:40
Lamont’s Bridgit Boulahanis, Sentry coordinator for the University-National Oceanographic Laboratory System (UNOLS) Deep-Submergence Science Leadership Cruise, gives a presentation aboard ship. Sentry is an AUV the team is using to explore the sea floor.

By Bridgit Boulahanis

My first official day as Sentry coordinator started with a 6 a.m. gathering on deck to watch the R/V Atlantis slide away from our dock at the Woods Hole Oceanographic Institution. Clutching my thermos of coffee, I stumbled onto the main deck to find Chief Scientist Adam Skarke looking alert enough to suggest he’d been up for hours.

“Everyone,” he called to the gathered crew of young scientists, “our departure is being delayed due to fog. We are now scheduled to leave port at 10:30 a.m.” The deck was smothered by mist, rendering it impossible for us to even successfully wave goodbye to the on-shore team who had gathered to see us off.

Adam’s announcement was met with a fair amount of concern from most of the scientists on board. We are an eager bunch, with a full schedule of data collection booked 24 hours a day once we arrive at our first science station. Skarke is in training too, but as chief scientist, he understands the need to keep his team inspired. After assuring us that most of our sampling plans should not be significantly hindered, he reminded us of what will likely be our motto in the coming days: “Science at sea requires constant adaptation.”

The underwater autonomous vehicle Sentry in the morning fog. Photo: Bridgit Boulahanis

Adam’s words rang particularly true: later in the morning, I sat with him and the Sentry engineers reevaluating the dive we planned for the night. We would be arriving on station only two hours later than scheduled, but that still meant we would need to make cuts in our mapping plan, according to Carl Kaiser (Sentry expedition leader) and Zac Berkowitz (Sentry expedition leader-in-training), from their command center in the Hydrolab. Together we discussed options: We could make a smaller map, we could allow larger gaps in our high resolution photos of the seafloor, or we could change the shape of our survey altogether. In the end, we decided to keep our large map and high-resolution data, but we will have to take photos over a smaller region of seafloor.

Hours later, after leaving a finally sunny Woods Hole port and conducting several safety drills, the scientists on board were once again busily planning missions and creating data collection spreadsheets. We gathered in the ship’s library and shared our mission plans for the coming days, and then at 9:50 p.m., we completed our first launch of Sentry. Barring any new “opportunities for adaptation,” she will return to the surface at 6 a.m. on Day 2 with the data we requested. For now, we have to keep our fingers crossed and wait.

Bridgit Boulahanis is a marine geophysics graduate student at Columbia University’s Lamont-Doherty Earth Observatory. Her research utilizes multichannel seismic reflection and refraction studies as well as multibeam mapping data to explore Mid-Ocean Ridge dynamics, submarine volcanic eruptions, and how oceanic crustal accretion changes through time. Read more about the training cruise in her first post.



Going Deep for Science

The Future of Deep Science - Thu, 07/28/2016 - 11:34
Bridgit Boulahanis will be planning the deep sea explorations and sea floor mapping work of the AUV Sentry during this training cruise. Photo: Bridgit Boulahanis

By Bridgit Boulahanis

“You are the future of deep submergence science,” mentors Dan Fornari and Cindy Van Dover tell our group of 24 ocean scientists gathered for our first pre-cruise meetings for the University-National Oceanographic Laboratory System (UNOLS) Deep-Submergence Science Leadership Cruise. Deep submergence science can mean a plethora of things, and this is reflected in the varied interests and goals of our group. Present in the room are Ph.D. students aiming to snatch octopuses from their deep-sea homes, postdoctoral researchers hoping to measure near-bottom ocean currents, associate professors attempting to record the sounds made by methane bubbles as they seep out of the seafloor, and researchers of many career stages and scientific interests in between.

We’ll have some help in the deep, and we’re all pretty excited about it: This training cruise aims to give early career scientists experience with two deep submergence assets—Alvin, a Human Occupied Vehicle (HOV) that can carry two scientists at a time to the ocean floor, and Sentry, an Autonomous Underwater Vehicle (AUV) that can roam for hours, collecting images and data. The scientists all came aboard R/V Atlantis this week with data collection goals designed to use these incredible machines while advancing science.

Marine scientists get the rare opportunity to explore the sea floor up close in the HOV Alvin. Photo: Bridgit Boulahanis

Alvin is famous for its role in exploring the wreckage of the RMS Titanic, but is an icon among scientists for its unique direct observation opportunities and sample collection capabilities. Alvin’s maneuverable arms can grab rocks, corals, and critters, and it has a variety of sensors to pick up information about the water surrounding the submarine. Of course, the draw goes well beyond data—scientists who strap themselves into Alvin for a dive get to directly experience the environment that our data describes. It’s often compared to an astronaut’s trip into space. Alvin gives scientists the opportunity to be immersed in the world we have dedicated our lives to but otherwise cannot explore first-hand.

Sentry, though less well known than Alvin, is no less powerful a tool for scientific discovery. It can be illustrative to think of Sentry as a submarine drone—scientists plan out missions in advance and provide Sentry with a map of locations for data collection before launching the AUV to conduct operations without human intervention for upwards of 40 hours. Sentry can function at depths up to 6,000 meters, and can be customized to collect data for versatile science goals. For a geophysicist, this is where the real excitement lives. We’ve all heard the statistic that less than 5 percent of the ocean has been explored, and Sentry has the power to change that, creating maps and taking photos at a resolution and scale that is impossible by almost any other means. Every time Sentry is launched we make another dent in that 95 percent left to be explored, and so every mission feels like a battle won for science. We will be launching Sentry at night to collect high-resolution photos and data mapping the seafloor. That valuable data will then inform Alvin’s later dives.

My role on this cruise is as Sentry’s coordinator, so I will be helping plan missions and process the mapping data that Sentry collects. I will also be acting as science liaison to the Sentry operations team. The first stop is a fascinating patch of seafloor called Veatch Canyon 2, where gas bubbles leach out of the seafloor and sea life has been spotted gathering. Past missions have identified corals, mussels and bacterial mats at this site, all of which are indicative of active gas seepage. We leave port at 6 a.m., and after 13 hours of transit we will arrive above our launch site—that is when my team will have to jump into action, getting Sentry overboard as quickly as possible to maximize our mapping and photo-taking time.

Stay tuned for updates on research life at sea, what it is like to work with Alvin and Sentry, and why all of this is so important for the future of marine science.

Bridgit Boulahanis is a marine geophysics graduate student at Columbia University’s Lamont-Doherty Earth Observatory. Her research utilizes multichannel seismic reflection and refraction studies as well as multibeam mapping data to explore Mid-Ocean Ridge dynamics, submarine volcanic eruptions, and how oceanic crustal accretion changes through time.





The Definition of an Explorer - The Low Down

Featured News - Thu, 07/21/2016 - 12:00
In this audio podcast, Lamont's Hugh Ducklow, lead researcher for Antarctica's Palmer Station LTER, talks to The Explorers Club about the changing state of our polar regions.

'Black and Bloom' Explores Algae's Role in Arctic Melting - Scientific American

Featured News - Mon, 07/18/2016 - 12:00
Scientific American talks with Lamont's Marco Tedesco, who studies melting on Greenland, about a new project exploring how microorganisms help determine the pace of Arctic melting.

Cyclones Set to Get Fiercer as World Warms - Climate News Network

Featured News - Sat, 07/16/2016 - 12:00
A new analysis of cyclone data and computer climate modeling, led by Lamont's Adam Sobel, Suzana Camargo, Allison Wing and Chia-Ying Lee, indicates that global warming is likely to intensify the destructive power of tropical storms.

Where Are the Hurricanes? - New York Times

Featured News - Fri, 07/15/2016 - 11:59
In an Op/Ed article in the New York Times, Lamont's Adam Sobel explains why hurricanes are likely to become more intense with climate change and how recent history fits scientists' expectations.

Extraordinary Years Now the Normal Years: Scientists Survey Radical Melt in Arctic - Washington Post

Featured News - Wed, 07/13/2016 - 18:32
A group of scientists studying a broad range of Arctic systems — from sea ice to permafrost to the Greenland ice sheet — gathered in D.C. to lay out just how extreme a year 2016 has been so far for the northern cap of the planet. “I see the situation as a train going downhill,” said Lamont's Marco Tedesco. “And the feedback mechanisms in the Arctic [are] the slope of your hill. And it gets harder and harder to stop it.”

Global Risks and Research Priorities for Coastal Subsidence - Eos

Featured News - Wed, 07/13/2016 - 12:00
The risk of rapid coastal subsidence to infrastructure and economies is global and is most acute in large river deltas, which are home to about 500 million people. An international community of researchers is calling attention to the need for better measurements and modeling and linking the science with its socioeconomic implications, Lamont's Michael Steckler and colleagues write.

How I learned to stop worrying and love subsampling (rarifying)

Chasing Microbes in Antarctica - Tue, 07/12/2016 - 12:53

I have had the 2014 paper “Waste Not, Want Not: Why Rarefying Microbiome Data is Inadmissible” by McMurdie and Holmes sitting on my desk for a while now. Yesterday I finally got around to reading it and was immediately a little skeptical as a result of the hyperbole with which they criticized the common practice of subsampling* libraries of 16S rRNA gene reads during microbial community analysis. The logic of that practice proceeds like this:

To develop 16S rRNA gene data (or any other marker gene data) describing a microbial community we generally collect an environmental sample, extract the DNA, amplify it, and sequence the amplified material.  The extract might contain the DNA from 1 billion microbial cells present in the environment.  Only a tiny fraction (<< 1 %) of these DNA molecules actually get sequenced; a “good” sequence run might contain only tens of thousands of sequences per sample.  After quality control and normalizing for multiple copies of the 16S rRNA gene it will contain far fewer.  Thus the final dataset contains a very small random sample of the original population of DNA molecules.

Most sequence-based microbial ecology studies involve some kind of comparison between samples or experimental treatments. It makes no sense to compare the abundance of taxa between two datasets of different sizes, as the dissimilarity between the datasets will appear much greater than it actually is. One solution is to normalize by dividing the abundance of each taxon by the total reads in the dataset to get its relative abundance. In theory this works great, but it has the disadvantage that it does not take into account that a larger dataset has sampled the original population of DNA molecules more deeply. Thus more rare taxa might be represented. A common practice is to reduce the amount of information present in the larger dataset by subsampling to the size of the smaller dataset. This attempts to approximate the case where both datasets undertake the same level of random sampling.
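The two normalization options described above can be sketched in a few lines of base R. This is a minimal illustration, not the workflow used for the data in this post: the read counts are made up, `subsample` is a hypothetical helper name, and reads are drawn without replacement by default (a `replace` argument is included because other choices are possible).

```r
set.seed(1)

## Made-up read counts for two libraries of different sizes.
small <- c(otu1 = 500, otu2 = 300, otu3 = 150, otu4 = 40, otu5 = 10, otu6 = 0)
large <- c(otu1 = 2600, otu2 = 1400, otu3 = 700, otu4 = 200, otu5 = 80, otu6 = 20)

## Option 1: relative abundance, i.e. each taxon divided by total reads.
rel.small <- small / sum(small)
rel.large <- large / sum(large)

## Option 2: subsample the larger library to the size of the smaller one.
## Counts are expanded to one entry per read, n reads are drawn at random,
## and the draw is tabulated back into per-taxon counts.
subsample <- function(counts, n, replace = FALSE) {
  reads <- rep(names(counts), times = counts)
  drawn <- sample(reads, size = n, replace = replace)
  tab <- table(factor(drawn, levels = names(counts)))
  setNames(as.vector(tab), names(counts))
}

large.sub <- subsample(large, n = sum(small))

## Both libraries now contain the same number of reads.
rbind(small, large.sub)
```

Note that the rarest taxon in the larger library (otu6 here) may or may not survive the subsampling, which is exactly the loss of information at issue.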

McMurdie and Holmes argue that this approach is indefensible for two common types of analysis: identifying differences in community structure between multiple samples or treatments, and identifying differences in abundance for specific taxa between samples and treatments. I think the authors do make a reasonable argument for the latter analysis; however, at worst the use of subsampling and/or normalization simply reduces the sensitivity of the analysis. I suspect that dissimilarity calculations between treatments or samples using realistic datasets are much less sensitive to reasonable subsampling than the authors suggest. I confess that I am (clearly) very far from being a statistician and there is a lot in McMurdie and Holmes, 2014 that I’m still trying to digest. I hope that our colleagues in statistics and applied mathematics continue optimizing these (and other) methods so that microbial ecology can improve as a quantitative science. There’s no need, however, to get everyone spun up without a good reason. To try to understand if there is a good reason, at least with respect to my data and analysis goals, I undertook some further exploration. I would strongly welcome any suggestions, observations, and criticisms to this post!

My read of McMurdie and Holmes is that the authors object to subsampling because it disregards data that is present that could be used by more sophisticated (i.e. parametric) methods to estimate the true abundance of taxa. This is true; data discarded from larger datasets does have information that can be used to estimate the true abundance of taxa among samples. The question is how much of a difference does it really make? McMurdie and Holmes advocate using methods adopted from transcriptome analysis. These methods are necessary for transcriptomes because 1) the primary objective of the study is not usually to see if one transcriptome is different from another, but which genes are differentially expressed and 2) I posit that the abundance distribution of transcript data is different from the taxa abundance distribution. An example of this can be seen in the plots below.

Taxon abundance, taken from a sample that I’m currently working with.













Transcript abundance by PFAM, taken from a sample (selected at random) from the MMETSP.

In both cases the data is log distributed, with a few very abundant taxa or PFAMs and many rare ones. What constitutes “abundant” in the transcript dataset, however, is very different from “abundant” in the community structure dataset. The transcript dataset is roughly half the size (n = 9,000 vs. n = 21,000); nonetheless, the most abundant transcript has an abundance of only 209. The most abundant OTU has an abundance of 6,589, and there are several very abundant OTUs. Intuitively this suggests to me that the structure of the taxon dataset is much more reproducible via subsampling than the structure of the transcript dataset, as the most abundant OTUs have a high probability of being sampled. The longer tail of the transcript data contributes to this as well, though of course this tail is controlled to a large extent by the classification scheme used (here PFAMs).

To get an idea of how reproducible the underlying structure was for the community structure and transcript data I repeatedly subsampled both (with replacement) to 3000 and 1300 observations, respectively. For the community structure data this is about the lower end of the sample size I would use in an actual analysis – amplicon libraries this small are probably technically biased and should be excluded (McMurdie and Holmes avoid this point at several spots in the paper). The results of subsampling are shown in these heatmaps, where each row is a new subsample. For brevity only columns summing to an abundance > 100 are shown.

Figures: OTU abundance heatmap and transcript abundance heatmap.

In these figures the warmer colors indicate a higher number of observations for that OTU/PFAM. The transcript data is a lot less reproducible than the community structure data; the Bray-Curtis dissimilarity across all iterations maxes out at 0.11 for community structure and 0.56 for the transcripts. The extreme case would be if the data were normally distributed (i.e. few abundant and few rare observations, many intermediate observations). Here’s what subsampling does to normally distributed data (n = 21,000, mean = 1000, sd = 200):


If you have normally distributed data don’t subsample!
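The reproducibility check described above (repeated subsampling, then the largest Bray-Curtis dissimilarity between any two iterations) can be sketched as follows. This is only an illustration under stated assumptions: the two abundance vectors are synthetic stand-ins rather than the OTU or PFAM data from this post, `subsample`, `bray` and `max.bray` are hypothetical helper names, and Bray-Curtis is computed by hand so the sketch needs nothing beyond base R.

```r
set.seed(42)

## Bray-Curtis dissimilarity between two count vectors.
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

## Draw n observations with replacement, with probability proportional to
## abundance, and tabulate the draw back into per-category counts.
subsample <- function(counts, n) {
  idx <- sample(seq_along(counts), size = n, replace = TRUE, prob = counts)
  tabulate(idx, nbins = length(counts))
}

## Largest pairwise Bray-Curtis dissimilarity across repeated subsamples.
max.bray <- function(counts, n, iter = 20) {
  subs <- replicate(iter, subsample(counts, n))  # one column per iteration
  pairs <- combn(iter, 2)
  max(apply(pairs, 2, function(p) bray(subs[, p[1]], subs[, p[2]])))
}

## Synthetic skewed community: a few dominant categories and a tail.
skewed <- round(10000 * 0.5^(0:11)) + 1

## Synthetic "normal" case: many categories of broadly similar abundance.
flat <- round(rnorm(200, mean = 1000, sd = 200))

bray.skewed <- max.bray(skewed, n = 3000)
bray.flat <- max.bray(flat, n = 3000)
c(skewed = bray.skewed, flat = bray.flat)
```

In a toy setup like this one would expect the skewed vector to give the smaller value, since the dominant categories are recovered in nearly the same proportions on every draw.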

For the rest of us it seems that, for the test dataset used here, community structure is at least somewhat reproducible via subsampling. There are differences between iterations, however. What does this mean in the context of the larger study?

The sample was drawn from a multiyear time series from a coastal marine site.  The next most similar sample (in terms of community composition and ecology) was, not surprisingly, a sample taken one week later.  By treating this sample in an identical fashion, then combining the two datasets, it was possible to evaluate how easy it is to tell the two samples apart after subsampling.  In this heatmap the row colors red and black indicate iterations belonging to the two samples:

Clustering of repeated subsamplings from two similar samples.  Sample identity is given by the red or black color along the y-axis.


As this heatmap shows, for these two samples there is perfect fidelity. Presumably with very similar samples this would start to break down; determining how similar samples need to be before they cannot be distinguished at a given level of subsampling would be a useful exercise. The authors attempt to do this in Simulation A/Figure 5 in the paper, but it isn’t clear to me why their results are so poor – particularly given very different sample types and a more sophisticated clustering method than I’ve applied here.
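A toy version of this two-sample test is sketched below. The two communities are invented (they are not the time-series samples discussed here), the helper names are hypothetical, and average-linkage clustering on hand-computed Bray-Curtis distances is an assumption made for the sake of a self-contained example rather than the clustering actually used for the heatmap.

```r
set.seed(7)

## Two invented, fairly similar communities over the same eight taxa.
sample.a <- c(6500, 3000, 1500, 800, 400, 200, 100, 50)
sample.b <- c(5200, 3900, 1700, 600, 500, 150, 120, 60)

## Draw n observations with replacement and tabulate per-taxon counts.
subsample <- function(counts, n) {
  idx <- sample(seq_along(counts), size = n, replace = TRUE, prob = counts)
  tabulate(idx, nbins = length(counts))
}

## Ten subsamples of each sample, one subsample per row.
iter <- 10
subs <- rbind(t(replicate(iter, subsample(sample.a, 3000))),
              t(replicate(iter, subsample(sample.b, 3000))))
origin <- rep(c("a", "b"), each = iter)

## Pairwise Bray-Curtis dissimilarity between all rows.
bray.dist <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(abs(m[i, ] - m[j, ])) / sum(m[i, ] + m[j, ])
    }
  }
  as.dist(d)
}

## Cluster the subsamples and cut the tree into two groups.
clust <- hclust(bray.dist(subs), method = "average")
groups <- cutree(clust, k = 2)

## Perfect fidelity means each sample of origin falls in exactly one group.
table(origin, groups)
```

Making `sample.b` progressively closer to `sample.a`, or lowering the subsampling depth, is one way to probe where this separation starts to fail.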

As a solution – necessary for transcript data, normally distributed data, or for analyses of differential abundance, probably less essential for comparisons of community structure – the authors propose a mixture model approach that takes into account variance across replicates to estimate “real” OTU abundance. Three R packages that can do this are mentioned: edgeR, DESeq, and metagenomeSeq. The problem with these methods – as I understand them – is that they require experimental replicates. According to the DESeq authors, technical replicates should be summed, and samples should be assigned to treatment pools (e.g. control, treatment 1, treatment 2…). Variance is calculated within each pool and this is used to model differences between pools. This is great for a factor-based analysis, as is common in transcriptome analysis or human microbial ecology studies. If you want to find a rare, potentially disease-causing strain differently present between a healthy control group and a symptomatic experimental group for example, this is a great way to go about it.

There are many environmental studies for which these techniques are not useful, however, as it may be impractical to collect experimental replicates. For example it is both undesirable and impractical to conduct triplicate or even duplicate sampling in studies focused on high-resolution spatial or temporal sampling. Sequencing might be cheap now, but time and reagents are not. Some of the methods I’m working with are designed to aggregate samples into higher-level groups – at which point these methods could be applied by treating within-group samples as “replicates” – but this is only useful if we are interested in testing differential abundance between groups (and doesn’t solve the problem of needing to get samples grouped in the first place).

These methods can be used to explore differential abundance in non-replicated samples; however, they are grossly underpowered when used without replication. Here’s an analysis of differential abundance between the sample in the first heatmap above and its least similar companion from the same (ongoing) study using DESeq. You can follow along with any standard abundance table where the rows are samples and the columns are variables.

    library(DESeq)

    ## DESeq wants data oriented opposite to how community abundance
    ## data is normally presented (e.g. to vegan)
    data.dsq <- t(data)

    ## DESeq requires a set of conditions which are factors. Ideally
    ## this would be control and treatment groups, or experimental pools
    ## or some such, but we don't have that here. So the conditions are
    ## unique column names (which happen to be dates).
    conditions <- factor(as.character(colnames(data.dsq)))

    ## As a result of 16S rRNA gene copy number normalization abundance
    ## data is floating point numbers, convert to integers.
    data.dsq <- ceiling(data.dsq)

    ## Now start working with DESeq. The count matrix passed here is
    ## assumed to be data.dsq, the integer table built above.
    data.ct <- newCountDataSet(data.dsq, conditions = conditions)
    data.size <- estimateSizeFactors(data.ct)

    ## This method and sharing mode is required for unreplicated samples.
    data.disp <- estimateDispersions(data.size, method = 'blind', sharingMode = "fit-only")

    ## And now we can execute a test of differential abundance for the
    ## two samples used in the above example.
    test <- nbinomTest(data.disp, '2014-01-01', '2014-01-06')
    test <- na.omit(test)

    ## Plot the results.
    plot(test$baseMeanA, test$baseMeanB,
         #log = 'xy',
         pch = 19,
         ylim = c(0, max(test$baseMeanB)),
         xlim = c(0, max(test$baseMeanA)),
         xlab = '2014-01-01',
         ylab = '2009-12-14')

    abline(0, 1)

    ## If we had significant differences we could then plot them like this:
    points(test$baseMeanA[which(test$pval < 0.05)],
           test$baseMeanB[which(test$pval < 0.05)],
           pch = 19,
           col = 'red')



As we would expect there are quite a few taxa present in high abundance in one sample and not the other; however, none of the associated p-values are anywhere near significant. I’m tempted to try to use subsampling to create replicates, which would allow an estimate of variance across subsamples and access to greater statistical power. This is clearly not as good as real biological replication, but we have to work within the constraints of our experiments, time, and funding…

*You might notice that I’ve deliberately avoided using the terms “microbiome” and “rarefying” here. In one of his comics Randall Munroe asserted that the number of made-up words in a book is inversely proportional to the quality of the book. Similarly, I strongly suspect that the validity of a sub-discipline is inversely proportional to the number of jargonistic words that community makes up to describe its work. As a member of said community I have to ask: what’s wrong with “subsampling” and “microbial community”?

New Earthquake Threat Could Lurk Under 140 Million People - National Geographic

Featured News - Mon, 07/11/2016 - 12:00
A megathrust fault could be lurking underneath Myanmar, Bangladesh, and India, exposing millions of people to the risk of a major earthquake, according to research led by Lamont's Michael Steckler.

Scientists Find Glacier Bay Landslide Still Active Days Later - KHNS Radio

Featured News - Fri, 07/08/2016 - 12:00
Lamont's Colin Stark visited the Glacier Bay landslide and said closer inspection revealed two big discoveries: the slide was still active days later, and the original landslide was so powerful it pushed rock and dirt up the sides of the valley almost 300 feet.

Measured Breath: How Best to Monitor Pollution - WNYC

Featured News - Thu, 07/07/2016 - 15:52
It's the second summer for the Biking While Breathing project which looks at the impact of air pollution on exercise in New York City. This year, researchers are considering going cheap. Cites Steve Chillrud's work.

Exploring genome content and genomic character with paprica and R

Chasing Microbes in Antarctica - Thu, 07/07/2016 - 13:07

The paprica pipeline was designed to infer the genomic content and genomic characteristics of a set of 16S rRNA gene reads.  To enable this the paprica database organizes this information by phylogeny for many of the completed genomes in Genbank.  In addition to metabolic inference this provides an opportunity to explore how genome content and genomic characteristics are organized phylogenetically.  The following is a brief analysis of some genomic features using the paprica database and R.  If you aren’t familiar with the paprica database this exercise will also familiarize you with some of its content and its organization.

The paprica pipeline and database can be obtained from Github here.  In this post I’ll be using the database associated with version 0.3.1.  The necessary files from the bacteria database (one could also conduct this analysis on the much smaller archaeal database) can be read into R as such:

    ## Read in the pathways associated with the terminal nodes on the reference tree
    path <- read.csv('paprica/ref_genome_database/bacteria/terminal_paths.csv', row.names = 1)

    ## Set missing values to 0 (assumes the intent was to index on is.na(path);
    ## a bare path[] <- 0 would zero the whole table)
    path[is.na(path)] <- 0

    ## Read in the data associated with all completed genomes in Genbank
    data <- read.csv('paprica/ref_genome_database/bacteria/', row.names = 1)

    ## During database creation genomes with duplicate 16S rRNA genes were removed,
    ## so limit to those that were retained
    data <- data[row.names(data) %in% row.names(path),]

    ## "path" is ordered by clade, meaning it is in top to bottom order of the reference tree,
    ## however, "data" is not, so order it here
    data <- data[order(data$clade),]

One fun thing to do at this point is to look at the distribution of metabolic pathways across the database.  To develop a sensible view it is best to cluster the pathways according to which genomes they are found in.

    ## The pathway file in the database is binary, so we use Jaccard for distance
    library('vegan')
    path.dist <- vegdist(t(path), method = 'jaccard') # distance between pathways (not samples!)
    path.clust <- hclust(path.dist)
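To make the distance calculation above concrete, here is a small sketch of Jaccard distance on a presence/absence matrix. The matrix is invented and `jaccard` is a hypothetical helper. Base R's `dist(method = "binary")` is used in place of vegan so the example has no dependencies; for strictly binary data it should give the same values as the Jaccard index.

```r
## Invented presence/absence table: rows are genomes, columns are pathways.
toy <- matrix(c(1, 1, 0, 0,
                1, 1, 0, 1,
                1, 0, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("genome", 1:3), paste0("pathway", 1:4)))

## Jaccard distance between two binary vectors:
## 1 - (positions present in both) / (positions present in either).
jaccard <- function(x, y) 1 - sum(x & y) / sum(x | y)

## We want distances between pathways, so compare columns; this is why
## the table is transposed before the distance is calculated.
d.by.hand <- jaccard(toy[, "pathway1"], toy[, "pathway2"])

## The same quantity for every pair of pathways, then clustering.
toy.dist <- dist(t(toy), method = "binary")
toy.clust <- hclust(toy.dist)

d.by.hand
as.matrix(toy.dist)["pathway1", "pathway2"]
```

Pathways found in exactly the same genomes have a distance of 0 and pathways that never co-occur have a distance of 1, so clustering on this distance groups pathways by their distribution across genomes.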

The heatmap function is a bit cumbersome for this large matrix, so the visualization can be made using the image function.

    ## Set a binary color scheme
    image.col <- colorRampPalette(c('white', 'blue'))(2)

    ## Image will order matrix in ascending order, which is not what we want here!
    image(t(data.matrix(path))[rev(path.clust$order),length(row.names(path)):1],
          col = image.col,
          ylab = 'Genome',
          xlab = 'Pathway',
          xaxt = 'n',
          yaxt = 'n')

    box()

The distribution of metabolic pathways across all 3,036 genomes in the v0.3.1 paprica database.

There are a couple of interesting things to note in this plot.  First, we can see the expected distribution of core pathways present in nearly all genomes, and the interesting clusters of pathways that are unique to a specific lineage.  For clarity row names have been omitted from the above plot, but from within R you can pull out the taxa or pathways that interest you easily enough.  Second, there are some genomes that have very few pathways.  There are a couple of possible reasons for this that can be explored with a little more effort.  One possibility is that these are poorly annotated genomes, or at least the annotation didn’t associate many or any coding sequences with either EC numbers or GO terms – the two pieces of information Pathway-Tools uses to predict pathways during database construction.  Another possibility is that these genomes belong to obligate symbionts (either parasites or beneficial symbionts).  Obligate symbionts often have highly streamlined genomes and few complete pathways.  We can compare the number of pathways in each genome to other genome characteristics for additional clues.
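
As a rough sketch of how one might pull out the pathways or genomes of interest, the example below works on a small made-up matrix with the same layout as path (genomes in rows, pathways in columns).  The names and the cutoffs are assumptions chosen for illustration, not values taken from the database.

```r
## Toy presence/absence matrix (genomes in rows, pathways in columns)
toy.path <- data.frame(pwy_core = c(1, 1, 1, 1),
                       pwy_common = c(1, 1, 0, 1),
                       pwy_rare = c(0, 1, 0, 0),
                       row.names = c('genome_a', 'genome_b', 'genome_c', 'genome_d'))

## Fraction of genomes carrying each pathway
pathway.freq <- colMeans(toy.path)

## "Core" pathways: present in at least 95 % of genomes (arbitrary cutoff)
core.pathways <- names(pathway.freq)[pathway.freq >= 0.95]

## Pathways restricted to a single genome
unique.pathways <- names(pathway.freq)[colSums(toy.path) == 1]

## Genomes with very few pathways (cutoff again arbitrary)
n.paths <- rowSums(toy.path)
sparse.genomes <- names(n.paths)[n.paths <= 1]

print(core.pathways)
print(unique.pathways)
print(sparse.genomes)
```

The same three lines of indexing should work on the full path object, with cutoffs adjusted to the size of the database.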

A reasonable assumption is that the number of pathways in each genome should scale with the size of the genome.  Large genomes with few predicted pathways might indicate places where the annotation isn’t compatible with the pathway prediction methods.

## Plot the number of pathways as a function of genome size
plot(rowSums(path) ~ data$genome_size,
     ylab = 'nPaths',
     xlab = 'Genome size')
## Plot P. ubique as a reference point
select <- grep('Pelagibacter ubique HTCC1062', data$organism_name)
points(rowSums(path)[select] ~ data$genome_size[select],
       pch = 19,
       col = 'red')

The number of metabolic pathways predicted as a function of genome size for the genomes in the paprica database.

That looks pretty good.  For the most part more metabolic pathways were predicted for larger genomes, but there are some exceptions.  The red point gives the location of Pelagibacter ubique HTCC1062.  This marine bacterium is optimized for life under energy-limited conditions.  Among its adaptations is a highly efficient and streamlined genome.  In fact it has the smallest genome of any known free-living bacterium.  All the points to its left on the x-axis (that is, with smaller genomes) are obligate symbionts; these are dependent on their host for some of their metabolic needs.  There are a few larger genomes that have very few (or even no) pathways predicted.  These are the genomes with bad, or at least incompatible, annotations (or particularly peculiar biochemistry).
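
One simple way to pick out the large genomes with suspiciously few pathways is to look at pathway density.  The sketch below uses made-up organisms and an arbitrary cutoff (10 % of the median density); both are assumptions for illustration, and a residual-based approach on the real data may well be preferable.

```r
## Toy table of genome size (bp) and number of predicted pathways
toy <- data.frame(organism = c('org_a', 'org_b', 'org_c', 'org_d', 'org_e'),
                  genome_size = c(1.3e6, 2.0e6, 4.5e6, 5.0e6, 6.0e6),
                  n_paths = c(150, 200, 300, 10, 350),
                  stringsAsFactors = FALSE)

## Pathways predicted per megabase of genome
toy$paths_per_mbp <- toy$n_paths / (toy$genome_size / 1e6)

## Flag genomes whose pathway density is far below the typical value
cutoff <- 0.1 * median(toy$paths_per_mbp)
flagged <- toy[toy$paths_per_mbp < cutoff, ]

print(flagged)
```

Here only org_d is flagged: it has a 5 Mbp genome but just 10 predicted pathways, the pattern expected from an incompatible annotation.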

The other genome parameters in paprica are the number of coding sequences identified (nCDS), the number of genetic elements (nge), the number of 16S rRNA gene copies (n16S), GC content (GC), and phi, a measure of genomic plasticity.  We can make another plot to show the distribution of these parameters with respect to phylogeny.

## Grab only the data columns we want
data.select <- data[,c('n16S', 'nge', 'ncds', 'genome_size', 'phi', 'GC')]
## Make the units somewhat comparable on the same scale, a more
## careful approach would log-normalize some of the units first
data.select <- decostand(data.select, method = 'standardize')
data.select <- decostand(data.select, method = 'range')
## Plot with a heatmap
heat.col <- colorRampPalette(c('blue', 'white', 'red'))(100)
heatmap(data.matrix(data.select),
      margins = c(10, 20),
      col = heat.col,
      Rowv = NA,
      Colv = NA,
      scale = NULL,
      labRow = 'n',
      cexCol = 0.8)

Genomic parameters organized by phylogeny.
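
The code comment above mentions that a more careful approach would log-normalize some of the units first.  A minimal sketch of that idea in base R follows; the toy values, the choice of which columns to log-transform, and the range01 helper are all assumptions made for illustration.

```r
## Toy genome parameters; genome_size and ncds span orders of magnitude
toy.select <- data.frame(genome_size = c(1e5, 1e6, 1e7),
                         ncds = c(100, 1000, 10000),
                         GC = c(30, 50, 70))

## Log-transform only the columns that span orders of magnitude
log.cols <- c('genome_size', 'ncds')
toy.select[, log.cols] <- log10(toy.select[, log.cols])

## Rescale every column to the 0-1 range (hypothetical helper)
range01 <- function(x) (x - min(x)) / (max(x) - min(x))
toy.scaled <- as.data.frame(lapply(toy.select, range01))

print(toy.scaled)
```

After the log transform the middle genome sits halfway along the scale for every parameter, rather than being squeezed toward zero by the largest genome.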

Squinting at this plot it looks like GC content and phi are potentially negatively correlated, which could be quite interesting.  These two parameters can be plotted to get a better view:

plot(data.select$phi ~ data.select$GC,
     xlab = 'GC',
     ylab = 'phi')

The phi parameter of genomic plasticity as a function of GC content.

Okay, not so much… but I think the pattern here is pretty interesting.  Above a GC content of 50 % there appears to be no relationship, but these parameters do seem correlated for low GC genomes.  This can be evaluated with linear models for genomes above and below 50 % GC.

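## Fit separate linear models for genomes above and below 50 % GC
gc.phi.above50 <- lm(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)])
gc.phi.below50 <- lm(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)])
summary(gc.phi.above50)
summary(gc.phi.below50)
## Set up an empty plot, then add each group and its regression line
plot(data.select$phi ~ data.select$GC,
     xlab = 'GC',
     ylab = 'phi',
     type = 'n')
points(data.select$phi[which(data.select$GC >= 50)] ~ data.select$GC[which(data.select$GC >= 50)],
       col = 'blue')
points(data.select$phi[which(data.select$GC < 50)] ~ data.select$GC[which(data.select$GC < 50)],
       col = 'red')
abline(gc.phi.above50,
       col = 'blue')
abline(gc.phi.below50,
       col = 'red')
legend('bottomleft',
       bg = 'white',
       legend = c('GC >= 50',
                  'GC < 50'),
       col = c('blue', 'red'),
       pch = 1)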

Genomic plasticity (phi) as a function of GC content for all bacterial genomes in the paprica database.

As expected there is no correlation between genomic plasticity and GC content for the high GC genomes (R² = 0) and a highly significant correlation for the low GC genomes (albeit with weak predictive power; R² = 0.106, p = 0).  So what’s going on here?  Low GC content is associated with particular microbial lineages but also with certain ecologies.  The free-living low-energy specialist P. ubique HTCC1062 has a low GC content genome for example, as do many obligate symbionts regardless of their taxonomy (I don’t recall if it is known why this is).  Both groups are associated with a high degree of genomic modification, including genome streamlining and horizontal gene transfer.
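
For anyone repeating this comparison, the R² and p-values can be pulled out of the model summaries directly rather than read off the printed output.  The sketch below does this for a made-up data set split at 50 % GC; the values and the fit.stats helper are hypothetical and only meant to show the mechanics.

```r
## Toy data: phi declines with GC below 50 %, and is flat above it
toy <- data.frame(GC = c(30, 35, 40, 45, 55, 60, 65, 70),
                  phi = c(0.90, 0.75, 0.65, 0.50, 0.50, 0.60, 0.50, 0.60))

## Fit phi ~ GC and return the slope, R-squared and slope p-value
fit.stats <- function(df) {
  fit <- summary(lm(phi ~ GC, data = df))
  c(slope = fit$coefficients['GC', 'Estimate'],
    r.squared = fit$r.squared,
    p.value = fit$coefficients['GC', 'Pr(>|t|)'])
}

stats.below50 <- fit.stats(toy[toy$GC < 50, ])
stats.above50 <- fit.stats(toy[toy$GC >= 50, ])

print(rbind(below50 = stats.below50, above50 = stats.above50))
```

Extracting the p-value this way also shows its actual magnitude in cases where the printed summary would round it to 0.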

As Glaciers Melt in Alaska, Landslides Follow - New York Times

Featured News - Tue, 07/05/2016 - 12:00
Cites work by Colin Stark and Göran Ekström.

What Triggered the Massive Glacier Bay Landslide? - CS Monitor

Featured News - Tue, 07/05/2016 - 12:00
Seismic recordings registered a massive landslide in Alaska's Glacier Bay National Park, and scientists are studying how the region's geology and environmental change are elevating the risk of mountain landslides. Cites work by Colin Stark and Göran Ekström.

Massive Landslide Crashes onto Glacier in Southeast Alaska - Alaska Dispatch

Featured News - Sat, 07/02/2016 - 12:00
More than 100 million tons of rock slid down a mountainside in Southeast Alaska on Tuesday morning, sending debris miles across a glacier below and a cloud of dust into the air. Lamont's Colin Stark and colleagues analyzed the landslide through its seismic waves.

Crippled Atlantic Conveyor Linked to Ice Age Climate Change - Science

Featured News - Thu, 06/30/2016 - 14:55
Slowdowns of the Atlantic meridional overturning circulation have long been suspected as a cause of the climate swings during the last ice age, but never definitively shown, until now. The new study “is the best demonstration that this indeed happened,” says Lamont's Jerry McManus.


