News aggregator

Phylogenetic placement re-re-visited

Chasing Microbes in Antarctica - Fri, 05/13/2016 - 15:48

Disclaimer: I banged this out fast from existing scripts to help some folks, but haven’t tested it yet.  Will do that shortly; in the meantime, be careful!


I use phylogenetic placement, namely the program pplacer, in a lot of my publications.  It is also a core part of the paprica metabolic inference pipeline.  As a result I field a lot of questions from people trying to integrate pplacer into their own workflows.  Although the Matsen group has done an excellent job with documentation for pplacer, guppy, and taxtastic (the three programs you need to work with to do phylogenetic placement from start to finish; see also EPA), there is still a steep learning curve for new users.  In the hope of bringing the angle of that curve down a notch or two, and of updating my previous posts on the subject (here and here), here is a complete, start-to-finish example of phylogenetic placement, using 16S rRNA gene sequences corresponding to the new tree of life recently published by Hug et al.  To follow along with the tutorial, start by downloading the sequences here.

You can use any number of alignment and tree-building programs to create a reference tree for phylogenetic placement.  I strongly recommend using RAxML and Infernal.  After a lot of experimentation this combination seems to produce the most correct topologies and best-supported trees.  You should be aware that no 16S rRNA gene tree (or any other tree) is absolutely “correct” for domain-level, let alone life-level, analyses, but nothing in life is perfect.  While you’re installing software I also recommend the excellent utility Seqmagick.  Finally, you will need a covariance model of the 16S rRNA gene to feed to Infernal.  You can find that at the Rfam database here.
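Before going any further, it can save some frustration to confirm that all of these tools are actually on your PATH.  A quick sanity check (the program names match the commands used in this tutorial; adjust the RAxML name if your binary was compiled differently):

```shell
# Report any required program that is not installed or not callable.
for prog in seqmagick cmalign esl-alimerge raxmlHPC-PTHREADS-AVX2 taxit pplacer guppy; do
    command -v "$prog" > /dev/null || echo "missing: $prog"
done
```

Anything printed as "missing" needs to be installed (or added to your PATH) before you continue.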

The workflow will follow these steps:

  1. Create an alignment of the reference sequences with Infernal
  2. Create a phylogenetic tree of the alignment
  3. Create a reference package from the alignment, tree, and stats file
  4. Proceed with the phylogenetic placement of your query reads

Create an alignment of the reference sequences

The very first thing that you need to do is clean your sequence names of any wonky punctuation.  This is something that trips up almost everyone.  You should really have no punctuation in the names beyond “_”, and no spaces!

tr " -" "_" < hug_tol.fasta | tr -d "%\,;():=.\\*[]\"\'" > hug_tol.clean.fasta
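To see exactly what this does, here is the same pipeline applied to a single made-up header (the sequence name is hypothetical): spaces and hyphens become underscores, and the listed punctuation is deleted outright.

```shell
# Demonstrate the name-cleaning pipeline on one hypothetical FASTA header.
# Spaces and "-" are translated to "_"; the punctuation in the second tr
# call is deleted entirely.
echo ">Candidatus Foo-bar; str. (ABC)=1.2" | tr " -" "_" | tr -d "%\,;():=.\\*[]\"\'"
```

The header above comes out as `>Candidatus_Foo_bar_str_ABC12`, which is safe for every downstream program.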

Next create an alignment from the cleaned file.  I always like to degap first, although it shouldn’t matter.

## Degap
seqmagick mogrify --ungap hug_tol.clean.fasta

## Align against the 16S rRNA covariance model; 16S_rRNA.cm stands in for whatever filename you downloaded from Rfam
cmalign --dna -o hug_tol.clean.align.sto --outformat Pfam 16S_rRNA.cm hug_tol.clean.fasta

## Convert to fasta format
seqmagick convert hug_tol.clean.align.sto hug_tol.clean.align.fasta

Build the reference tree

At this point you should have a nice clean DNA alignment in fasta format.  Now feed it to RAxML to build a tree.  Depending on the size of the alignment this can take a little bit.  I know you’ve read the entire RAxML manual, so of course you are already aware that adding additional CPUs won’t help…

raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -s hug_tol.clean.align.fasta -n ref.tre -f d -p 12345

I like having a sensibly rooted tree; it’s just more pleasing to look at.  You can do this manually, or you can have RAxML try to root the tree for you.

raxmlHPC-PTHREADS-AVX2 -T 2 -m GTRGAMMA -f I -t RAxML_bestTree.ref.tre -n root.ref.tre

Okay, now comes the tricky bit.  Clearly you’d like to have some support values on your reference tree, but the Taxtastic program that we will use to build the reference package won’t be able to read the RAxML stats file if it includes confidence values.  The workaround is to build a second tree with confidence values.  You will feed this tree to Taxtastic along with the stats file from the tree we already generated.

## Generate confidence scores for the tree
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -f J -p 12345 -t RAxML_rootedTree.root.ref.tre -n conf.root.ref.tre -s hug_tol.clean.align.fasta

Now we can use the alignment, the rooted tree with confidence scores, and the stats file without confidence scores to create our reference package.

taxit create -l 16S_rRNA -P hug_tol.refpkg --aln-fasta hug_tol.clean.align.fasta --tree-stats RAxML_info.ref.tre --tree-file RAxML_fastTreeSH_Support.conf.root.ref.tre

Align the query reads

At this point you have the reference package and you can proceed with analyzing some query reads!  The first step is to align the query reads in exactly the same fashion as the reference sequences.  This is important as the alignments will be merged later.

## Clean the names
tr " -" "_" < query.fasta | tr -d "%\,;():=.\\*[]\"\'" > query.clean.fasta

## Remove any gaps
seqmagick mogrify --ungap query.clean.fasta

## Align, using the same covariance model as for the reference; 16S_rRNA.cm stands in for whatever filename you downloaded from Rfam
cmalign --dna -o query.clean.align.sto --outformat Pfam 16S_rRNA.cm query.clean.fasta

Now we use the esl-alimerge command, included with Infernal, to merge the query and reference alignments.

## Merge alignments
esl-alimerge --outformat pfam --dna -o query.hug_tol.clean.align.sto query.clean.align.sto hug_tol.refpkg/hug_tol.clean.align.sto

## Convert to fasta
seqmagick convert query.hug_tol.clean.align.sto query.hug_tol.clean.align.fasta

Phylogenetic placement

Now we’re on the home stretch: we can execute the phylogenetic placement itself!  The flags are important here, so it’s worth checking the pplacer documentation to ensure that your goals are consistent with mine (get a job, publish some papers?).  You can probably accept most of the flags for the previous commands as is.

pplacer -o query.hug_tol.clean.align.jplace -p --keep-at-most 20 -c hug_tol.refpkg query.hug_tol.clean.align.fasta

At this point you have a file named query.hug_tol.clean.align.jplace.  You will need to use guppy to convert this json-format file into something human-readable.  The two most useful guppy commands (in my experience) for a basic look at your data are:

## Generate an easily parsed csv file of placements, with only a single placement reported for each query read.
guppy to_csv --point-mass --pp -o query.hug_tol.clean.align.csv query.hug_tol.clean.align.jplace

## Generate a phyloxml tree with edges fattened according to the number of placements.
guppy fat --node-numbers --point-mass --pp -o query.hug_tol.clean.align.phyloxml query.hug_tol.clean.align.jplace
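If you just want a quick tabulation of where your reads landed, standard shell tools are enough.  A minimal sketch using a mock csv (the column names mirror guppy's to_csv header, but the rows here are invented for illustration; on real data you would point awk at query.hug_tol.clean.align.csv instead):

```shell
# Mock guppy to_csv output so the sketch is self-contained; the rows are
# invented, only the column names matter.
cat > placements.csv <<'EOF'
origin,name,multiplicity,edge_num,like_weight_ratio,post_prob
q.jplace,read1,1,12,0.98,0.99
q.jplace,read2,1,12,0.95,0.97
q.jplace,read3,1,7,0.90,0.92
EOF

# Count how many reads were placed on each edge of the reference tree,
# locating the edge_num column by name so the command survives changes
# in column order.
awk -F, 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == "edge_num") c = i; next }
         { n[$c]++ }
         END { for (e in n) print e "," n[e] }' placements.csv
```

The edge numbers can then be matched to the fattened phyloxml tree produced by guppy fat (note the --node-numbers flag).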

New Mercury Maps Showcase Planet's Striking Features

Featured News - Mon, 05/09/2016 - 12:00
The first global digital-elevation model of Mercury reveals a striking landscape of basins and lava plains. Lamont Director Sean Solomon was principal investigator on the MESSENGER mission and discussed the data MESSENGER captured.

Slow-Motion Earthquakes May Also Lead to Tsunamis - Business Standard

Featured News - Fri, 05/06/2016 - 10:26
Slow-motion earthquakes or "slow-slip events" can rupture the shallow portion of a fault that also moves in large, tsunami-generating earthquakes. A new study involving Lamont's Spahr Webb examines a slow-slip event off New Zealand.

Maureen Raymo Elected to National Academy of Sciences - National Academy of Sciences

Featured News - Tue, 05/03/2016 - 16:21
Marine geologist and paleoceanographer Maureen Raymo was among 84 scientists elected for membership in the National Academy of Sciences, one of the highest honors awarded to engineers and scientists in the United States.

Inside West Virginia's Battle Over Teaching Climate Change - Climate Wire

Featured News - Fri, 04/29/2016 - 12:00
In a coal state struggling with environmental regulations and a fiscal crisis, teaching climate science has hit a nerve. Climate Wire spoke with Lamont Special Research Scientist Kim Kastens.

No Way the Great Barrier Reef Was Bleached Naturally - Washington Post

Featured News - Fri, 04/29/2016 - 11:20
Climate change dramatically upped the odds of severe coral bleaching of the Great Barrier Reef, researchers say. Lamont's Adam Sobel discussed the findings with the Washington Post.

Video: Peter deMenocal on Why Climate Matters - Talks@Columbia

Featured News - Thu, 04/28/2016 - 17:22
Climate change is one of the most complex and difficult challenges facing the world, and one of the most divisive. In this video, Lamont's Peter deMenocal discusses how climate is changing today and why.

Robotic Laboratories Fan Out to Study the Seas - Scientific American

Featured News - Thu, 04/28/2016 - 17:15
"They look like R2-D2 in swim floaties, but they could revolutionize ocean science." Lamont's Kyle Frischkorn writes about the new wave of marine robots.

Killer Landslides: The Lasting Legacy of the Nepal Earthquake - Scientific American

Featured News - Mon, 04/25/2016 - 12:33
A year after a devastating earthquake triggered killer avalanches and rock falls in Nepal, scientists are wiring up mountainsides to forecast hazards. Scientific American talks with Lamont's Colin Stark.

New York City Leads Investment in Climate Change Preparation - CCTV

Featured News - Fri, 04/22/2016 - 12:00
CCTV talked with Lamont's Klaus Jacob about how New York City is bracing for the effects of climate change and whether it is doing enough.

What Is the Climate Innovation Gap? - PBS SciTech Now

Featured News - Thu, 04/21/2016 - 12:00
Over the last decade, federal spending on research and development as a percentage of our country’s GDP has been declining. PBS SciTech Now talks with Lamont's Peter deMenocal.

Fate of World's Coasts Rests on Melting Ice - Scientific American

Featured News - Thu, 04/21/2016 - 12:00
Lamont's Maureen Raymo talks about the value of determining the heights of prehistoric shorelines for projecting future sea level rise.

This Is How Surfers Are Helping Fund Climate Science - Climate Central

Featured News - Wed, 04/20/2016 - 12:00
The World Surf League has created a unique partnership with climate scientists at Lamont that could help the sport, the ocean and spur a new research model.

The Mad Dash to Figure Out the Fate of Peatlands - Smithsonian Magazine

Featured News - Wed, 04/20/2016 - 12:00
As the planet’s peat swamps come under threat, the destiny of their stored carbon remains a mystery. Lamont's Jonathan Nichols takes the Smithsonian on a tour of the challenge.

Ice a Surprising Heat Source on Jupiter's Europa - Cosmos Magazine

Featured News - Mon, 04/18/2016 - 16:44
Constant gravitational pressures on the icy surface of Jupiter’s moon Europa generate much more heat than previously thought, which may force a rethink about the chemistry of the liquid water ocean below the surface, says Lamont's Christine McCarthy.

What Color Is Greenland? (Di che colore è la Groenlandia?) - La Repubblica

Featured News - Sun, 04/17/2016 - 12:00
In a column appearing in Italy's La Repubblica, Lamont's Marco Tedesco discusses the darkening of Greenland and how that contributes to a cycle of melting. The column is written in Italian.

What Loss of Snowpack Means for Water Supplies - The Desert Sun

Featured News - Thu, 04/14/2016 - 10:30
Global warming will require big changes in how we manage water, the Desert Sun writes. “In general, what a measure like this is telling us is that our historical reliance on snow is untenable in a future climate," said Lamont's Justin Mankin.

It’s April, and Scientists Are Already Stunned by Greenland’s Melting - Washington Post

Featured News - Wed, 04/13/2016 - 09:11
The vast Greenland ice sheet is seeing a record-breaking level of melt for so early in the season. “The potential implications, in terms of runoff and so on, they alter the memory of the snowpack, the potential implications can be big either for the same season or future seasons,” said Lamont's Marco Tedesco.

Another victim of science funding

Chasing Microbes in Antarctica - Tue, 04/12/2016 - 15:27

Or rather the lack thereof.  I was very disappointed to receive an email yesterday announcing that BioCyc, a popular database of enzymes and metabolic pathways in model organisms, is moving to a subscription model.  The email is posted below in its entirety (to assist in the request to spread the word).  With this move BioCyc cements the pattern established several years ago by the Kyoto Encyclopedia of Genes and Genomes (KEGG).  I don’t begrudge the architects of these databases for moving behind a paywall; as the BioCyc principal investigator Peter Karp notes below, developing and maintaining a high-quality biological database is time and resource intensive, and skilled curators are expensive.  I do think, however, that moving these resources to a subscription-based model does a disservice to the research community and is not in the public interest.  While it was not the responsibility of US funding agencies to ensure the long-term viability and accessibility of KEGG, BioCyc is in their court.  In my opinion the failure of NSF and NIH to adequately support community resources amounts to a dereliction of duty.

As noted in the letter below, one of the challenges faced by database designers and curators is the tradition of peer review.  At its best the peer review process is an effective arbiter of research spending.  I think the peer review process is at its best only under certain funding regimes, however, and I suspect that the current low rates of federal funding for science do not allow for fair and unbiased peer review.  This is particularly the case for databases and projects whose return on investment is not evident in the short term in one or two high-impact publications, but in the supporting role played in dozens or hundreds or thousands of studies across the community over many years.  Program officers, managers, and directors need to be cognizant of the limitations of the peer review process, and should not shy away from some strategic thinking every now and again.

My first experience with BioCyc came in my first year of graduate school, when I was tentatively grappling with the relationship between gene and genome functions and the ecology of cold-adapted microbes.  Like many academic labs around the country, the lab I was in was chronically short on funds, and an academic license for expensive software (e.g. CLC Workbench) or a database subscription (à la KEGG or BioCyc; both were thankfully free back in those days) would have been out of the question.  Without these tools I simply would have had nowhere to start my exploration.  I fear that the subscription model creates intellectual barriers, potentially partitioning good ideas (which might arise anywhere) from the tools required to develop them (which will only be found in well-funded or specialist labs).

Viva community science!


Dear Colleague,

I am writing to request your support as we begin a new chapter in the development of BioCyc.

In short, we plan to upgrade the curation level and quality of many BioCyc databases to provide you with higher quality information resources for many important microbes, and for Homo sapiens.  Such an effort requires large financial resources that — despite numerous attempts over numerous years — have not been forthcoming from government funding agencies.  Thus, we plan to transition BioCyc to a community-supported non-profit subscription model in the coming months.

Our Goal

Our goal at BioCyc is to provide you with the highest quality microbial genome and metabolic pathway web portal in the world by coupling unique and high-quality database content with powerful and user-friendly bioinformatics tools.

Our work on EcoCyc has demonstrated the way forward.  EcoCyc is an incredibly rich and detailed information resource whose contents have been derived from 30,000 E. coli publications.  EcoCyc is an online electronic encyclopedia, a highly structured queryable database, a bioinformatics platform for omics data analysis, and an executable metabolic model.  EcoCyc is highly used by the life-sciences community, demonstrating the need and value of such a resource.

Our goal is to develop similar high-quality databases for other organisms.  BioCyc now contains 7,600 databases, but only 42 of them have undergone any literature-based curation, and that occurs irregularly.  Although bioinformatics algorithms have undergone amazing advances in the past two decades, their accuracy is still limited, and no bioinformatics inference algorithms exist for many types of biological information.  The experimental literature contains vast troves of valuable information, and despite advances in text mining algorithms, curation by experienced biologists is the only way to accurately extract that information.

EcoCyc curators extract a wide range of information on protein function; on metabolic pathways; and on regulation at the transcriptional, translational, and post-translational levels.

In the past year SRI has performed significant curation on the BioCyc databases for Saccharomyces cerevisiae, Bacillus subtilis, Mycobacterium tuberculosis, Clostridium difficile, and (to be released shortly) Corynebacterium glutamicum. All told, BioCyc databases have been curated from 66,000 publications, and constitute a unique resource in the microbial informatics landscape.  Yet much more information remains untapped in the biomedical literature, and new information is published at a rapid pace.  That information can be extracted only by professional curators who understand both the biology, and the methods for encoding that biology in structured databases.  Without adequate financial resources, we cannot hire these curators, whose efforts are needed on an ongoing basis.

Why Do We Seek Financial Support from the Scientific Community?

The EcoCyc project has been fortunate to receive government funding for its development since 1992.  Similar government-supported databases exist for a handful of biomedical model organisms, such as fly, yeast, worm, and zebrafish.

Peter Karp has been advocating that the government fund similar efforts for other important microbes for the past twenty years, such as for pathogens, biotechnology workhorses, model organisms, and synthetic-biology chassis for biofuels development.  He has developed the Pathway Tools software as a software platform to enable the development of curated EcoCyc-like databases for other organisms, and the software has been used by many groups. However, not only has government support for databases not kept pace with the relentless increases in experimental data generation, but the government is funding few new databases, is cutting funding for some existing databases (such as for EcoCyc, for BioCyc, and for TAIR), and is encouraging the development of other funding models for supporting databases [1].  Funding for BioCyc was cut by 27% at our last renewal whereas we are managing five times the number of genomes as five years ago. We also find that even when government agencies want to support databases, review panels score database proposals with low enthusiasm and misunderstanding, despite the obvious demand for high-quality databases by the scientific community.

Put another way: the Haemophilus influenzae genome was sequenced in 1995.  Now, twenty years later, no curated database that is updated on an ongoing basis exists for this important human pathogen.  Mycobacterium tuberculosis was sequenced in 1998, and now, eighteen years later, no comprehensive curated database exists for the genes, metabolism, and regulatory network of this killer of 1.5 million human beings per year.  No curated database exists for the important gram-positive model organism Bacillus subtilis.  How much longer shall we wait for modern resources that integrate the titanic amounts of information available about critical microbes with powerful bioinformatics tools to turbocharge life-science research?

How it Will Work and How You Can Support BioCyc

The tradition whereby scientific journals receive financial support from scientists in the form of subscriptions is a long one.  We are now turning to a similar model to support the curation and operation of BioCyc.  We seek individual and institutional subscriptions from those who receive the most value from BioCyc, and who are best positioned to direct its future evolution.  We have developed a subscription-pricing model that is on par with journal pricing, although we find that many of our users consult BioCyc on a daily basis — more frequently than they consult most journals.

We hope that this subscription model will allow us to raise more funds, more sustainably, than is possible through government grants, through our wide user base in academic, corporate, and government institutions around the world.  We will also be exploring other possible revenue sources, and additional ways of partnering with the scientific community.

BioCyc is collaborating with Phoenix Bioinformatics to develop our community-supported subscription model.  Phoenix is a nonprofit that already manages community financial support for the TAIR Arabidopsis database, which was previously funded by the NSF and is now fully supported [2] by users.

Phoenix Bioinformatics  will collect BioCyc subscriptions on behalf of SRI International, which like Phoenix is a non-profit institution. Subscription revenues will be invested into curation, operation, and marketing of the BioCyc resource.

We plan to go slow with this transition to give our users time to adapt.  We'll begin requiring subscriptions for access to BioCyc databases other than EcoCyc and MetaCyc starting in July 2016.

Access to the EcoCyc and MetaCyc databases will remain free for now.

Subscriptions to the other 7,600 BioCyc databases will be available to institutions (e.g., libraries), and to individuals.  One subscription will grant access to all of BioCyc.  To encourage your institutional library to sign up, please contact your science librarian and let him or her know that continued access to BioCyc is important for your research and/or teaching.

Subscription prices will be based on website usage levels and we hope to keep them affordable so that everyone who needs these databases will still be able to access them.  We are finalizing the academic library and individual prices and will follow up soon with more information, including details on how to sign up. We will make provisions to ensure that underprivileged scientists and students in third-world countries aren't locked out.

Please spread the word to your colleagues — the more groups who subscribe, the better quality resource we can build for the scientific community.

Peter Karp

Director, SRI Bioinformatics Research Group



This Is How Far Sea Level Could Rise Thanks to Climate Change - BBC

Featured News - Mon, 04/11/2016 - 15:45
If climate change continues, we can expect a large rise in sea level this century, and it will only get worse in the centuries to come. The BBC talks with Lamont's Maureen Raymo.


