Disclaimer: I banged this out fast from existing scripts to help some folks, but haven’t tested it yet. I will do that shortly; in the meantime, be careful!
I use phylogenetic placement, namely the program pplacer, in a lot of my publications. It is also a core part of the paprica metabolic inference pipeline. As a result I field a lot of questions from people trying to integrate pplacer into their own workflows. Although the Matsen group has done an excellent job with documentation for pplacer, guppy, and taxtastic, the three programs you need to work with to do phylogenetic placement from start to finish (see also EPA), there is still a steep learning curve for new users. In the hope of bringing the angle of that curve down a notch or two, and updating my previous posts on the subject (here and here), here is a complete, start to finish example of phylogenetic placement, using 16S rRNA gene sequences corresponding to the new tree of life recently published by Hug et al. To follow along with the tutorial, start by downloading the sequences here.
You can use any number of alignment and tree building programs to create a reference tree for phylogenetic placement. I strongly recommend using RAxML and Infernal. After a lot of experimentation this combination seems to produce the most correct topologies and best supported trees. You should be aware that no 16S rRNA gene tree (or any other tree) is absolutely “correct” for domain-level let alone life-level analyses, but nothing in life is perfect. While you’re installing software I also recommend the excellent utility Seqmagick. Finally, you will need a covariance model of the 16S rRNA gene to feed into Infernal. You can find that at the Rfam database here.
The workflow will follow these steps:
- Create an alignment of the reference sequences with Infernal
- Create a phylogenetic tree of the alignment
- Create a reference package from the alignment, tree, and stats file
- Proceed with the phylogenetic placement of your query reads
Create an alignment of the reference sequences
The very first thing that you need to do is clean your sequence names of any wonky punctuation. This is something that trips up almost everyone. You should really have no punctuation in the names beyond “_”, and no spaces!
tr " -" "_" < hug_tol.fasta | tr -d "%\,;():=.\\*\"\'" > hug_tol.clean.fasta
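One thing to keep in mind is that tr acts on every line of the file, sequence lines included, so any "-" or "." characters already present in the sequences are changed as well. The sketch below is an alternative, not the command used in this post: it touches only the header lines, and it assumes a stricter whitelist (letters, digits, and "_") than the character list above. Both function names are hypothetical. The second helper lists any names that became identical after cleaning, which is worth checking before aligning.

```shell
# Hypothetical helper: clean only FASTA header lines, leaving sequences untouched.
# Assumption: names should contain only letters, digits and "_".
clean_fasta_names() {
    awk '
        /^>/ {
            name = substr($0, 2)
            gsub(/[ -]/, "_", name)          # spaces and hyphens become underscores
            gsub(/[^A-Za-z0-9_]/, "", name)  # everything else outside the whitelist is dropped
            print ">" name
            next
        }
        { print }                            # sequence lines pass through unchanged
    ' "$1"
}

# Hypothetical helper: print any header that occurs more than once.
duplicate_names() {
    grep '^>' "$1" | sort | uniq -d
}
```

Typical use would be clean_fasta_names hug_tol.fasta > hug_tol.clean.fasta, followed by duplicate_names hug_tol.clean.fasta, which should print nothing.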
Next create an alignment from the cleaned file. I always like to degap first, although it shouldn’t matter.
## Degap
seqmagick mogrify --ungap hug_tol.clean.fasta
## Align
cmalign --dna -o hug_tol.clean.align.sto --outformat Pfam 16S_bacteria.cm hug_tol.clean.fasta
## Convert to fasta format
seqmagick convert hug_tol.clean.align.sto hug_tol.clean.align.fasta
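Before moving on, it can be worth confirming that every sequence in the converted file has the same aligned length, since tree building expects a rectangular alignment. The following is a minimal sketch using only awk and sort. The function name is hypothetical, and it assumes a FASTA file in which a sequence may be wrapped over several lines.

```shell
# Hypothetical helper: print each distinct sequence length found in a FASTA file.
# A well-formed alignment prints exactly one number.
alignment_widths() {
    awk '
        /^>/ { if (n++) print length(s); s = ""; next }  # new record: report the previous one
        { s = s $0 }                                     # join wrapped sequence lines
        END { if (n) print length(s) }                   # report the final record
    ' "$1" | sort -nu
}
```

Running alignment_widths hug_tol.clean.align.fasta should return a single value. More than one line of output means the records are not all the same length.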
Build the reference tree
At this point you should have a nice clean DNA alignment in the fasta format. Now feed it to RAxML to build a tree. Depending on the size of the alignment this can take a little bit. I know you’ve read the entire RAxML manual so of course you are already aware that adding additional CPUs won’t help…
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -s hug_tol.clean.align.fasta -n ref.tre -f d -p 12345
I like having a sensibly rooted tree; it’s just more pleasing to look at. You can do this manually, or you can have RAxML try to root the tree for you.
raxmlHPC-PTHREADS-AVX2 -T 2 -m GTRGAMMA -f I -t RAxML_bestTree.ref.tre -n root.ref.tre
Okay, now comes the tricky bit. Clearly you’d like to have some support values on your reference tree, but the Taxtastic program that we will use to build the reference package won’t be able to read the RAxML stats file if it includes confidence values. The workaround is to build a second tree with confidence values. You will feed this tree to Taxtastic with the stats file from the tree we already generated.
## Generate confidence scores for tree
raxmlHPC-PTHREADS-AVX2 -T 8 -m GTRGAMMA -f J -p 12345 -t RAxML_rootedTree.root.ref.tre -n conf.root.ref.tre -s hug_tol.clean.align.fasta
Now we can use the alignment, the rooted tree with confidence scores, and the stats file without confidence scores to create our reference package.
taxit create -l 16S_rRNA -P hug_tol.refpkg --aln-fasta hug_tol.clean.align.fasta --tree-stats RAxML_info.ref.tre --tree-file RAxML_fastTreeSH_Support.conf.root.ref.tre
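The three inputs to taxit create come from three different steps (the alignment from cmalign, the stats file from the first RAxML run, and the tree from the confidence-score run), so it is easy to pass a file that is missing or empty. A small guard like the one below can catch that before the reference package is built. The function name is hypothetical and the check is only a sketch: it confirms that each file exists and is non-empty, not that its contents are valid.

```shell
# Hypothetical helper: report any file that is missing or empty.
# Returns non-zero if at least one file fails the check.
require_files() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For this tutorial the call would be require_files hug_tol.clean.align.fasta RAxML_info.ref.tre RAxML_fastTreeSH_Support.conf.root.ref.tre, run just before taxit create.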
Align the query reads
At this point you have the reference package and you can proceed with analyzing some query reads! The first step is to align the query reads in exactly the same fashion as the reference sequences. This is important as the alignments will be merged later.
## Clean the names
tr " -" "_" < query.fasta | tr -d "%\,;():=.\\*\"\'" > query.clean.fasta
## Remove any gaps
seqmagick mogrify --ungap query.clean.fasta
## Align
cmalign --dna -o query.clean.align.sto --outformat Pfam 16S_bacteria.cm query.clean.fasta
Now we use the esl-alimerge command, included with Infernal, to merge the query and reference alignments. The reference alignment used here is the Stockholm file produced by cmalign earlier; the reference package itself only holds the fasta version.
## Merge alignments
esl-alimerge --outformat pfam --dna -o query.hug_tol.clean.align.sto query.clean.align.sto hug_tol.clean.align.sto
## Convert to fasta
seqmagick convert query.hug_tol.clean.align.sto query.hug_tol.clean.align.fasta
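A quick way to confirm the merge did what you expect is to check that the merged alignment contains every reference sequence plus every query read. The sketch below counts FASTA header lines in the three files. Both function names are hypothetical, and it assumes all three inputs are in FASTA format.

```shell
# Hypothetical helper: count FASTA records by counting header lines.
count_seqs() {
    grep -c '^>' "$1"
}

# Hypothetical helper: $1 = reference FASTA, $2 = query FASTA, $3 = merged FASTA.
check_merge() {
    ref=$(count_seqs "$1")
    qry=$(count_seqs "$2")
    merged=$(count_seqs "$3")
    if [ "$merged" -eq $((ref + qry)) ]; then
        echo "ok: $merged sequences"
    else
        echo "mismatch: $ref reference + $qry query != $merged merged" >&2
        return 1
    fi
}
```

Here that would be check_merge hug_tol.clean.align.fasta query.clean.fasta query.hug_tol.clean.align.fasta.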
Now we’re on the home stretch and can execute the phylogenetic placement itself! The flags are important here, so it’s worth checking the pplacer documentation to ensure that your goals are consistent with mine (get a job, publish some papers?). You can probably accept most of the flags for the previous commands as is.
pplacer -o query.hug_tol.clean.align.jplace -p --keep-at-most 20 -c hug_tol.refpkg query.hug_tol.clean.align.fasta
At this point you have a file named query.hug_tol.clean.align.jplace. You will need to use guppy to convert this JSON-format file to information that is readable by a human. The two most useful guppy commands (in my experience) for a basic look at your data are:
## Generate an easily parsed csv file of placements, with only a single placement reported for each
## query read.
guppy to_csv --point-mass --pp -o query.hug_tol.clean.align.csv query.hug_tol.clean.align.jplace
## Generate a phyloxml tree with edges fattened according to the number of placements.
guppy fat --node-numbers --point-mass --pp -o query.hug_tol.clean.align.phyloxml query.hug_tol.clean.align.jplace
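Once you have the csv file, a first summary is to tally how many reads were placed on each edge of the reference tree. The sketch below does that with awk. The function name is hypothetical, and it assumes the csv has a header row containing a column named edge_num and no quoted fields with embedded commas; check the header of your own file before relying on it.

```shell
# Hypothetical helper: tally placements per edge from a placement csv.
# Assumption: the header row contains a column named "edge_num".
placements_per_edge() {
    awk -F',' '
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "edge_num") col = i
            if (!col) { print "no edge_num column found" > "/dev/stderr"; exit 1 }
            next
        }
        { count[$col]++ }
        END { for (e in count) print e "," count[e] }
    ' "$1" | sort -t',' -k2,2nr -k1,1n   # most heavily populated edges first
}
```

Calling placements_per_edge query.hug_tol.clean.align.csv prints one edge,count pair per line, which can be compared against the edge numbers shown in the phyloxml tree.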
Or rather the lack thereof. I was very disappointed to receive an email yesterday that BioCyc, a popular database of enzymes and metabolic pathways in model organisms, is moving to a subscription model. The email is posted below in its entirety (to assist in the request to spread the word). With this move BioCyc cements the pattern established several years ago by the Kyoto Encyclopedia of Genes and Genomes (KEGG). I don’t begrudge the architects of these databases for moving behind a paywall; as the BioCyc principal investigator Peter Karp notes below, developing and maintaining a high-quality biological database is time and resource intensive, and skilled curators are expensive. I do think, however, that moving these resources to a subscription-based model does a disservice to the research community and is not in the public interest. While it was not the responsibility of US funding agencies to ensure the long-term viability and accessibility of KEGG, BioCyc is in their court. In my opinion the failure of NSF and NIH to adequately support community resources amounts to a dereliction of duty.
As noted in the letter below one of the challenges faced by database designers and curators is the tradition of peer review. At its best the peer review process is an effective arbitrator of research spending. I think that the peer review process is at its best only under certain funding regimes however, and suspect that the current low rates of federal funding for science do not allow for fair and unbiased peer review. This is particularly the case for databases and projects whose return on investment is not evident in the short term in one or two high-impact publications, but in the supporting role played in dozens or hundreds or thousands of studies across the community over many years. Program officers, managers, and directors need to be cognizant of the limitations of the peer review process, and not shy away from some strategic thinking every now and again.
My first experience with BioCyc came in my first year of graduate school, when I was tentatively grappling with the relationship between gene and genome functions and the ecology of cold-adapted microbes. Like many academic labs around the country the lab I was in was chronically short on funds, and an academic license for expensive software (e.g. CLC Workbench) or a database subscription (à la KEGG or BioCyc – both were thankfully free back in those days) would have been out of the question. Without these tools I simply would have had nowhere to start my exploration. I fear that the subscription model creates intellectual barriers, potentially partitioning good ideas (which might arise anywhere) from the tools required to develop them (which will only be found in well-funded or specialist labs).
Viva community science!
I am writing to request your support as we begin a new chapter in the development of BioCyc.
In short, we plan to upgrade the curation level and quality of many BioCyc databases to provide you with higher quality information resources for many important microbes, and for Homo sapiens. Such an effort requires large financial resources that — despite numerous attempts over numerous years — have not been forthcoming from government funding agencies. Thus, we plan to transition BioCyc to a community-supported non-profit subscription model in the coming months.
Our goal at BioCyc is to provide you with the highest quality microbial genome and metabolic pathway web portal in the world by coupling unique and high-quality database content with powerful and user-friendly bioinformatics tools.
Our work on EcoCyc has demonstrated the way forward. EcoCyc is an incredibly rich and detailed information resource whose contents have been derived from 30,000 E. coli publications. EcoCyc is an online electronic encyclopedia, a highly structured queryable database, a bioinformatics platform for omics data analysis, and an executable metabolic model. EcoCyc is highly used by the life-sciences community, demonstrating the need and value of such a resource.
Our goal is to develop similar high-quality databases for other organisms. BioCyc now contains 7,600 databases, but only 42 of them have undergone any literature-based curation, and that occurs irregularly. Although bioinformatics algorithms have undergone amazing advances in the past two decades, their accuracy is still limited, and no bioinformatics inference algorithms exist for many types of biological information. The experimental literature contains vast troves of valuable information, and despite advances in text mining algorithms, curation by experienced biologists is the only way to accurately extract that information.
EcoCyc curators extract a wide range of information on protein function; on metabolic pathways; and on regulation at the transcriptional, translational, and post-translational levels.
In the past year SRI has performed significant curation on the BioCyc databases for Saccharomyces cerevisiae, Bacillus subtilis, Mycobacterium tuberculosis, Clostridium difficile, and (to be released shortly) Corynebacterium glutamicum. All told, BioCyc databases have been curated from 66,000 publications, and constitute a unique resource in the microbial informatics landscape. Yet much more information remains untapped in the biomedical literature, and new information is published at a rapid pace. That information can be extracted only by professional curators who understand both the biology, and the methods for encoding that biology in structured databases. Without adequate financial resources, we cannot hire these curators, whose efforts are needed on an ongoing basis.
Why Do We Seek Financial Support from the Scientific Community?
The EcoCyc project has been fortunate to receive government funding for its development since 1992. Similar government-supported databases exist for a handful of biomedical model organisms, such as fly, yeast, worm, and zebrafish.
Peter Karp has been advocating that the government fund similar efforts for other important microbes for the past twenty years, such as for pathogens, biotechnology workhorses, model organisms, and synthetic-biology chassis for biofuels development. He has developed the Pathway Tools software as a software platform to enable the development of curated EcoCyc-like databases for other organisms, and the software has been used by many groups. However, not only has government support for databases not kept pace with the relentless increases in experimental data generation, but the government is funding few new databases, is cutting funding for some existing databases (such as for EcoCyc, for BioCyc, and for TAIR), and is encouraging the development of other funding models for supporting databases. Funding for BioCyc was cut by 27% at our last renewal whereas we are managing five times the number of genomes as five years ago. We also find that even when government agencies want to support databases, review panels score database proposals with low enthusiasm and misunderstanding, despite the obvious demand for high-quality databases by the scientific community.
Put another way: the Haemophilus influenzae genome was sequenced in 1995. Now, twenty years later, no curated database that is updated on an ongoing basis exists for this important human pathogen. Mycobacterium tuberculosis was sequenced in 1998, and now, eighteen years later, no comprehensive curated database exists for the genes, metabolism, and regulatory network of this killer of 1.5 million human beings per year. No curated database exists for the important gram-positive model organism Bacillus subtilis. How much longer shall we wait for modern resources that integrate the titanic amounts of information available about critical microbes with powerful bioinformatics tools to turbocharge life-science research?
How it Will Work and How You Can Support BioCyc
The tradition whereby scientific journals receive financial support from scientists in the form of subscriptions is a long one. We are now turning to a similar model to support the curation and operation of BioCyc. We seek individual and institutional subscriptions from those who receive the most value from BioCyc, and who are best positioned to direct its future evolution. We have developed a subscription-pricing model that is on par with journal pricing, although we find that many of our users consult BioCyc on a daily basis — more frequently than they consult most journals.
We hope that this subscription model will allow us to raise more funds, more sustainably, than is possible through government grants, through our wide user base in academic, corporate, and government institutions around the world. We will also be exploring other possible revenue sources, and additional ways of partnering with the scientific community.
BioCyc is collaborating with Phoenix Bioinformatics to develop our community-supported subscription model. Phoenix is a nonprofit that already manages community financial support for the TAIR Arabidopsis database, which was previously funded by the NSF and is now fully supported by users.
Phoenix Bioinformatics will collect BioCyc subscriptions on behalf of SRI International, which like Phoenix is a non-profit institution. Subscription revenues will be invested into curation, operation, and marketing of the BioCyc resource.
We plan to go slow with this transition to give our users time to adapt. We’ll begin requiring subscriptions for access to BioCyc databases other than EcoCyc and MetaCyc starting in July 2016.
Access to the EcoCyc and MetaCyc databases will remain free for now.
Subscriptions to the other 7,600 BioCyc databases will be available to institutions (e.g., libraries), and to individuals. One subscription will grant access to all of BioCyc. To encourage your institutional library to sign up, please contact your science librarian and let him or her know that continued access to BioCyc is important for your research and/or teaching.
Subscription prices will be based on website usage levels and we hope to keep them affordable so that everyone who needs these databases will still be able to access them. We are finalizing the academic library and individual prices and will follow up soon with more information including details on how to sign up. We will make provisions to ensure that underprivileged scientists and students in third-world countries aren’t locked out.
Please spread the word to your colleagues — the more groups who subscribe, the better quality resource we can build for the scientific community.
Director, SRI Bioinformatics Research Group