Or rather the lack thereof. I was very disappointed to receive an email yesterday that BioCyc, a popular database of enzymes and metabolic pathways in model organisms, is moving to a subscription model. The email is posted below in its entirety (to assist in the request to spread the word). With this move BioCyc cements the pattern established several years ago by the Kyoto Encyclopedia of Genes and Genomes (KEGG). I don’t begrudge the architects of these databases their move behind a paywall; as the BioCyc principal investigator Peter Karp notes below, developing and maintaining a high-quality biological database is time and resource intensive, and skilled curators are expensive. I do think, however, that moving these resources to a subscription-based model does a disservice to the research community and is not in the public interest. While it was not the responsibility of US funding agencies to ensure the long-term viability and accessibility of KEGG, BioCyc is in their court. In my opinion the failure of NSF and NIH to adequately support community resources amounts to a dereliction of duty.
As noted in the letter below, one of the challenges faced by database designers and curators is the tradition of peer review. At its best the peer review process is an effective arbiter of research spending. I think the peer review process is at its best only under certain funding regimes, however, and I suspect that the current low rates of federal funding for science do not allow for fair and unbiased peer review. This is particularly the case for databases and projects whose return on investment is not evident in the short term in one or two high-impact publications, but in the supporting role played in dozens, hundreds, or thousands of studies across the community over many years. Program officers, managers, and directors need to be cognizant of the limitations of the peer review process, and should not shy away from some strategic thinking every now and again.
My first experience with BioCyc came in my first year of graduate school, when I was tentatively grappling with the relationship between gene and genome functions and the ecology of cold-adapted microbes. Like many academic labs around the country, the lab I was in was chronically short on funds, and an academic license for expensive software (e.g. CLC Workbench) or a database subscription (à la KEGG or BioCyc – both were thankfully free back in those days) would have been out of the question. Without these tools I simply would have had nowhere to start my exploration. I fear that the subscription model creates intellectual barriers, potentially partitioning good ideas (which might arise anywhere) from the tools required to develop them (which will only be found in well-funded or specialist labs).
Viva community science!
I am writing to request your support as we begin a new chapter in the development of BioCyc.
In short, we plan to upgrade the curation level and quality of many BioCyc databases to provide you with higher quality information resources for many important microbes, and for Homo sapiens. Such an effort requires large financial resources that — despite numerous attempts over numerous years — have not been forthcoming from government funding agencies. Thus, we plan to transition BioCyc to a community-supported non-profit subscription model in the coming months.
Our goal at BioCyc is to provide you with the highest quality microbial genome and metabolic pathway web portal in the world by coupling unique and high-quality database content with powerful and user-friendly bioinformatics tools.
Our work on EcoCyc has demonstrated the way forward. EcoCyc is an incredibly rich and detailed information resource whose contents have been derived from 30,000 E. coli publications. EcoCyc is an online electronic encyclopedia, a highly structured queryable database, a bioinformatics platform for omics data analysis, and an executable metabolic model. EcoCyc is highly used by the life-sciences community, demonstrating the need and value of such a resource.
Our goal is to develop similar high-quality databases for other organisms. BioCyc now contains 7,600 databases, but only 42 of them have undergone any literature-based curation, and that occurs irregularly. Although bioinformatics algorithms have undergone amazing advances in the past two decades, their accuracy is still limited, and no bioinformatics inference algorithms exist for many types of biological information. The experimental literature contains vast troves of valuable information, and despite advances in text mining algorithms, curation by experienced biologists is the only way to accurately extract that information.
EcoCyc curators extract a wide range of information on protein function; on metabolic pathways; and on regulation at the transcriptional, translational, and post-translational levels.
In the past year SRI has performed significant curation on the BioCyc databases for Saccharomyces cerevisiae, Bacillus subtilis, Mycobacterium tuberculosis, Clostridium difficile, and (to be released shortly) Corynebacterium glutamicum. All told, BioCyc databases have been curated from 66,000 publications, and constitute a unique resource in the microbial informatics landscape. Yet much more information remains untapped in the biomedical literature, and new information is published at a rapid pace. That information can be extracted only by professional curators who understand both the biology, and the methods for encoding that biology in structured databases. Without adequate financial resources, we cannot hire these curators, whose efforts are needed on an ongoing basis.
Why Do We Seek Financial Support from the Scientific Community?
The EcoCyc project has been fortunate to receive government funding for its development since 1992. Similar government-supported databases exist for a handful of biomedical model organisms, such as fly, yeast, worm, and zebrafish.
For the past twenty years Peter Karp has been advocating that the government fund similar efforts for other important microbes, such as pathogens, biotechnology workhorses, model organisms, and synthetic-biology chassis for biofuels development. He has developed the Pathway Tools software as a platform to enable the development of curated EcoCyc-like databases for other organisms, and the software has been used by many groups. However, not only has government support for databases not kept pace with the relentless increases in experimental data generation, but the government is funding few new databases, is cutting funding for some existing databases (such as for EcoCyc, for BioCyc, and for TAIR), and is encouraging the development of other funding models for supporting databases. Funding for BioCyc was cut by 27% at our last renewal even though we are managing five times the number of genomes as five years ago. We also find that even when government agencies want to support databases, review panels score database proposals with low enthusiasm and misunderstanding, despite the obvious demand for high-quality databases by the scientific community.
Put another way: the Haemophilus influenzae genome was sequenced in 1995. Now, twenty years later, no curated database that is updated on an ongoing basis exists for this important human pathogen. Mycobacterium tuberculosis was sequenced in 1998, and now, eighteen years later, no comprehensive curated database exists for the genes, metabolism, and regulatory network of this killer of 1.5 million human beings per year. No comprehensive curated database exists for the important gram-positive model organism Bacillus subtilis. How much longer shall we wait for modern resources that integrate the titanic amounts of information available about critical microbes with powerful bioinformatics tools to turbocharge life-science research?
How it Will Work and How You Can Support BioCyc
The tradition whereby scientific journals receive financial support from scientists in the form of subscriptions is a long one. We are now turning to a similar model to support the curation and operation of BioCyc. We seek individual and institutional subscriptions from those who receive the most value from BioCyc, and who are best positioned to direct its future evolution. We have developed a subscription-pricing model that is on par with journal pricing, although we find that many of our users consult BioCyc on a daily basis — more frequently than they consult most journals.
We hope that this subscription model, drawing on our wide user base in academic, corporate, and government institutions around the world, will allow us to raise more funds, more sustainably, than is possible through government grants. We will also be exploring other possible revenue sources, and additional ways of partnering with the scientific community.
BioCyc is collaborating with Phoenix Bioinformatics to develop our community-supported subscription model. Phoenix is a nonprofit that already manages community financial support for the TAIR Arabidopsis database, which was previously funded by the NSF and is now fully supported by users.
Phoenix Bioinformatics will collect BioCyc subscriptions on behalf of SRI International, which like Phoenix is a non-profit institution. Subscription revenues will be invested into curation, operation, and marketing of the BioCyc resource.
We plan to go slow with this transition to give our users time to adapt. We’ll begin requiring subscriptions for access to BioCyc databases other than EcoCyc and MetaCyc starting in July 2016.
Access to the EcoCyc and MetaCyc databases will remain free for now.
Subscriptions to the other 7,600 BioCyc databases will be available to institutions (e.g., libraries), and to individuals. One subscription will grant access to all of BioCyc. To encourage your institutional library to sign up, please contact your science librarian and let him or her know that continued access to BioCyc is important for your research and/or teaching.
Subscription prices will be based on website usage levels, and we hope to keep them affordable so that everyone who needs these databases will still be able to access them. We are finalizing the academic library and individual prices and will follow up soon with more information, including details on how to sign up. We will make provisions to ensure that underprivileged scientists and students in third-world countries aren’t locked out.
Please spread the word to your colleagues — the more groups who subscribe, the better quality resource we can build for the scientific community.
Director, SRI Bioinformatics Research Group
This tutorial is both a work in progress and a living document. If you see an error, or want something added, please let me know by leaving a comment.
Starting with version 3.0.0 paprica contains a metagenomic annotation module. This module takes as input a fasta or fasta.gz file containing the QC’d reads from a shotgun metagenome and uses DIAMOND Blastx to classify these reads against a database derived from the paprica database. The module produces as output:
- Classification for each read in the form of an EC number (obviously this applies only to reads associated with genes coding for enzymes).
- A tally of the occurrence of each EC number in the sample, with some useful supporting information.
- Optionally, the metabolic pathways likely to be present within the sample.
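The tallying step above can be sketched in a few lines. This is a hypothetical illustration only (the function name, data layout, and EC numbers are invented for the example; paprica's actual implementation differs):

```python
from collections import Counter

def tally_ec(read_assignments):
    """Aggregate per-read EC assignments into sample-level counts.

    read_assignments: iterable of (read_id, ec_number) pairs; reads
    without an enzymatic hit carry ec_number=None and are skipped.
    """
    return Counter(ec for _, ec in read_assignments if ec is not None)

# Hypothetical per-read classifications: three enzymatic hits, one miss.
hits = [("read1", "2.7.7.6"), ("read2", "1.1.1.1"),
        ("read3", "2.7.7.6"), ("read4", None)]
print(tally_ec(hits))  # Counter({'2.7.7.6': 2, '1.1.1.1': 1})
```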
In addition to the normal paprica-run.sh dependencies, paprica-mg requires DIAMOND Blastx. Follow the instructions in the DIAMOND manual, and be sure to add the location of the DIAMOND executables to your PATH. If you want to predict metabolic pathways on your metagenome you will also need to download the pathway-tools software. See the notes here.
There are two ways to obtain the paprica-mg database. You can obtain a pre-made version of the database by downloading the files paprica-mg.dmnd and paprica-mg.ec.csv.gz (large!) to the ref_genome_database directory. Be sure to gunzip paprica-mg.ec.csv.gz before continuing. If you wish to build the paprica-mg database from scratch, perhaps because you’ve customized that database or are building it more frequently than the release cycle, you will need to first build the regular paprica database. Then build the paprica-mg database as such:

paprica-mg_build.py -ref_dir ref_genome_database
If you’ve set paprica up in the standard way you can execute this command from anywhere on your system; the paprica directory is already in your PATH, and the script will look for the directory “ref_genome_database” relative to itself. Likewise you don’t need to be in the paprica directory to execute the paprica-mg_run.py script.
Once you’ve downloaded or built the database you can run your analysis. It is worth spending a little time with the DIAMOND manual and considering the parameters of your system. To try things out you can download a “smallish” QC’d metagenome from the Tara Oceans Expedition (selected randomly for convenient size):

## download a test metagenome
wget http://www.polarmicrobes.org/extras/ERR318619_1.qc.fasta.gz

## execute paprica-mg for EC annotation only
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways F
This will produce the following output:
test.annotation.csv: The number of hits in the metagenome, by EC number. See the paprica manual for a complete explanation of columns.
test.paprica-mg.nr.daa: The DIAMOND format results file. Only one hit per read is reported.
test.paprica-mg.nr.txt: A text file of the DIAMOND results. Only one hit per read is reported.
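For a quick first look at the annotation results, you might rank EC numbers by hit count. The snippet below is a minimal sketch that uses an inline stand-in for test.annotation.csv; the column names here are illustrative, so check the paprica manual for the actual header:

```python
import csv
import io

# Inline stand-in for test.annotation.csv; real column names may differ.
sample_csv = """ec_number,n_hits
2.7.7.6,120
1.1.1.1,45
6.3.4.5,210
"""

# Parse the CSV and sort EC numbers by descending hit count.
reader = csv.DictReader(io.StringIO(sample_csv))
rows = sorted(reader, key=lambda r: int(r["n_hits"]), reverse=True)
for row in rows:
    print(row["ec_number"], row["n_hits"])
```

For the real output file, replace the io.StringIO stand-in with open("test.annotation.csv").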
Predicting pathways on a metagenome is very time intensive, and it isn’t clear what the “correct” way to do this is. I’ve tried to balance speed with accuracy in paprica-mg. If you execute with -pathways T, DIAMOND is executed twice: once for the EC number annotation as above (reporting only a single hit for each read), and once to collect data for pathway prediction. On that second search DIAMOND reports as many hits for each read as there are genomes in the paprica database; of course most reads will have far fewer (if any) hits. The reason for this approach is to reconstruct, as well as possible, the genomes that are actually present. For example, suppose a given read has a significant hit to an enzyme in both genome A and genome B. When information is aggregated for pathway-tools, that enzyme is presented in separate Genbank files for genome A and genome B, each representing a separate genetic element. A second enzyme critical to the prediction of a pathway might be found only in genome A or only in genome B, and because a missing enzyme in either genome could cause a negative prediction, keeping the genomes separate gives the best chance of capturing the whole pathway. The idea is that incomplete pathways will get washed out at the end of the analysis, and since pathway prediction is by its nature non-redundant (each pathway can only be predicted once), over-prediction is minimized. To predict pathways during annotation:

## execute paprica-mg for EC annotation and pathway prediction
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways T -pgdb_dir /location/of/ptools-local
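The genome-by-genome aggregation described above can be sketched as follows. This is a simplified, hypothetical illustration (paprica actually writes a Genbank file per genome for pathway-tools, not Python dicts, and the function and data here are invented for the example):

```python
from collections import defaultdict

def group_hits_by_genome(hits):
    """Group enzyme hits by source genome so that each genome can be
    presented to pathway prediction as its own genetic element.

    hits: iterable of (read_id, genome_id, ec_number) triples.
    Returns a dict mapping genome_id to the set of EC numbers found.
    """
    genomes = defaultdict(set)
    for _, genome_id, ec in hits:
        genomes[genome_id].add(ec)
    return dict(genomes)

# One read hits the same enzyme in genomes A and B; a second enzyme,
# needed to complete a pathway, is found only in genome A.
hits = [("r1", "A", "2.7.7.6"), ("r1", "B", "2.7.7.6"),
        ("r2", "A", "1.1.1.1")]
print(group_hits_by_genome(hits))
```

Under this scheme genome A carries both enzymes while genome B carries only one, so a pathway requiring both would be predicted for A but not B, which is the behavior the paragraph above is aiming for.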
In addition to the files already mentioned, you will see:
test_mg.pathologic: a directory containing all the information that pathway-tools needs for pathway prediction.
test.pathways.txt: A simple text file of all the pathways that were predicted.
test.paprica-mg.txt: A very large text file of all hits for each read. You probably want to delete this right away to save space.
test.paprica-mg.daa: A very large DIAMOND results file of all hits for each read. You probably want to delete this right away to save space.
testcyc: A directory in ptools-local/pgdbs/local containing the PGDB and prediction reports. It is worth spending some time here, and interacting with the PGDB using the pathway-tools GUI.