Or rather the lack thereof. I was very disappointed to receive an email yesterday that BioCyc, a popular database of enzymes and metabolic pathways in model organisms, is moving to a subscription model. The email is posted below in its entirety (to assist in the request to spread the word). With this move BioCyc cements the pattern established several years ago by the Kyoto Encyclopedia of Genes and Genomes (KEGG). I don’t begrudge the architects of these databases for moving behind a paywall; as the BioCyc principal investigator Peter Karp notes below, developing and maintaining a high-quality biological database is time and resource intensive, and skilled curators are expensive. I do think, however, that moving these resources to a subscription-based model does a disservice to the research community and is not in the public interest. While it was not the responsibility of US funding agencies to ensure the long-term viability and accessibility of KEGG, BioCyc is squarely in their court. In my opinion the failure of NSF and NIH to adequately support community resources amounts to a dereliction of duty.
As noted in the letter below, one of the challenges faced by database designers and curators is the tradition of peer review. At its best the peer review process is an effective arbiter of research spending. I think that the peer review process is at its best only under certain funding regimes, however, and suspect that the current low rates of federal funding for science do not allow for fair and unbiased peer review. This is particularly the case for databases and projects whose return on investment is not evident in the short term in one or two high-impact publications, but in the supporting role played in dozens, hundreds, or thousands of studies across the community over many years. Program officers, managers, and directors need to be cognizant of the limitations of the peer review process, and not shy away from some strategic thinking every now and again.
My first experience with BioCyc came in my first year of graduate school, when I was tentatively grappling with the relationship between gene and genome functions and the ecology of cold-adapted microbes. Like many academic labs around the country, the lab I was in was chronically short on funds, and an academic license for expensive software (e.g. CLC Workbench) or a database subscription (à la KEGG or BioCyc – both were thankfully free back in those days) would have been out of the question. Without these tools I simply would have had nowhere to start my exploration. I fear that the subscription model creates intellectual barriers, potentially partitioning good ideas (which might arise anywhere) from the tools required to develop them (which will only be found in well-funded or specialist labs).
Viva community science!
I am writing to request your support as we begin a new chapter in the development of BioCyc.
In short, we plan to upgrade the curation level and quality of many BioCyc databases to provide you with higher quality information resources for many important microbes, and for Homo sapiens. Such an effort requires large financial resources that — despite numerous attempts over numerous years — have not been forthcoming from government funding agencies. Thus, we plan to transition BioCyc to a community-supported non-profit subscription model in the coming months.
Our goal at BioCyc is to provide you with the highest quality microbial genome and metabolic pathway web portal in the world by coupling unique and high-quality database content with powerful and user-friendly bioinformatics tools.
Our work on EcoCyc has demonstrated the way forward. EcoCyc is an incredibly rich and detailed information resource whose contents have been derived from 30,000 E. coli publications. EcoCyc is an online electronic encyclopedia, a highly structured queryable database, a bioinformatics platform for omics data analysis, and an executable metabolic model. EcoCyc is highly used by the life-sciences community, demonstrating the need and value of such a resource.
Our goal is to develop similar high-quality databases for other organisms. BioCyc now contains 7,600 databases, but only 42 of them have undergone any literature-based curation, and that occurs irregularly. Although bioinformatics algorithms have undergone amazing advances in the past two decades, their accuracy is still limited, and no bioinformatics inference algorithms exist for many types of biological information. The experimental literature contains vast troves of valuable information, and despite advances in text mining algorithms, curation by experienced biologists is the only way to accurately extract that information.
EcoCyc curators extract a wide range of information on protein function; on metabolic pathways; and on regulation at the transcriptional, translational, and post-translational levels.
In the past year SRI has performed significant curation on the BioCyc databases for Saccharomyces cerevisiae, Bacillus subtilis, Mycobacterium tuberculosis, Clostridium difficile, and (to be released shortly) Corynebacterium glutamicum. All told, BioCyc databases have been curated from 66,000 publications, and constitute a unique resource in the microbial informatics landscape. Yet much more information remains untapped in the biomedical literature, and new information is published at a rapid pace. That information can be extracted only by professional curators who understand both the biology, and the methods for encoding that biology in structured databases. Without adequate financial resources, we cannot hire these curators, whose efforts are needed on an ongoing basis.
Why Do We Seek Financial Support from the Scientific Community?
The EcoCyc project has been fortunate to receive government funding for its development since 1992. Similar government-supported databases exist for a handful of biomedical model organisms, such as fly, yeast, worm, and zebrafish.
Peter Karp has been advocating that the government fund similar efforts for other important microbes for the past twenty years, such as for pathogens, biotechnology workhorses, model organisms, and synthetic-biology chassis for biofuels development. He has developed the Pathway Tools software as a software platform to enable the development of curated EcoCyc-like databases for other organisms, and the software has been used by many groups. However, not only has government support for databases not kept pace with the relentless increases in experimental data generation, but the government is funding few new databases, is cutting funding for some existing databases (such as for EcoCyc, for BioCyc, and for TAIR), and is encouraging the development of other funding models for supporting databases. Funding for BioCyc was cut by 27% at our last renewal, while we are now managing five times as many genomes as we were five years ago. We also find that even when government agencies want to support databases, review panels score database proposals with low enthusiasm and misunderstanding, despite the obvious demand for high-quality databases by the scientific community.
Put another way: the Haemophilus influenzae genome was sequenced in 1995. Now, twenty years later, no curated database that is updated on an ongoing basis exists for this important human pathogen. Mycobacterium tuberculosis was sequenced in 1998, and now, eighteen years later, no comprehensive curated database exists for the genes, metabolism, and regulatory network of this killer of 1.5 million human beings per year. No curated database exists for the important gram-positive model organism Bacillus subtilis. How much longer shall we wait for modern resources that integrate the titanic amounts of information available about critical microbes with powerful bioinformatics tools to turbocharge life-science research?
How it Will Work and How You Can Support BioCyc
The tradition whereby scientific journals receive financial support from scientists in the form of subscriptions is a long one. We are now turning to a similar model to support the curation and operation of BioCyc. We seek individual and institutional subscriptions from those who receive the most value from BioCyc, and who are best positioned to direct its future evolution. We have developed a subscription-pricing model that is on par with journal pricing, although we find that many of our users consult BioCyc on a daily basis — more frequently than they consult most journals.
We hope that this subscription model will allow us to raise more funds, more sustainably, than is possible through government grants, through our wide user base in academic, corporate, and government institutions around the world. We will also be exploring other possible revenue sources, and additional ways of partnering with the scientific community.
BioCyc is collaborating with Phoenix Bioinformatics to develop our community-supported subscription model. Phoenix is a nonprofit that already manages community financial support for the TAIR Arabidopsis database, which was previously funded by the NSF and is now fully supported by users.
Phoenix Bioinformatics will collect BioCyc subscriptions on behalf of SRI International, which like Phoenix is a non-profit institution. Subscription revenues will be invested into curation, operation, and marketing of the BioCyc resource.
We plan to go slow with this transition to give our users time to adapt. We’ll begin requiring subscriptions for access to BioCyc databases other than EcoCyc and MetaCyc starting in July 2016.
Access to the EcoCyc and MetaCyc databases will remain free for now.
Subscriptions to the other 7,600 BioCyc databases will be available to institutions (e.g., libraries), and to individuals. One subscription will grant access to all of BioCyc. To encourage your institutional library to sign up, please contact your science librarian and let him or her know that continued access to BioCyc is important for your research and/or teaching.
Subscription prices will be based on website usage levels and we hope to keep them affordable so that everyone who needs these databases will still be able to access them. We are finalizing the academic library and individual prices and will follow up soon with more information including details on how to sign up. We will make provisions to ensure that underprivileged scientists and students in third-world countries aren’t locked out.
Please spread the word to your colleagues — the more groups who subscribe, the better quality resource we can build for the scientific community.
Director, SRI Bioinformatics Research Group
This tutorial is both a work in progress and a living document. If you see an error, or want something added, please let me know by leaving a comment.
Starting with version 3.0.0 paprica contains a metagenomic annotation module. This module takes as input a fasta or fasta.gz file containing the QC’d reads from a shotgun metagenome and uses DIAMOND Blastx to classify these reads against a database derived from the paprica database. The module produces as output:
- Classification for each read in the form of an EC number (obviously this applies only to reads associated with genes coding for enzymes).
- A tally of the occurrence of each EC number in the sample, with some useful supporting information.
- Optionally, the metabolic pathways likely to be present within the sample.
In addition to the normal paprica-run.sh dependencies paprica-mg requires DIAMOND Blastx. Follow the instructions in the DIAMOND manual, and be sure to add the location of the DIAMOND executables to your PATH. If you want to predict metabolic pathways on your metagenome you will need to also download the pathway-tools software. See the notes here.
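A quick way to confirm that the dependencies are actually visible on your PATH before launching a long run is to check for them programmatically. Here’s a minimal sketch; the executable names “diamond” and “pathway-tools” are assumptions, so adjust them to match your install:

```python
import shutil

def missing_executables(names):
    """Return the subset of executable names not found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# Executable names below are assumptions; check your own install.
missing = missing_executables(["diamond", "pathway-tools"])
if missing:
    print("Not found on PATH: " + ", ".join(missing))
```

If anything is reported missing, revisit your PATH before running paprica-mg.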
There are two ways to obtain the paprica-mg database. You can obtain a pre-made version of the database by downloading the files paprica-mg.dmnd and paprica-mg.ec.csv.gz (large!) to the ref_genome_database directory. Be sure to gunzip paprica-mg.ec.csv.gz before continuing. If you wish to build the paprica-mg database from scratch, perhaps because you’ve customized that database or are building it more frequently than the release cycle, you will need to first build the regular paprica database. Then build the paprica-mg database as such:

paprica-mg_build.py -ref_dir ref_genome_database
If you’ve set paprica up in the standard way you can execute this command from anywhere on your system; the paprica directory is already in your PATH, and the script will look for the directory “ref_genome_database” relative to itself. Likewise you don’t need to be in the paprica directory to execute the paprica-mg_run.py script.
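The relative lookup works roughly like the sketch below; this is a simplification for illustration (the function name and the details of paprica’s actual path handling are assumptions), but it shows why the command works from any working directory:

```python
from pathlib import Path

def default_ref_dir(script_path, ref_dir_name="ref_genome_database"):
    """Resolve the reference database directory relative to the script's
    own location, rather than the caller's working directory."""
    return Path(script_path).parent / ref_dir_name

# hypothetical install location, for illustration only
print(default_ref_dir("/opt/paprica/paprica-mg_run.py"))
```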
Once you’ve downloaded or built the database you can run your analysis. It is worth spending a little time with the DIAMOND manual and considering the parameters of your system. To try things out you can download a “smallish” QC’d metagenome from the Tara Oceans Expedition (selected randomly for convenient size):

## download a test metagenome
wget http://www.polarmicrobes.org/extras/ERR318619_1.qc.fasta.gz

## execute paprica-mg for EC annotation only
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways F
This will produce the following output:
test.annotation.csv: The number of hits in the metagenome, by EC number. See the paprica manual for a complete explanation of columns.
test.paprica-mg.nr.daa: The DIAMOND format results file. Only one hit per read is reported.
test.paprica-mg.nr.txt: A text file of the DIAMOND results. Only one hit per read is reported.
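Once you have test.annotation.csv, a common first step is to pull out the most abundant EC numbers. A minimal sketch of that post-processing, assuming columns named “EC_number” and “n_hits” (these column names are assumptions; check the header of your file against the paprica manual):

```python
import csv
from io import StringIO

def top_ec_numbers(csv_text, ec_col="EC_number", count_col="n_hits", n=5):
    """Return the n most abundant EC numbers from an annotation table.
    Column names are assumptions; check your file's header."""
    reader = csv.DictReader(StringIO(csv_text))
    rows = [(row[ec_col], int(row[count_col])) for row in reader]
    return sorted(rows, key=lambda r: r[1], reverse=True)[:n]

# toy table standing in for test.annotation.csv
toy = "EC_number,n_hits\n1.1.1.1,10\n2.7.1.1,25\n4.2.1.2,5\n"
print(top_ec_numbers(toy, n=2))
```

In practice you would read the real file with open('test.annotation.csv') rather than a string.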
Predicting pathways on a metagenome is very time intensive, and it isn’t clear what the “correct” way to do this is. I’ve tried to balance speed with accuracy in paprica-mg. If you execute with -pathways T, DIAMOND is executed twice: once for the EC number annotation as above (reporting only a single hit for each read), and once to collect data for pathway prediction. On that second search DIAMOND reports as many hits for each read as there are genomes in the paprica database. Of course most reads will have far fewer (if any) hits. The reason for this approach is to reconstruct, as best as possible, the actual genomes that are present. For example, let’s say that a given read has a significant hit to an enzyme in genome A and genome B. When aggregating information for pathway-tools, the enzyme in genome A and the enzyme in genome B will be presented to pathway-tools in separate Genbank files representing separate genetic elements. A second enzyme, critical to the prediction of a given pathway, might get predicted for only genome A or genome B; because a missing enzyme in either genome could cause a negative prediction for the pathway, crediting the shared enzyme to both genomes gives us the best chance of capturing the whole pathway. The idea is that the incomplete pathways will get washed out at the end of the analysis, and since pathway prediction is by its nature non-redundant (each pathway can only be predicted once) over-prediction is minimized. To predict pathways during annotation:

## execute paprica-mg for EC annotation and pathway prediction
paprica-mg_run.py -i ERR318619_1.qc.fasta.gz -o test -ref_dir ref_genome_database -pathways T -pgdb_dir /location/of/ptools-local
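The aggregation logic described above can be sketched as follows. This is an illustrative simplification, not paprica-mg’s actual internals: each hit is a (read, genome, enzyme) tuple, and a read that hits the same enzyme in several genomes contributes that enzyme to each genome, so no genome loses a pathway for want of an enzyme the read supports:

```python
from collections import defaultdict

def enzymes_by_genome(hits):
    """Group enzyme (EC number) hits by genome. Each hit is a
    (read_id, genome, ec_number) tuple; a read hitting the same
    enzyme in genomes A and B is credited to both."""
    genomes = defaultdict(set)
    for read_id, genome, ec in hits:
        genomes[genome].add(ec)
    return dict(genomes)

# toy example: read1 hits the same enzyme in genomes A and B
hits = [("read1", "A", "1.1.1.1"),
        ("read1", "B", "1.1.1.1"),
        ("read2", "A", "2.7.1.1")]
print(enzymes_by_genome(hits))
```

In this toy case genome A ends up with both enzymes while genome B has only the shared one; if the pathway needs both, it will be predicted for A only, and the incomplete set in B contributes nothing extra because each pathway is predicted at most once.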
In addition to the files already mentioned, you will see:
test_mg.pathologic: a directory containing all the information that pathway-tools needs for pathway prediction.
test.pathways.txt: A simple text file of all the pathways that were predicted.
test.paprica-mg.txt: A very large text file of all hits for each read. You probably want to delete this right away to save space.
test.paprica-mg.daa: A very large DIAMOND results file of all hits for each read. You probably want to delete this right away to save space.
testcyc: A directory in ptools-local/pgdbs/local containing the PGDB and prediction reports. It is worth spending some time here, and interacting with the PGDB using the pathway-tools GUI.
Read Sidney Hemming’s first post to learn more about the goals of her two-month research cruise off southern Africa and its focus on the Agulhas Current and collecting climate records for the past 5 million years.
We reached our last site yesterday morning, off Cape Town, South Africa, and the first core was on deck at 11:15 a.m. It was pretty stiff at the bottom, and its microfossils indicated it was more than 250,000 years old at 1 meter. We decided to start again, and ended up with 6 meters in the first core of the B hole, with further indications of a very low sediment accumulation rate of approximately 1.5 cm per thousand years. The next few cores gave us trouble with shattered liners and low recovery, so Ian and I went back to the data from the alternative sites to consider moving. Luckily we decided not to, because things really started looking up.
We just completed the first hole with 300 meters of sediment and a base age of more than 7 million years, and with a quite pleasing accumulation rate below the very top part. We still don’t have quite enough information to evaluate the situation in the upper 1 million years, but it seems very clear that the rest of the site will be excellent. The sediment composition is very similar from top to bottom and very rich in carbonate (so-called nannofossil ooze). The gamma ray log (which measures radioactivity and thus is a sensitive indicator of clay) and color measurements give a very nice signal and are varying in concert with each other. The weather has gotten nicer since the beginning of the first hole, and we are hoping the conditions hold and that the sea conditions were the reason for the trouble at the beginning of our first hole. Meanwhile, we have just enough time to complete the triple coring of this site back to 7 million years, with maybe enough time for logging of the final hole.
So, the good fortune continues. Each site on this cruise has provided real prize material, and the team members are very eager to get started on the work back at home. We have been burning the midnight oil (or midday, depending on your shift), meeting about the various plans for post-cruise science. There remain a couple of conflicts to resolve, but overall it looks like there will be plenty of great science for each participant, and plenty of opportunities to develop career-long collaborations.
It has been a great privilege to be part of this, and it really makes you realize how powerful these huge efforts are, requiring as they do the cooperation of so many countries and their scientists. It is a very different way of doing science, and not always convenient for the individual, but overall the benefits are huge.
Meanwhile, this is my last post for this cruise. We are less than a week from arriving at our dock in Cape Town, and there is no question that we are all quite eager to get there. The JOIDES Resolution is amazing and the (multiple) staffs of the ship company, catering service, and IODP are truly remarkable. They are friendly, professional and very eager to help us to get the best we can out of this amazing scientific discovery process.
Sidney Hemming is a geochemist and professor of Earth and Environmental Sciences at Lamont-Doherty Earth Observatory. She uses the records in sediments and sedimentary rocks to document aspects of Earth’s history.
I’m happy to report that one of the appendices in my dissertation was just published in the journal Polar Biology. The paper, titled Wind-driven distribution of bacteria in coastal Antarctica: evidence from the Ross Sea region, was a long time in coming. I conceived of the idea back in 2010 when it looked like my dissertation would focus on the microbial ecology of frost flowers: delicate, highly saline, and microbially enriched structures found on the surface of newly formed sea ice. Because marine bacteria are concentrated in frost flowers we wondered whether they might serve as source points for microbial dispersal. This isn’t as far-fetched as it might seem; bacteria are injected into the atmosphere through a variety of physical processes, from wind lofting to bubble-bursting, and frost flowers have been implicated as the source of wind-deposited sea salts in glaciers far from the coast.
At the time we’d been struggling to reliably sample frost flowers at our field site near Barrow, Alaska. Frost flowers form readily there throughout the winter, but extremely difficult sea ice conditions make it hard to access the formation sites. We knew that there were more accessible formation sites in the coastal Antarctic, so we initiated a one-year pilot project to sample frost flowers from McMurdo Sound. By comparing the bacterial communities in frost flowers, seawater, sea ice, terrestrial snow, and glaciers, we hoped to show that frost flowers were a plausible source of marine bacteria and marine genetic material to the terrestrial environment. Because the coastal Antarctic contains many relic marine environments, such as the lakes of the Dry Valleys, the wind-driven transport of bacteria from frost flowers and other marine sources could be important for continued connectivity between relic and extant marine environments.
Frost flowers are readily accessible in McMurdo Sound throughout the winter; however, this does not mean that one can simply head out and sample them. While the ice conditions are far more permissive than at Barrow, Alaska, the bureaucracy is also far more formidable. The can-do attitude of our Inupiat guides in Barrow (who perceive every far-out field plan as a personal challenge) was replaced with the inevitable can’t-do attitude at McMurdo (this was 2011, under the Raytheon Antarctic Support Contract, and does not reflect on the current Lockheed Antarctic Support Contract, not to suggest that this attitude doesn’t persist). Arriving in late August we were initially informed that our plan was much too risky without helicopter support, and that nothing could be done until mid-October when the helicopters began flying (we were scheduled to depart late October). Pushing for a field plan that relied on ground transport ensnared us in various catch-22s, such as (paraphrased from an actual conversation):
ASC representative: You can’t take a tracked vehicle to the ice edge, they’re too slow.
Me: Can we take a snowmobile to the ice edge? That would be faster. We do long mid-winter trips in the Arctic and it works out fine.
ASC representative: No, because you have to wear a helmet, and the helmets give you frostbite. So you can only use a snowmobile when it’s warm out.
Ultimately we did access the ice edge by vehicle several times before the helicopters started flying, but the samples reported in this publication all came from a furious two week period in late October. What we found really surprised us.
There is ample evidence for the wind-driven transport of bacteria in this region but, contrary to our hypothesis, most of that material is coming from the terrestrial environment. The major transportees were a freshwater cyanobacterium from the genus Pseudanabaena and a set of sulfur-oxidizing Gammaproteobacteria (GSO). The cyanobacterium was pretty easy to understand; it forms mats in a number of freshwater lakes and meltponds in the region. In the winter these freeze, and since snow cover is low, ice and microbial mats are ablated by strong winter winds. Little pieces of mat are efficiently scattered all over, including onto the sea ice surface.
The GSO threw us for more of a loop; the most parsimonious explanation for their occurrence in frost flowers is that they came from hydrothermal features on nearby Mt. Erebus. We did some nice analysis with wind vectors in the region, and while you don’t get a lot of wind (integrated over time) to move material from Mt. Erebus to our sample sites, you do get some occasional very strong storms.
What all this means is that, consistent with other recent findings, there is high regional dispersal of microbes around the coastal Antarctic. While I’m sure there are some endemic microbes occupying particularly unique niches, in general I expect microbes found in one part of the coastal Antarctic to be present in a similar environment in a different part of the coastal Antarctic. There are, however, quite a few ways to interpret this. Bacteria and Archaea can evolve very fast, so the genome of a clonal population of (for example) wind-deposited Pseudanabaena newly colonizing a melt pond can diverge pretty fast from the genome of the parent population. This has a couple of implications. First, it means that the coastal Antarctic, with all its complex topography yet high degree of microbial connectivity, is an excellent place to explore the dynamics of microbial adaptation and evolution, particularly if we can put constraints on the colonization timeline for a given site (non-trivial). Second, it raises some questions about the propriety of commercially relevant microbes obtained from the continent. The commercialization of the continent is probably inevitable (I hope it is not), but perhaps the potential ubiquity of Antarctic microbes will provide some defense against the monopolization of useful strains, enzymes, and genes.