By Marie DeNoia Aronsohn
Researchers at Columbia University’s Lamont-Doherty Earth Observatory (LDEO) and their colleagues have just won a $1.2 million grant from the National Science Foundation. In collaboration with the National Center for Atmospheric Research (NCAR) and Anaconda, a private software company and a leader in the emerging field of data science, the team will develop an innovative software platform called “Pangeo,” which aims to help climate scientists confront the challenges of Big Data.
“Pangeo: An Open Source Big Data Climate Science Platform” is a project designed to solve one of climate science’s most pressing challenges: coping with the explosive growth in the size of climate datasets, which have become a bulky but indispensable tool for scientific inquiry in climate change research. Earth system models (ESMs) are numerical simulations of the interactions between the ocean, atmosphere, land, cryosphere, and biosphere. They are high-performance applications that run on supercomputing clusters, which are sets of connected computers working together. Over the past decades, the computing power of these clusters has increased exponentially, and ESMs have kept pace by growing in complexity and size. At the same time, there has been little change in how ESM output data is processed, analyzed, and visualized, leading to a paradoxical crisis in climate science: the better the models become, the harder it is to use them for research.
The Pangeo project is a community-driven effort that emerged out of a workshop held at Columbia University in November 2016. The goal of Pangeo is to integrate a suite of open-source software packages in the Python programming language, including Xarray and Dask, to produce a comprehensive toolkit for analysis of climate datasets. This toolkit will enhance the Data Science aspect of EarthCube, an NSF program designed to develop cyberinfrastructure to meet the current and future requirements of geoscientists.
A visualization of ocean currents from a global simulation run at 1 km spatial resolution. The rich, detailed structure in such simulations can help scientists better understand ocean physics, but the datasets they produce can be enormous. This model run produced more than 2 petabytes (2 million gigabytes) of data. Credit: NASA/JPL-Caltech
The need for a new, integrated approach to ESM data has never been more urgent, said Lamont physical oceanographer Ryan Abernathey, leader of the Pangeo project. “Earth’s climate system is experiencing unprecedented change as anthropogenic greenhouse emissions continue to perturb the global energy balance. Understanding and forecasting the nature of this change, and its impact on human welfare, is both a profound scientific challenge and an urgent societal problem,” said Abernathey. “While other scientific fields are able to conduct laboratory experiments to better understand complex phenomena, climate scientists only have one planet to observe. That’s why climate simulations, and the data they produce, are growing ever more vital to our planet.”
NCAR postdoctoral researcher and computational hydroclimatologist Joe Hamman has been developing open-source software for data analysis and is a collaborator on the Pangeo project.
“Climate science is in desperate need for data analysis tools that can scale to the size of our ever-expanding datasets,” said Hamman. “Our challenge with the Pangeo project is to build out a platform of open-source software tools that address this need, and then to demonstrate to the geoscience community that these tools facilitate rapid scientific exploration of datasets.”
LDEO, NCAR, and Anaconda each bring distinct expertise to Pangeo. LDEO, a member of the Earth Institute, represents the science users capable of applying ESMs to study extreme weather, the water cycle, ocean turbulence, and more. NCAR has a mission to support and enhance the technological capabilities of the community and is one of the leading data repositories and computational facilities for climate science in the world. Anaconda (formerly Continuum Analytics) produces the world’s leading Python data science platform and supports the development of open-source scientific computing software.
“We’re excited to be part of this collaboration, and proud to do our part in improving our understanding of our planet’s atmosphere and oceans,” said Travis Oliphant, co-founder and chief data scientist at Anaconda. “Anaconda has been a long-time supporter of scientific discovery and we’re happy to help researchers leverage data science and take their analyses to a new scale.”
Together, the three institutions will develop software that helps scientists access and more easily analyze immense datasets, and will apply these tools to important problems in climate science.
The emerging importance of Big Data in scientific research is a major theme for LDEO’s Real Time Earth strategic initiative and Columbia’s Data Science Institute. Both Abernathey and Pangeo collaborator Gavin Schmidt are members of the Data Science Institute and participate in the institute’s working group on Frontiers in Computing Systems.
“The fact is that understanding the earth system is fundamentally now a Big Data problem,” said Schmidt, director of the NASA Goddard Institute for Space Studies. “There is a tremendous amount of information that is currently unavailable to us because of the limitations of bandwidth, storage, and computing capacity. Pangeo is going to be working hard to change that.”