Earth Science Data in the Cloud: The EOSDIS Cumulus Project
As part of the ongoing evolution of EOSDIS data and services, testing and prototyping are underway to see how DAAC data collections can be archived and disseminated using the commercial cloud.
Josh Blumenfeld, EOSDIS Science Writer
Clouds in the sky constantly grow and shrink as they adjust to evolving atmospheric conditions. A cloud computing environment, like an atmospheric cloud, can also adjust easily to evolving conditions, expanding or contracting as needed based on data storage requirements and the needs of data users. This flexibility helps make the commercial cloud a viable option for archiving and disseminating large volumes of data or for managing data holdings that are expected to change rapidly over a short period of time.
NASA’s Earth Observing System Data and Information System (EOSDIS) is responsible for a data collection that is both large in volume and projected to grow rapidly over the next several years. From its current size of almost 22 petabytes (PB), the volume of data in the EOSDIS archive is expected to increase to almost 247 PB by 2025, according to estimates by NASA’s Earth Science Data Systems (ESDS) Program.
To prepare for this tremendous growth and efficiently provide access to these data, the EOSDIS is investigating the evolution of its data and services to run in the commercial cloud. As part of these efforts, staff at NASA’s Earth Science Data and Information System (ESDIS) Project (which manages EOSDIS data) are prototyping and testing how EOSDIS data collections can be archived collectively and disseminated in the cloud. Befitting the cloud environment, this prototype is called Cumulus.
A primary feature of Cumulus is a cloud-based framework for data ingest, archive, distribution, and management, which are the primary activities of the discipline-specific EOSDIS Distributed Active Archive Centers (DAACs). The overall Cumulus goal is to provide the following functionality in the commercial cloud:
- Data acquisition from data providers (such as NASA science teams),
- Data ingest (including validation and processing),
- The harvest, creation, and publication of dataset metadata to the EOSDIS Common Metadata Repository (CMR),
- The storage and distribution of data, including disaster recovery, and
- Publication of metrics to the ESDIS Metrics System (EMS), which collects and organizes various metrics from the DAACs and other data providers.
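The five functions listed above can be pictured as stages that each data file (granule) passes through in order. The following is only an illustrative sketch of that flow, not Cumulus code: the `Granule` class, the function names, and the in-memory `catalog`, `store`, and `metrics` objects are all hypothetical stand-ins for the CMR, cloud storage, and the EMS.

```python
from dataclasses import dataclass, field


@dataclass
class Granule:
    """One data file moving through the pipeline (hypothetical structure)."""
    name: str
    size_bytes: int
    metadata: dict = field(default_factory=dict)
    archived: bool = False


def ingest(granule):
    """Validate an acquired granule; real validation is collection-specific."""
    if granule.size_bytes <= 0:
        raise ValueError(f"{granule.name}: empty file")
    return granule


def publish_metadata(granule, catalog):
    """Record searchable metadata; `catalog` stands in for the CMR."""
    granule.metadata = {"name": granule.name, "size_bytes": granule.size_bytes}
    catalog.append(granule.metadata)
    return granule


def archive(granule, store):
    """Store the granule; `store` stands in for cloud object storage."""
    store[granule.name] = granule
    granule.archived = True
    return granule


def report_metrics(granule, metrics):
    """Tally granules and bytes; `metrics` stands in for the EMS."""
    metrics["granules"] = metrics.get("granules", 0) + 1
    metrics["bytes"] = metrics.get("bytes", 0) + granule.size_bytes
    return granule


def run_pipeline(granules):
    """Pass each acquired granule through every stage in order."""
    catalog, store, metrics = [], {}, {}
    for granule in granules:
        granule = ingest(granule)
        granule = publish_metadata(granule, catalog)
        granule = archive(granule, store)
        report_metrics(granule, metrics)
    return catalog, store, metrics
```

Calling `run_pipeline([Granule("example_a.hdf", 1024), Granule("example_b.hdf", 2048)])` with these made-up file names would return a two-entry catalog, a store holding both granules, and metrics counting 2 granules and 3072 bytes.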
The DAACs would still serve as gateways to EOSDIS Earth science data and continue to provide a wide range of support services for data users. EOSDIS data users would likely not even notice any difference in interactions with their discipline-specific DAACs and Earthdata Search when searching for and downloading data that happen to be stored in the cloud.
Selected EOSDIS data and services are already operating in Amazon Web Services (AWS), which is currently the only NASA-approved commercial cloud provider. Earthdata Search and the CMR evolved to the cloud in September 2016 and April 2017, respectively. Global Imagery Browse Services (GIBS), which provides access to over 400 satellite imagery products that can be viewed using client applications such as EOSDIS Worldview, is expected to evolve to the cloud as a prototype starting in 2018. The next step in this evolution is to prototype and test EOSDIS data collections in the cloud. This is an important undertaking given the expected significant growth of the EOSDIS archive.
Between 2015 and 2017, the volume of data in the EOSDIS archive more than doubled, from roughly 10 PB to almost 22 PB (see graphic). This dramatic growth in the EOSDIS data archive volume is expected to continue over the next several years. The ESDIS Project believes that having these data archived and disseminated in the commercial cloud provides the EOSDIS with a cost-effective, flexible, and scalable data system that can keep pace with mission advancements and capabilities. The cloud also gives EOSDIS data users the ability to efficiently access and process significantly larger data volumes from multiple DAACs and do more with these data without having to download large data collections (including the ability to work with data directly in the cloud). Other benefits of having EOSDIS data in the cloud are described in detail on the EOSDIS Cloud Evolution page from the Earthdata website.
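To put the growth figures quoted in this article in perspective, the short calculation below converts them into implied average annual growth rates. It is a back-of-the-envelope sketch, not an ESDS estimate: it assumes smooth compound growth, and it assumes 2017 (the year of this article) as the starting point for the projection to almost 247 PB by 2025. The function name is illustrative.

```python
def annual_growth_rate(start_pb, end_pb, years):
    """Average compound annual growth rate between two archive sizes."""
    return (end_pb / start_pb) ** (1 / years) - 1


# Observed: roughly 10 PB (2015) to almost 22 PB (2017), i.e. 2 years.
observed = annual_growth_rate(10, 22, 2)

# Projected: almost 22 PB to almost 247 PB by 2025.
# Assumption: the 22 PB baseline is dated 2017, giving 8 years.
projected = annual_growth_rate(22, 247, 8)

print(f"Observed 2015-2017:  {observed:.0%} per year")
print(f"Projected 2017-2025: {projected:.0%} per year")
```

Under these assumptions, the archive grew by roughly 48 percent per year between 2015 and 2017, and the 2025 projection implies sustained growth of roughly 35 percent per year, which is more than an elevenfold increase in total volume.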
Along with testing data storage and dissemination in the cloud, ESDIS Project staff are working with the DAACs to develop compatible tools and libraries that can be used across DAACs and with multiple data collections (such as widely applicable geographic information system [GIS] components or sub-setters) and to enable discipline-specific data customizations (such as the ability to work with specific subsets of data). If prototypes developed as part of Cumulus are successful, an entire DAAC could be running in the cloud by 2019 or 2020.
Of course, not every DAAC or DAAC data collection may be appropriate for evolution to the cloud, and DAAC functions and data products are being reviewed to determine which collections might work best in the cloud. This includes an evaluation of current DAAC data volume size and growth along with DAAC data distribution characteristics.
In addition to EOSDIS data products, ESDIS Project Cumulus team members will continue testing dataset-specific tools and applications in the cloud. These tools and applications will be migrated to the cloud on a case-by-case basis. The overall objective is to containerize specific tools and applications to run in the commercial cloud, especially those that can be used across multiple DAACs and with multiple datasets.
While Cumulus is still in the testing and prototyping phase with individual DAACs and selected data, these efforts will create a firm foundation for managing the expected significant growth of the EOSDIS archive and provide options for efficiently disseminating these data. More importantly, EOSDIS cloud efforts like Cumulus ensure that data users will continue to be able to maximize both their research using NASA Earth science data and the amount of data with which they can work.
Last Updated: Dec 7, 2017 at 2:09 PM EST