Community Tools for Analysis of NASA Earth Observation System Data in the Cloud
Principal Investigator (PI): Anthony Arendt, University Of Washington, Seattle
Co-Investigators (Co-PIs): Joe Hamman, University Corporation For Atmospheric Research; Dan Pilone, Element 84, Inc.
Data-intensive scientific workflows are at a pivotal stage in which traditional local computing resources can no longer meet the storage or computing demands of scientists. In the Earth System Sciences (ESS) community, we are facing an explosion of data volumes: new datasets, sourced from models, in-situ observations, and remote sensing platforms, are being made available at volumes too large to store even at medium to large High Performance Computing (HPC) centers. NASA has estimated that by 2025 it will be storing upwards of 250 Petabytes (PB) of its data using commercial cloud services (e.g., Amazon Web Services (AWS)). Availability of these data in cloud environments, co-located with a wide range of computing resources, will revolutionize how scientists use these datasets and provide opportunities for important scientific advancements. Fully leveraging these opportunities will require new approaches in the way the ESS community handles data access, processing, and analysis.
This project will facilitate the ESS community's transition into cloud computing by developing technologies that build on existing open-source tools (e.g., Python, Jupyter) and integrate with the growing Pangeo ecosystem.
These technologies will be deployable on commercial cloud infrastructure where data from NASA's Earth Observing System Data and Information System (EOSDIS) are anticipated to be stored. At present, tools for working with these datasets consist of convenient interfaces for discovering and downloading data (e.g., Earthdata Search) from individual Distributed Active Archive Centers (DAACs). We anticipate that the transition to cloud storage for many of these DAACs will bring immense opportunities and specific challenges to researchers.
Our first task is to deploy a scalable cloud-based JupyterHub on AWS for community use. JupyterHub is a multi-user, multi-language interactive computing environment that facilitates open-ended, exploratory analysis and data visualization. Content ("notebooks") developed on JupyterHub is both functional and fluid in the manner of an "executable paper" combining data, processing, and interpretation, a necessary departure from traditional publication as a sequence of static artifacts.
Our second task is to integrate existing NASA data discovery tools with cloud-based data access protocols. Existing data discovery tools, such as the Common Metadata Repository (CMR) and Global Imagery Browse Services (GIBS), provide convenient access to dataset metadata, but navigating the access, retrieval, and processing steps for these datasets is left to individual users. We are developing an advanced Python application programming interface (API) that leverages high-level tools like Xarray and Dask, allowing scientists to accelerate their analysis. Integrating this API with the Pangeo ecosystem gives its users access to cutting-edge scientific tools for pre-processing, regridding, machine learning, and visualization.
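As a rough illustration of the discovery step, the sketch below builds a granule search URL for CMR's public search endpoint using only the Python standard library. It is not the project's API: the function name `build_granule_query`, the example product short name, and the date range and bounding box are assumptions chosen for illustration, and no network request is made.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Public CMR granule search endpoint (JSON response format).
CMR_GRANULE_SEARCH = "https://cmr.earthdata.nasa.gov/search/granules.json"


def build_granule_query(short_name, start, end, bbox, page_size=10):
    """Return a CMR granule search URL (hypothetical helper, not the project's API).

    bbox is (west, south, east, north) in degrees; start and end are
    ISO 8601 timestamps.
    """
    params = {
        "short_name": short_name,
        "temporal": f"{start},{end}",
        "bounding_box": ",".join(str(v) for v in bbox),
        "page_size": page_size,
    }
    return f"{CMR_GRANULE_SEARCH}?{urlencode(params)}"


# Illustrative query: an ICESat-2 product over an arbitrary box and week.
url = build_granule_query(
    "ATL06",
    "2019-06-17T00:00:00Z",
    "2019-06-21T23:59:59Z",
    (-125.0, 45.0, -116.0, 49.0),
)

# Decode the query string again to inspect what would be sent.
params = parse_qs(urlsplit(url).query)
print(params["bounding_box"][0])
```

The returned URL could be passed to any HTTP client; turning the resulting granule metadata into Xarray objects is the part the project's API is meant to handle and is not shown here.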
Our third task is to use our advanced API for data discovery and processing to provide a cloud-optimized framework for remote data retrieval. Our approach goes beyond simple slice-and-download operations (e.g., Open-source Project for a Network Data Access Protocol (OPeNDAP)): it draws on the API's data discovery, access, and processing capabilities to also provide basic server-side processing.
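The difference between slice-and-download and server-side processing can be sketched with a toy in-memory archive. Everything here is hypothetical (the grid, the values, and the function names `subset` and `subset_time_mean`); it only shows why reducing data next to the archive shrinks what must be transferred, and it does not represent the project's retrieval framework.

```python
from statistics import mean

# Hypothetical toy archive: VALUES[t][i][j] on a small lat/lon grid.
LATS = [40.0, 41.0, 42.0, 43.0]
LONS = [-110.0, -109.0, -108.0]
VALUES = [
    [[float(t + i + j) for j in range(len(LONS))] for i in range(len(LATS))]
    for t in range(5)
]


def subset(lat_range, lon_range):
    """Slice-and-download style: return every time step for the region."""
    rows = [i for i, lat in enumerate(LATS) if lat_range[0] <= lat <= lat_range[1]]
    cols = [j for j, lon in enumerate(LONS) if lon_range[0] <= lon <= lon_range[1]]
    return [[[step[i][j] for j in cols] for i in rows] for step in VALUES]


def subset_time_mean(lat_range, lon_range):
    """Server-side processing: average over time before returning anything."""
    cube = subset(lat_range, lon_range)
    n_rows, n_cols = len(cube[0]), len(cube[0][0])
    return [
        [mean(step[i][j] for step in cube) for j in range(n_cols)]
        for i in range(n_rows)
    ]


region = ((41.0, 42.0), (-110.0, -109.0))
raw = subset(*region)
reduced = subset_time_mean(*region)

# Count the values each approach would send back to the user.
raw_count = sum(len(row) for step in raw for row in step)
reduced_count = sum(len(row) for row in reduced)
print(raw_count, reduced_count)
```

In this toy case the reduced response carries one value per grid cell instead of one per grid cell per time step, which is the kind of saving server-side processing is intended to provide.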
We demonstrate the use of these tools with several datasets, including the North American Land Data Assimilation System (NLDAS), Gravity Recovery and Climate Experiment (GRACE), and Sentinel-1 synthetic aperture radar (SAR). The example applications serve both as templates for the broader community and as real-world test cases for evaluating the cloud services and applications we develop.
The project will help accelerate a shift in the ESS culture toward cloud computing by offering short but intensive training opportunities that give scientists new ways to collaborate and make full use of NASA satellite datasets.
Update October 2019
The project has made steady progress towards the goal of facilitating the Geoscience community's transition into cloud computing by building on top of the growing Pangeo ecosystem.
During the first year, the project:
- Developed a Python-based API to access NASA metadata repositories.
- Developed a Python-based API to access cloud-hosted NASA datasets.
- Integrated Python APIs with the Pangeo ecosystem.
- Configured and deployed JupyterHub on cloud resources.
- Developed the ability to run multiple JupyterHubs on AWS.
- Configured and deployed Dask on cloud resources.
- Made open-source contributions to Xarray, Dask, Zarr, GDAL, Rasterio, Jupyter (Jupyter, Hubploy, BinderHub, etc.), data catalogs (e.g., CMR, STAC, Intake, Intake-STAC), and the Pangeo infrastructure (pangeo-stacks, pangeo-datastore, pangeo-cloud-federation).
- Provided outreach and training and participated in the ongoing outreach efforts of the Pangeo project. The Pangeo JupyterHub was deployed on AWS to support several hackweeks offered by the UW eScience Institute and the Applied Physics Laboratory. These included the Cryospheric Sciences with ICESat-2 hackweek on June 17-21, 2019, and Geohackweek on Sept 9-13, 2019. Both events were ideal environments to test the scalability of the JupyterHub infrastructure with more than 50 simultaneous users.
- Developed documentation and interactive examples as well as scientific use cases that combine NASA data stored on the Cloud with the Pangeo software stack.
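The Xarray and Dask pattern behind several of the items above can be sketched as follows. This is a minimal example under stated assumptions: it uses a small synthetic array in place of a cloud-hosted NASA dataset, the variable name `precip` and the grid are invented, and it requires the `numpy`, `xarray`, and `dask` packages to be installed.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a gridded dataset (values are made up).
time = np.arange(24)
lat = np.linspace(25.0, 53.0, 8)
lon = np.linspace(-125.0, -67.0, 16)
data = np.arange(24 * 8 * 16, dtype="float64").reshape(24, 8, 16)

ds = xr.Dataset(
    {"precip": (("time", "lat", "lon"), data)},
    coords={"time": time, "lat": lat, "lon": lon},
)

# Chunking along time backs the variable with a Dask array of four blocks.
chunked = ds.chunk({"time": 6})

# Building the expression is lazy: nothing is reduced yet.
lazy_mean = chunked["precip"].mean(dim="time")

# compute() runs the task graph and returns an in-memory result.
time_mean = lazy_mean.compute()
print(time_mean.dims, time_mean.shape)
```

With a Dask cluster attached, the same expression would be evaluated in parallel across workers; on a single machine it runs with Dask's default local scheduler.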
Abernathey, R., Hamman, J., Heagy, L., Fernandes, F. & Hoese, D. (In press). Open Source Frameworks for Earth Data Science are Blossoming. EOS. In revision.
Eynard-Bontemps, G., Abernathey, R., Hamman, J., Ponte, A., & Rath, W. (2019). The Pangeo Big Data Ecosystem and its use at CNES. In P. Soille, S. Loekken, and S. Albani, Proc. of the 2019 conference on Big Data from Space (BiDS’2019), 49-52. EUR 29660 EN, Publications Office of the European Union, Luxembourg. doi: 10.2760/848593.
Hamman, J., Robinson, N. & Abernathey, R. (2019). Science needs to rethink how it interacts with big data. Nature, In revision.
Abernathey, R., Hamman, J., & Miles, A. (2018, December). Beyond netCDF: Cloud Native Climate Data with Zarr and Xarray. Abstract IN33A-06 presented at 2018 Fall Meeting, AGU, Washington, DC.
Arendt, A., Hamman, J., Rocklin, M., Tan, A., Fatland, D.R., Joughin, J., Gutmann, E.D., Setiawan, L., & Henderson, S. (2018, December). Pangeo: Community tools for analysis of Earth Science Data in the Cloud. Abstract IN54A-05 presented at 2018 Fall Meeting, AGU, Washington, DC.
Hamman, J., Arendt, A., Pilone, D., Henderson, S., Fatland, R., Tan, A., & Pawloski, A. (2019, March). Community tools for analysis of NASA Earth Observing System Data in the Cloud. Poster presented at the 2019 Earth Science Data System Working Group (ESDSWG) Meeting, Annapolis, MD.
Hamman, J., & Rocklin, M. (2018, November). Pangeo: A community-driven effort for Big Data geoscience. Seminar presented at the UK Met Office, Exeter, UK.
Hamman, J., Abernathey, R., Holdgraf, C., Panda, Y., & Rocklin, M. (2018, December). Pangeo and Binder: Scalable, shareable and reproducible scientific computing environments for the geosciences. Abstract IN53A-03 presented at 2018 Fall Meeting, AGU, Washington, DC.
Hamman, J., Abernathey, R., & Henderson S. (2018, December). Pangeo: Scalable Geoscience Tools in Python—Xarray, Dask, and Jupyter. Abstract WS22 presented at 2018 Fall Meeting, AGU, Washington, DC.
Hamman, J. (2018, October). The Pangeo ecosystem for data proximate analytics. Presented at the 2018 Workshop on developing Python frameworks for earth system sciences, European Centre for Medium-Range Weather Forecasts (ECMWF), Reading, UK.
Hamman, J., & Banihirwe, A. (2019, March). Pangeo: Scalable Geoscience Tools in Python — Xarray, Dask, and Jupyter. Clinic presented at the 2019 Community Surface Dynamics Modeling System (CSDMS) Meeting, Boulder, CO.
Hamman, J., & Abernathey, R. (2019, August). Handle "Big" Larger-than-memory Data. Clinic presented at the 2019 Oceanhackweek, Seattle, WA.
Hanson, M. (2019, August). How Open Communities are Revolutionizing Science. Keynote presented at FOSS4G 2019, Bucharest, Romania.
Henderson, S., Arendt, A., Tan, A., & Pawlowski, A. (2019, July). Using Pangeo JupyterHubs to work with large public datasets. Workshop presented at 2019 ESIP Summer Meeting, Tacoma, WA.
Henderson, S.T. (2018). Regional interferometric synthetic aperture radar (InSAR) snowpack measurements. Presented at the Boise State Geosciences Department Seminar.
Henderson, S.T. (2019). Moving satellite radar processing and analysis to the Cloud. Presented at UNAVCO ISCE Short Course.
Henderson, S.T. (2018). Scalable InSAR processing and analysis in the Cloud with applications to geohazard monitoring in the Pacific Northwest. AGU Fall Meeting.
Henderson, S.T. (2018). Towards automated Sentinel-1 satellite radar imagery classification for hazard monitoring. University of Washington Seismolunch seminar.
Henderson, S.T. (2019). Benefits of InSAR archives on the Cloud for volcano monitoring. Cascades Volcano Observatory NISAR CVO Workshop.
Henderson, S.T. (2019). Cloud Native Analysis of Earth Observation Satellite Data with Pangeo. ESIP Tech Dive Webinar.
Henderson, S.T. (2019). Pangeo: An open community platform for big data geoscience analysis and visualization. Woods Hole Oceanographic Institution.
Henderson, S.T. (2019). Pangeo: Community tools for analysis of Earth Science Data in the Cloud. Cascadia Chapter (CUGOS) of the Open Source Geospatial Foundation (OSGeo).
Henderson, S.T. (2019). Scalable, data-proximate cloud computing for Earth Science research. Moderated session at ESIP Summer Meeting.
Henderson, S.T. (2019). Synthesizing open-source software for scalable cloud computing with NASA imagery. FOSS4G-NA Conference, San Diego.
2019 Pangeo Community Meeting: Organized and hosted by team members Arendt, Hamman, Henderson, Tan, and Fatland.
2019 ESDSWG: Attended by team members Arendt and Hamman.
2019 UW Hackweeks (Ocean, Geo): Organized by team members Arendt, Tan, and Henderson and incorporating elements from this project.
Last Updated: Nov 25, 2019 at 8:24 AM EST