
Analysis In Place

Resources and tools for working with and analyzing cloud-resident data

Analysis-in-Place (AiP) is a way of working with remotely sensed data that resides in a cloud environment and doesn't typically move ("in-place" means "in the cloud"). This differs from the common practice of downloading data to your computer in order to analyze them. AiP lets researchers and programmers carry out the same analyses using cloud-native APIs and data, without having to move the data to another location. Not only does this add efficiency (new datasets can be many terabytes), but it brings the full power of scalable cloud computing to the analysis phase of the workflow.

Why Should I Use AiP?

There are many reasons why you’d want to implement an AiP workflow:

  1. Dataset and individual file size
    Some projects, datasets, and even individual files are too large to realistically download and analyze in a local environment. With new missions like NASA SWOT and NISAR, data throughput and file sizes will exceed hundreds of terabytes per year, making the search-download-analyze workflow almost impossible. AiP allows users to bring the analysis to the data, complete with specific tooling, without downloading from cloud storage.

  2. Parallel processing
    AiP workflows allow you to use parallel computing to process larger datasets in less time by employing a literal army of server cores for a short amount of time. Projects like Dask provide flexible and fast scaling using familiar Python packages.

  3. Simpler workflow
    Analyzing in place frees the user from the burden of downloading and managing the many files that make up large datasets.

  4. Dataset-specific tools
    Platforms like Pangeo offer a standardized set of tools for processing and analyzing geospatial data in the cloud, so a common workflow and toolset are emerging to make AiP easier and more accessible.
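To make the parallel-processing idea concrete, here is a toy standard-library sketch of the divide-apply-combine pattern. This is not Dask code, and the function names are our own illustration; Dask automates the same pattern over chunked arrays, spreading the work across many cores or cluster nodes instead of local threads.

```python
from concurrent.futures import ThreadPoolExecutor
import statistics

def chunk_mean(chunk):
    """Reduce one chunk of data to a partial result."""
    return sum(chunk) / len(chunk)

def parallel_mean(values, n_chunks=4):
    """Split the data into equal-sized chunks, reduce each chunk on its
    own worker, then combine the partial results -- the same
    divide-apply-combine pattern Dask automates at scale."""
    size = len(values) // n_chunks
    chunks = [values[i * size:(i + 1) * size] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partial = list(pool.map(chunk_mean, chunks))
    return statistics.mean(partial)

data = list(range(1_000_000))
print(parallel_mean(data))  # 499999.5
```

Because every chunk is independent, the workers never need to coordinate; that independence is what makes chunked datasets such a natural fit for cloud-scale parallelism.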

Getting Started

The easiest way to understand the power of an AiP workflow is to give it a try with real data. In this example, we'll take a look at sea surface temperature over the Great Lakes region using the Group for High Resolution Sea Surface Temperature (GHRSST) Level 4 Multi-scale Ultra-high Resolution (MUR) Global Foundation Sea Surface Temperature (SST) Analysis (v4.1) dataset and Zarr, via the Earth Observation System Data and Information System (EOSDIS) zarr-eosdis-store library.

You can follow along below or view the completed and runnable Jupyter Notebook. Note that this isn't a "true" AiP example unless you run the code in the cloud (an Amazon Web Services [AWS] Elastic Compute Cloud [EC2] instance in us-west-2, or a hosted notebook environment like JupyterHub). An additional example is available that discusses using AWS S3 endpoints and xarray.

You can also run the notebook file as a plain old Python script (locally or in an EC2 instance close to the data) using the nbconvert library.

SST Example

Before you start, make sure you have an Earthdata Login account and you've placed your credentials in ~/.netrc. Never commit this file to a code repository:

machine urs.earthdata.nasa.gov login YOUR_USER password YOUR_PASSWORD

Initial Dependencies

import sys

# zarr and zarr-eosdis-store, the main libraries being demoed
!{sys.executable} -m pip install zarr zarr-eosdis-store

# Notebook-specific libraries
!{sys.executable} -m pip install matplotlib

Basic usage. After these lines, we work with ds as though it were a normal Zarr dataset:

import zarr
from eosdis_store import EosdisStore

url = 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20210715090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc'

ds = zarr.open(EosdisStore(url))

View the file's variable structure

 ├── analysed_sst (1, 17999, 36000) int16
 ├── analysis_error (1, 17999, 36000) int16
 ├── dt_1km_data (1, 17999, 36000) int16
 ├── lat (17999,) float32
 ├── lon (36000,) float32
 ├── mask (1, 17999, 36000) int16
 ├── sea_ice_fraction (1, 17999, 36000) int16
 ├── sst_anomaly (1, 17999, 36000) int16
 └── time (1,) int32

Fetch the latitude and longitude arrays and determine start and end indices for our area of interest

In this case, we're looking at the Great Lakes, which have a nice, recognizable shape. Latitudes 41 to 49, longitudes -93 to -76.

lats = ds['lat'][:]
lons = ds['lon'][:]
lat_range = slice(lats.searchsorted(41), lats.searchsorted(49))
lon_range = slice(lons.searchsorted(-93), lons.searchsorted(-76))
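To see how searchsorted maps coordinate values to array indices, here is a small self-contained NumPy illustration using a toy 0.5-degree latitude grid (the real MUR grid is 0.01-degree, but searchsorted behaves the same way):

```python
import numpy as np

# Toy sorted latitude grid standing in for the file's 'lat' array
lats = np.arange(40.0, 50.0, 0.5)

# searchsorted returns the insertion index that keeps the array sorted,
# i.e. the index of the first grid point >= the requested coordinate
start = lats.searchsorted(41)  # index 2  (lats[2] == 41.0)
end = lats.searchsorted(49)    # index 18 (lats[18] == 49.0)

subset = lats[slice(start, end)]
print(start, end, subset[0], subset[-1])  # 2 18 41.0 48.5
```

Because only the two small coordinate arrays are fetched, the indices for the region of interest are found without touching the large data variables at all.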

Get the analyzed sea surface temperature variable over our area of interest and apply scale factor and offset from the file metadata

In a future release, scale factor and add offset will be automatically applied.

var = ds['analysed_sst']
analysed_sst = var[0, lat_range, lon_range] * var.attrs['scale_factor'] + var.attrs['add_offset']
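The unpacking step can be illustrated with toy values. The raw data are stored as int16, and the CF-style scale_factor and add_offset attributes convert them to kelvins; the specific numbers below are illustrative assumptions, not necessarily the file's actual metadata:

```python
import numpy as np

# Packed int16 values as they might appear on disk (illustrative)
raw = np.array([1850, 2000, 2150], dtype=np.int16)

# Illustrative CF-style packing attributes; in the real workflow these
# come from the variable's metadata (var.attrs)
scale_factor = 0.001
add_offset = 298.15

# unpacked = raw * scale_factor + add_offset, giving kelvins
sst_kelvin = raw * scale_factor + add_offset
sst_celsius = sst_kelvin - 273.15
print(np.round(sst_celsius, 2))  # approximately [26.85 27. 27.15] degrees C
```

Packing the data this way halves the storage of each value (2 bytes instead of 4), which is why the subset fetched over the network stays so small.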

Draw a pretty picture

from matplotlib import pyplot as plt

plt.rcParams["figure.figsize"] = [16, 8]
plt.imshow(analysed_sst[::-1, :])

[Figure: analysed sea surface temperature plotted over the Great Lakes region]

In a dozen lines of code and a few seconds, we've managed to fetch and visualize the 3.2 MB we needed from a 732 MB file, using the original archive URL and no processing services.
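Some rough arithmetic, consistent with that 3.2 MB figure, shows why the subset is so small (chunk boundaries and format overhead account for the remaining difference):

```python
# Back-of-the-envelope size of the subset: the MUR grid is 0.01 degree,
# and analysed_sst is stored as 2-byte int16 values
lat_span = 49 - 41            # 8 degrees of latitude
lon_span = -76 - (-93)        # 17 degrees of longitude
rows = round(lat_span / 0.01)   # ~800 grid rows
cols = round(lon_span / 0.01)   # ~1700 grid columns
subset_mb = rows * cols * 2 / 1e6
print(f"{subset_mb:.2f} MB")  # 2.72 MB of raw values
```

That is roughly 0.4% of the 732 MB file, which is the whole point of AiP: the bytes you don't need never cross the network.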

More Use Case Examples

Data Archive                                                              | Use Case                                        | Stack
--------------------------------------------------------------------------|-------------------------------------------------|------------------
Alaska Satellite Facility Distributed Active Archive Center (ASF DAAC)    | Vertex RTC On-Demand Processing                 | Vertex On-Demand
ASF DAAC                                                                  | Open SARlab Change Detection                    | Jupyter Notebook
Land Processes DAAC (LP DAAC)                                             | Landsat 8 on AWS                                | Pangeo
Physical Oceanography DAAC (PO.DAAC)                                      | MUR SST collection in AWS registry of open data | Pangeo

Additional Resources

Page Last Updated: Dec 14, 2021 at 11:25 AM EST