Hyrax Appendix E: Aggregation

Hyrax Data Server Installation and Configuration Guide

Appendices

Appendix E: Aggregation

Often it is desirable to treat a collection of data files as if it were a single dataset. Hyrax provides two different ways to do this: it enables data providers to define aggregations of files so that they appear as a single dataset, and it provides a way for users to send the server a list of files along with processing operations and receive a single amalgamated response.

In the first half of this appendix, we discuss aggregations defined by data providers. These aggregations use a simple markup language called NcML, first defined by Unidata as a way to work with NetCDF files. Both Hyrax and the THREDDS Data Server use NcML to describe how data files can be combined into aggregated datasets. In the second part of this appendix, we discuss user-specified aggregations. These aggregations currently use a new interface to the Hyrax server.

11.E.1. The NcML Module
Introduction
Note: In the past, Hyrax was distributed as a collection of separate binary packages that data providers would choose to install to build up a server with certain features. As the number of modules grew, this became more and more complex and time consuming. As of Hyrax 1.12 we distribute the server in three discrete packages: the DAP library; the BES daemon together with all of the most important handlers (including the NcML handler described here); and the Hyrax web services front end. In some places in this documentation you may read about 'installing the handler' or similar text; you can safely ignore it. If you have a modern version of the server, it already includes this handler.
Features

This version currently implements a subset of NcML 2.2 functionality, along with some OPeNDAP extensions:

  • Metadata Manipulation
    • Addition, removal, and modification of attributes of other datasets (NetCDF, HDF4, HDF5, etc.) served by the same Hyrax server
    • Extends NcML 2.2 to allow for common nested "attribute containers"
    • Attributes can be DAP2 types as well as the NcML types
    • Attributes can be of the special "OtherXML" type for injecting arbitrary XML into a DDX response
  • Data Manipulation
    • Addition of new data variables (scalars or arrays of basic types as well as structures)
    • Variables may be removed from the wrapped dataset
    • Allows the creation of "pure virtual" datasets which do not wrap another dataset
  • Aggregations: JoinNew, JoinExisting, and Union:
    • JoinNew Aggregation
      • Allows multiple datasets to be "joined" by creating a new outer dimension for the aggregated variable
      • Aggregation member datasets can be listed explicitly with explicit coordinates for the new dimension for each member
      • Scan: Aggregations can be specified "automatically" by scanning a directory for files matching certain criteria, such as a suffix or regular expression.
      • Metadata may be added to the new coordinate variable for the new dimension
    • JoinExisting Aggregation
      • The ncoords attribute can be left out of the joinExisting granules. However, this may be a slow operation, depending on the number of granules in the aggregation.
      • Scan may also be used with the ncoords attribute for uniformly sized granules
      • Allows the join dimension only to be aggregated from the granules; it cannot be overridden in NcML
    • Union Aggregation
      • Merges all member datasets into one by taking the first named instance of variables and metadata from the members
      • Useful for combining two or more datasets with different variables into a single set
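
For example, a joinExisting aggregation along an existing "time" dimension might be written as the following sketch (the file names, sizes, and dimension name are hypothetical); supplying ncoords avoids a potentially slow inspection of every granule:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="joinExisting" dimName="time">
    <!-- ncoords gives the size of the join dimension in each granule -->
    <netcdf location="data/day1.nc" ncoords="24"/>
    <netcdf location="data/day2.nc" ncoords="24"/>
  </aggregation>
</netcdf>
```
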
Configuration Parameters

TempDirectory

Specifies where the NcML handler should store temporary data on the server’s file system.

Default value is '/tmp'.

NCML.TempDirectory=/tmp

GlobalAttributesContainerName

In DAP2, all global attributes must be held in containers. The handler’s default behavior, however, is set for DAP4, which relaxes this requirement so that any kind of attribute can be a global attribute. To support older clients that only understand DAP2, the handler will bundle top-level non-container attributes into a container. Use this option to set the name of that container. By default, the container is named NC_GLOBAL (because many clients look for that name), but it can be anything you choose.

NCML.GlobalAttributesContainerName=NC_GLOBAL
Testing Installation

Test data is provided so you can check whether the installation was successful. The file sample_virtual_dataset.ncml is a dataset created purely in NcML and does not wrap an underlying dataset. You may also view fnoc1_improved.ncml to test adding attributes to an existing netCDF dataset (fnoc1.nc), but this requires the netCDF data handler to be installed first! Several of the other installed examples use the HDF4 and HDF5 handlers.

Functionality

This version of the NcML Module implements a subset of NcML 2.2 functionality.

Our module can currently…

  • Refer to files being served locally, as well as to remote datasets via the Gateway System
  • Add, modify, and remove attribute metadata to a dataset
  • Create a purely virtual dataset using just NcML and no underlying dataset
  • Create new scalar variables of any simple NcML type or simple DAP type
  • Create new Structure variables (which can contain new child variables)
  • Create new N-dimensional arrays of simple types (NcML or DAP)
  • Remove existing variables from a wrapped dataset
  • Rename existing variables in a wrapped dataset
  • Name dimensions as a mnemonic for specifying Array shapes
  • Perform union aggregations on multiple datasets, virtual or wrapped or both
  • Perform joinNew aggregations to merge a variable across multiple datasets by creating a new outer dimension
  • Specify aggregation member datasets by scanning directories for files matching certain criteria

We describe each supported NcML element in detail below.

<netcdf> Element

The <netcdf> element is used to define a dataset: a wrapped dataset that is to be modified, a pure virtual dataset, or a member dataset of an aggregation. The <netcdf> element is expected to be either the topmost node or a child of an aggregation element.

Local vs. Remote Datasets

The location attribute (netcdf@location) can be used to reference either local or remote files. If the value of netcdf@location does not begin with the string http, the value is interpreted as a path to a dataset relative to the BES data root directory. However, if the value of the netcdf@location attribute begins with http, the value is treated as a URL and the Gateway System is used to access the remote data. As a result, any URL used in a netcdf@location attribute value must match one of the Gateway.Whitelist expressions in the bes.conf stack.
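
For example, both of the following forms are legal sketches; the path and URL are hypothetical, and the remote URL would need to match a Gateway.Whitelist entry:

```xml
<!-- Local: interpreted relative to the BES data root directory -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="data/nc/fnoc1.nc"/>

<!-- Remote: fetched through the Gateway System -->
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="http://test.opendap.org/opendap/data/nc/fnoc1.nc"/>
```
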

If the value of netcdf@location is the empty string (or is unspecified, since empty is the default), the dataset is a pure virtual dataset, fully specified within the NcML file itself. Attributes and variables in such a dataset may be fully described and accessed with constraints just like those of normal datasets. The installed sample datafile "sample_virtual_dataset.ncml" is an example test case for this functionality.
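
A minimal pure virtual dataset, in the spirit of the installed sample, might look like this (the attribute, variable name, and value are illustrative):

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- No location attribute: the dataset below exists only in this NcML file -->
  <attribute name="title" type="string" value="A purely virtual dataset"/>
  <variable name="Answer" type="int">
    <values>42</values>
  </variable>
</netcdf>
```
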

Unsupported Attributes

The current version does not support the following attributes of <netcdf>:

  • enhance
  • addRecords
  • fmrcDefinition (will be supported when FMRC aggregation is added)

<readMetadata> Element

The <readMetadata/> element is the default, so it is effectively not needed.

<explicit> element

The <explicit/> element simply clears all attribute tables in the referenced netcdf@location before the rest of the NcML transformations are applied to the metadata.
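
For example, the following sketch clears all of the wrapped dataset's metadata and then adds back a single global attribute (the location is hypothetical):

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="data/nc/fnoc1.nc">
  <explicit/>
  <attribute name="history" type="string"
             value="Original metadata removed; this attribute added via NcML."/>
</netcdf>
```
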

<dimension> Element

The <dimension> element has limited functionality in this release, since DAP2 supports dimensions only as mnemonics at this time. The limitations are:

  • We only parse the dimension@name and dimension@length attributes.
  • A dimension can only be specified as a direct child of a <netcdf> element, prior to any reference to it

For example…

<netcdf>
  <dimension name="station" length="2"/>
  <dimension name="samples" length="5"/>
  <!-- Some variable elements refer to the dimensions here -->
</netcdf>

The dimension element sets up a mapping from the name to the unsigned integer length, and the name can be used in a variable@shape to specify the length of an array dimension (see the section on <variable> below). The dimension map is cleared when </netcdf> is encountered; this does not matter currently, since only one <netcdf> element is allowed, but it will matter for aggregation. We also do not support <group>, which is the only other legal place in NcML 2.2 for a dimension element.

Parse Errors:

  • If the name and length are not both specified.
  • If the dimension name already exists in the current scope
  • If the length is not an unsigned integer
  • If any of the other attributes specified in NcML 2.2 are used. We do not handle them, so we consider them errors now.

<variable> Element

The <variable> element is used to:

  • Provide lexical scope for a contained <attribute> or <variable> element
  • Rename existing variables
  • Add new scalar variables of simple types
  • Add new Structure variables
  • Add new N-dimensional Array’s of simple types
  • Specify the coordinate variable for the new dimension in a joinNew aggregation

We describe each in turn in more detail.

Note: When working with an existing variable (array or otherwise), it is not required that the variable type be specified in its NcML declaration. All that is needed is the correct name (in lexical scope). When specifying the type for an existing variable, care must be taken to ensure that the type given in the NcML document matches the type of the existing variable. In particular, variables that are arrays must be typed as "Array", not as the type of the template primitive.

Specifying Lexical Scope with <variable type="">

Consider the following example:

  <variable name="u">
    <attribute name="Metadata" type="string">This is metadata!</attribute>
  </variable>

This example assumes that a variable named "u" exists (of any type, since we do not specify one) and provides the lexical scope for the attribute "Metadata", which will be added or modified within the attribute table for the variable "u" (its qualified name would be "u.Metadata").

Nested DAP Structure and Grid Scopes

Scoping variable elements may be nested if the containing variable is a Structure (this includes the special case of Grid).

  <variable name="DATA_GRANULE" type="Structure">
    <variable name="PlanetaryGrid" type="Structure">
      <variable name="percipitate">
        <attribute name="units" type="String" value="inches"/>
      </variable>
    </variable>
  </variable>

This adds a "units" attribute to the variable "percipitate" within the nested Structures ("DATA_GRANULE.PlanetaryGrid.percipitate" as the fully qualified name). Note that we must refer to the type explicitly as "Structure" so the parser knows to traverse the tree.

Note: The variable might be of type Grid, but the type "Structure" must be used in the NcML to traverse it.

Adding Multiple Attributes to the Same Variable

Once the variable’s scope is set by the opening <variable> element, more than one attribute can be specified within it. This makes the NcML more readable and also makes parsing more efficient, since the variable only needs to be looked up once.

For example…

<variable name="Foo">
   <attribute name="Attr_1" type="string" value="Hello"/>
   <attribute name="Attr_2" type="string" value="World!"/>
</variable>

…should be preferred over…

<variable name="Foo">
   <attribute name="Attr_1" type="string" value="Hello"/>
</variable>
<variable name="Foo">
   <attribute name="Attr_2" type="string" value="World!"/>
</variable>

…although they produce the same result. Any number of attributes can be specified before the variable is closed.

Renaming Existing Variables

The attribute variable@orgName is used to rename an existing variable.

For example…

<variable name="NewName" orgName="OldName"/>

…will rename an existing variable at the current scope named "OldName" to "NewName". After this point in the NcML file (such as in constraints specified for the DAP request), the variable is known by "NewName".

Note that the type is not required here: the variable is assumed to exist, and its existing type is used. It is not possible to change the type of an existing variable at this time.

Parse Errors:

  • If a variable with variable@orgName doesn’t exist in the current scope
  • If the new name variable@name is already taken in the current scope
  • If a new variable is created but does not have exactly one values element

Adding a New Scalar Variable

The <variable> element can be used to create a new scalar variable of a simple type (i.e. an atomic NcML type such as "int" or "float", or any DAP atomic type, such as "UInt32" or "URL") by specifying an empty variable@shape (the default), a simple type for variable@type, and a contained <values> element with a single value of the correct type.

For example…

<variable name="TheAnswerToLifeTheUniverseAndEverything" type="double">
    <attribute name="SolvedBy" type="String" value="Deep Thought"/>
    <values>42.000</values>
  </variable>

…will create a new variable named "TheAnswerToLifeTheUniverseAndEverything" at the current scope. It has no shape, so it will be a scalar of type "double" with the value 42.0.

Parse Errors:

  • It is a parse error to not specify a <values> element with exactly one proper value of the variable type.
  • It is a parse error to specify a malformed or out of bounds value for the data type

Adding a New Structure Variable

A new Structure variable can be specified at the global scope or within another Structure. It is illegal for an array to have type Structure, so the shape must be empty.

For example…

<variable name="MyNewStructure" type="Structure">
    <attribute name="MetaData" type="String" value="This is metadata!"/>
    <variable name="ContainedScalar1" type="String"><values>I live in a new structure!</values></variable>
    <variable name="ContainedInt1" type="int"><values>42</values></variable>
  </variable>

…specifies a new structure called "MyNewStructure" which contains two scalar variable fields "ContainedScalar1" and "ContainedInt1".

Nested structures are allowed as well.

Parse Error:

  • If another variable or attribute exists at the current scope with the new name.
  • If a <values> element is specified as a direct child of a new Structure --- structures cannot contain values, only attributes and other variables.

Adding a New N-dimensional Array

An N-dimensional array of a simple type may also be created virtually by specifying a non-empty variable@shape. The shape lists the array dimensions in left-to-right order, slowest-varying dimension first. For example…

 <variable name="FloatArray" type="float" shape="2 5">
   <!-- values specified in row-major order (leftmost dimension in shape varies slowest).
        Any whitespace is a valid separator by default, so we can use newlines to
        pretty-print 2D matrices. -->
   <values>
     0.1 0.2 0.3 0.4 0.5
     1.1 1.1 1.3 1.4 1.5
   </values>
 </variable>

…will specify a 2x5 array of float values called "FloatArray". The <values> element must contain 2x5=10 values in row-major order (slowest-varying dimension first). Since whitespace is the default separator, we use a newline to show the dimension boundary for the values, which is easy to see for a 2D matrix such as this.

A dimension name may also be used as a mnemonic for a length. The DAP response will use this mnemonic in its output, but it is not currently used for shared dimensions; it is only a mnemonic. See the section on the <dimension> element for more information. For example…

<netcdf>
  <dimension name="station" length="2"/>
  <dimension name="sample" length="5"/>
  <variable name="FloatArray" type="float" shape="station sample">
    <values>
      0.1 0.2 0.3 0.4 0.5
      1.1 1.1 1.3 1.4 1.5
    </values>
  </variable>
</netcdf>

…will produce the same 2x5 array, but will incorporate the dimension mnemonics into the response. For example, here’s the DDS response:

Dataset {
     Float32 FloatArray[station = 2][sample = 5];
} sample_virtual_dataset.ncml;

Note that the <values> element respects the values@separator attribute if whitespace isn’t correct. This is very useful for arrays of strings with whitespace, for example…

<variable name="StringArray" type="string" shape="3">
  <values separator="*">String 1*String 2*String 3</values>
</variable>

…creates a length-3 array of strings: StringArray = {"String 1", "String 2", "String 3"}.

Parse Errors:

  • It is an error to specify the incorrect number of values
  • It is an error if any value is malformed or out of range for the data type.
  • It is an error to specify a named dimension which does not exist in the current <netcdf> scope.
  • It is an error to specify an Array whose flattened size (product of dimensions) is > 2^31-1.

Specifying the New Coordinate Variable for a joinNew Aggregation

In the special case of a joinNew aggregation, the new coordinate variable may be specified with the <variable> element. The new coordinate variable is defined to have the same name as the new dimension. This allows for several things:

  • Explicit specification of the variable type and coordinates for the new dimension
  • Specification of the metadata for the new coordinate variable

In the first case, the author can explicitly specify the type of the new coordinate variable and the actual coordinate value for each dataset. In this case, the variable must be specified after the aggregation element in the file, so that the new dimension’s size (the number of member datasets) is known and error checking can be performed. Metadata can also be added to the variable here.

In the second case, the author may specify just the variable name, which allows one to attach metadata to a coordinate variable that is automatically generated by the aggregation itself. This is the only case in which a variable element is allowed to omit a values element! Coordinate variables are generated automatically in two cases:

  • The author has specified an explicit list of member datasets, with or without explicit netcdf@coordValue attributes.
  • The author has used a <scan> element to specify the member datasets via a directory scan

In this case, the <variable> element may come before or after the <aggregation>.

Parse Errors:

  • If an explicit variable is declared for the new coordinate variable:
    • And it contains explicit values, the number of values must be equal to the number of member datasets in the aggregation.
    • It must be specified after the <aggregation> element
  • If a numeric coordValue is used to specify the first member dataset’s coordinate, then all datasets must contain a numerical coordinate.
  • An error is thrown if the specified aggregation variable (variableAgg) is not found in all member datasets.
  • An error is thrown if the specified aggregation variable is not of the same type in all member datasets. Coercion is not performed!
  • An error is thrown if the specified aggregation variables in all member datasets do not have the same shape
  • An error is thrown if an explicit coordinate variable is specified with a shape that is not the same as the new dimension name (and the variable name itself).
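
Putting these rules together, an explicitly specified coordinate variable for a new "time" dimension might look like the following sketch (the file names, variable names, type, and values are hypothetical). Note that the <variable> comes after the <aggregation> and supplies exactly one value per member dataset:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="joinNew" dimName="time" variableAgg="SST">
    <netcdf location="data/sst_1.nc"/>
    <netcdf location="data/sst_2.nc"/>
  </aggregation>
  <!-- Two member datasets, so exactly two coordinate values -->
  <variable name="time" type="int" shape="time">
    <attribute name="units" type="string" value="days since 2000-01-01"/>
    <values>0 1</values>
  </variable>
</netcdf>
```
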

<values> Element

The <values> element can only be used in the context of a new variable of scalar or array type. We cannot change the values of existing variables in this version of the handler. The character content of a <values> element is considered to be a separated list of value tokens valid for the type of the variable of the parent element. The number of specified tokens in the content must equal the product of the dimensions of the enclosing variable@shape, or be one value for a scalar. It is also an error to omit the <values> element for a newly declared variable.

Changing the Separator Tokens

The author may specify values@separator to change the default value token separator from the default whitespace. This is very useful for specifying arrays of strings with whitespace in them, or if data in CSV form is being pasted in.

Autogeneration of Uniform Arrays

We can also parse values@start and values@increment instead of tokens in the content. This will "autogenerate" a uniform array of values whose length is the product of the dimensions of the containing variable. For example:

<variable name="Evens" type="int" shape="100">
  <values start="0" increment="2"/>
</variable>

will specify an array of the first 100 even numbers (including 0).

Parse Errors:

  • If the incorrect number of tokens are specified for the containing variable’s shape
  • If any value token cannot be parsed as a valid value for the containing variable’s type
  • If content is specified in addition to start and increment
  • If only one of start or increment is specified
  • If the values element is placed anywhere except within a NEW variable.

<attribute> Element

As an overview: whenever the parser encounters an <attribute> with a name that does not exist at the current scope, it creates a new one, whether a container or an atomic attribute (see below). If the attribute exists, its value and/or type are modified to those specified in the <attribute> element. If an attribute structure (container) exists, it is used to define a nested lexical scope for child attributes.

Attributes may be scalar (one value) or one-dimensional arrays. Arrays are specified by separating the values with whitespace (the default). The attribute@separator may also be set in order to specify a different separator, for example to parse CSV data or to use a non-whitespace separator so strings containing whitespace are not tokenized. We give examples of creating array attributes below.

Adding New Attributes or Modifying an Existing Attribute

If a specified attribute with the attribute@name does not exist at the current lexical scope, a new one is created with the given type and value. For example, assume "new_metadata" doesn’t exist at the current parse scope. Then…

<attribute name="new_metadata" type="string" value="This is a new entry!"/>

…will create the attribute at that scope. Note that value can be specified in the content of the element as well. This is identical to the above:

<attribute name="new_metadata" type="string">This is a new entry!</attribute>

If the attribute@name already exists at the scope, it is modified to contain the specified type and value.

Arrays

As in NcML, for numerical types an array can be specified by separating the tokens with whitespace (the default) or by specifying the token separator with attribute@separator. For example…

<attribute name="myArray" type="int">1 2 3</attribute>

…and…

<attribute name="myArray" type="int" separator=",">1,2,3</attribute>

…both specify the same array of three integers named "myArray".


Structures (Containers)

We use attribute@type="Structure" to define a new (or existing) attribute container. So if we wanted to add a new attribute structure, we’d use something like this:

  <attribute name="MySamples" type="Structure">
    <attribute name="Location" type="string" value="Station 1"/>
    <attribute name="Samples" type="int">1 4 6</attribute>
  </attribute>

Assuming "MySamples" doesn’t already exist, an attribute container will be created at the current scope and the "Location" and "Samples" attributes will be added to it.

Note that we can create nested attribute structures to arbitrary depth this way as well.

If the attribute container with the given name already exists at the current scope, then the attribute@type="Structure" form is used to define the lexical scope for the container. In other words, child <attribute> elements will be processed within the scope of the container. For example, in the above example, if "MySamples" already exists, then the "Location" and "Samples" will be processed within the existing container (they may or may not already exist as well).

Renaming an Existing Attribute or Attribute Container

We also support the attribute@orgName attribute for renaming attributes.

For example…

<attribute name="NewName" orgName="OldName" type="string"/>

will rename an existing attribute "OldName" to "NewName" while leaving its value alone. If attribute@value is also specified, then the attribute is renamed and has its value modified.

This works for renaming attribute containers as well:

<attribute name="MyNewContainer" orgName="MyOldContainer" type="Structure"/>

…will rename an existing "MyOldContainer" to "MyNewContainer". Note that any children of this container will remain in it.

DAP OtherXML Extension

The module now allows specification of attributes of the new DAP type "OtherXML". This lets the NcML file author inject arbitrary well-formed XML into an attribute, for clients that want XML metadata rather than just a string or URL. Internally, the attribute is still a string (and in a DAP DAS response it will be quoted inside one string). However, since it is XML, the NCMLParser still parses it and checks it for well-formedness (but NOT against schemas), so arbitrary XML can be carried within the attribute without causing parse errors.

The injected XML is most useful in the DDX response, where it shows up directly in the response as XML. XSLT and other clients can then parse it.

Errors

  • The XML must be in the content of the <attribute type="OtherXML"> element. It is a parser error for attribute@value to be set if attribute@type is "OtherXML".
  • The XML must also be well-formed since it is parsed. A parse error will be thrown if the OtherXML is malformed.

Example

Here’s an example of the use of this special case:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        location="/coverage/200803061600_HFRadar_USEGC_6km_rtv_SIO.nc">
    <attribute name="someName" type="OtherXML">
        <Domain xmlns="http://www.opengis.net/wcs/1.1"
                xmlns:ows="http://www.opengis.net/ows/1.1"
                xmlns:gml="http://www.opengis.net/gml/3.2"
                >
            <SpatialDomain>
                <ows:BoundingBox crs="urn:ogc:def:crs:EPSG::4326">
                    <ows:LowerCorner>-97.8839 21.736</ows:LowerCorner>
                    <ows:UpperCorner>-57.2312 46.4944</ows:UpperCorner>
                </ows:BoundingBox>
            </SpatialDomain>
            <TemporalDomain>
                <gml:timePosition>2008-03-27T16:00:00.000Z</gml:timePosition>
            </TemporalDomain>
        </Domain>
        <SupportedCRS xmlns="http://www.opengis.net/wcs/1.1">urn:ogc:def:crs:EPSG::4326</SupportedCRS>
        <SupportedFormat xmlns="http://www.opengis.net/wcs/1.1">netcdf-cf1.0</SupportedFormat>
        <SupportedFormat xmlns="http://www.opengis.net/wcs/1.1">dap2.0</SupportedFormat>
    </attribute>
</netcdf>

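While the exact response is not reproduced here, the injected XML surfaces in the DDX inside the corresponding Attribute element, roughly along these lines (a sketch, not verbatim server output):

```xml
<Attribute name="someName" type="OtherXML">
    <Domain xmlns="http://www.opengis.net/wcs/1.1"
            xmlns:ows="http://www.opengis.net/ows/1.1"
            xmlns:gml="http://www.opengis.net/gml/3.2">
        <!-- SpatialDomain and TemporalDomain as authored above -->
    </Domain>
    <!-- remaining authored elements (SupportedCRS, SupportedFormat, ...) -->
</Attribute>
```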

Namespace Closure

Furthermore, the parser makes the chunk of OtherXML "namespace closed". This means any namespaces specified on ancestor NcML elements of the OtherXML tree are "brought down" and added to the root OtherXML elements, so that the subtree may be pulled out and added to the DDX while keeping its namespaces. The algorithm does not bring down only the prefixes actually used; it adds all unique namespaces (as determined by prefix) in order, starting from the root of the OtherXML tree and traversing up to the root of the NcML document.

Namespace closure is syntactic sugar that simplifies the author’s task: the namespaces can be specified just once at the top of the NcML file, with the expectation that when the subtree of XML is added to the DDX those namespaces will come along with it. Otherwise the author would have to add the namespaces explicitly to each attribute.

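As a sketch of namespace closure (the attribute name and location are hypothetical): if the wcs prefix is declared once on the root <netcdf> element, the parser copies that namespace down onto the root elements of the OtherXML subtree, so the subtree remains well-formed when extracted into the DDX:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
        xmlns:wcs="http://www.opengis.net/wcs/1.1"
        location="data/nc/fnoc1.nc">
  <attribute name="CoverageInfo" type="OtherXML">
    <!-- In the DDX, this element will carry xmlns:wcs itself, brought down
         from the <netcdf> root by the namespace-closure algorithm -->
    <wcs:SupportedFormat>netcdf-cf1.0</wcs:SupportedFormat>
  </attribute>
</netcdf>
```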

<remove> Element

The <remove> element can remove attributes and variables. For example…

  <attribute name="NC_GLOBAL" type="Structure">
    <remove name="base_time" type="attribute"/>
  </attribute>

…will remove the attribute named "base_time" in the attribute structure named "NC_GLOBAL".

Note that this works for attribute containers as well. We could recursively remove the entire attribute container (i.e. it and all its children) with:

 <remove name="NC_GLOBAL" type="attribute"/>

It also can be used to remove variables from existing datasets:

  <remove name="SomeExistingVariable" type="variable"/>

This also recurses on variables of type Structure --- the entire structure, including all of its children, is removed from the dataset’s response.

Parse Errors:

  • It is a parse error if the given attribute or variable doesn’t exist in the current scope

<aggregation> Element

Note: The syntax used by Hyrax is slightly different from that of the THREDDS Data Server (TDS). In particular, we do not process the <aggregation> element prior to other elements in the dataset, so in some cases the relative ordering of the <aggregation> element and references to variables within the aggregation matters.

Aggregation involves combining multiple datasets (<netcdf>) into a "single" virtual dataset in various ways. For a tutorial on aggregation in NcML 2.2, the reader is referred to the Unidata page: http://www.unidata.ucar.edu/software/netcdf/ncml/v2.2/Aggregation.html

NcML 2.2 supports multiple types of aggregation: union, joinNew, joinExisting, and fmrc (forecast model run collection).

The current version of the NcML module supports union, joinNew, and joinExisting aggregations. The first two are described here:

A union aggregation specifies that the first instance of a variable or attribute (by name) found in the ordered list of member datasets will be the one in the output aggregation. This is useful for combining two dataset files, each of which may contain a single variable, into a composite dataset with both variables.
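
A minimal union, sketched with hypothetical single-variable member files:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="union">
    <!-- The first named instance of each variable or attribute wins -->
    <netcdf location="data/sst.nc"/>
    <netcdf location="data/winds.nc"/>
  </aggregation>
</netcdf>
```
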

A joinNew aggregation joins a variable that exists in multiple datasets (usually samples of a datum over time) into a new variable containing the data from all member datasets, by creating a new outer dimension. The ith component in the new outer dimension is the variable’s data from the ith member dataset. The aggregation also adds a new coordinate variable whose name is the new dimension’s name and whose shape (length) is that of the new dimension as well. This new coordinate variable may be given explicitly by the author or may be autogenerated in one of several ways.
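
A joinNew over a hypothetical SST variable, with a coordinate supplied for each member via netcdf@coordValue, might be sketched as:

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="joinNew" dimName="time" variableAgg="SST">
    <!-- Each member contributes one slice along the new "time" dimension -->
    <netcdf location="data/sst_day1.nc" coordValue="1"/>
    <netcdf location="data/sst_day2.nc" coordValue="2"/>
  </aggregation>
</netcdf>
```
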

<scan> Element

The <scan> element can be used within an aggregation context to have a directory searched in various ways in order to specify the members of an aggregation. This allows a static NcML file to refer to an aggregation that may change over time, such as one where a new data file is generated each day.

We describe usage of the <scan> element in detail in the joinNew aggregation tutorial.
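
As a taste of the syntax, a scan-based joinNew might be sketched as follows (the directory, suffix, and names are hypothetical):

```xml
<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <aggregation type="joinNew" dimName="time" variableAgg="SST">
    <!-- Aggregate every file ending in .nc found under data/sst/ -->
    <scan location="data/sst/" suffix=".nc"/>
  </aggregation>
</netcdf>
```
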

Errors

There are three types of error messages that may be returned:

  • Internal Error
  • Resource Not Found Error
  • Parse Error

Internal Errors

Internal errors should be reported to support@opendap.org as they are likely bugs.

Resource Not Found Errors

If netcdf@location specifies a non-existent local dataset (one that is not being served by the same Hyrax server), the server will report that the resource was not found. This error may also be returned if a handler for the specified dataset is not currently loaded in the BES. Before writing NcML to add metadata, test that the dataset to be wrapped already exists and can be viewed on the running server. Referring to a remote dataset that the Gateway System is not permitted to access is also an error.

Parse Errors

Parse errors are user errors in the NcML file. These could be malformed XML, malformed NcML, uses of unimplemented NcML features, or errors in referring to the wrapped dataset.

The error message should specify the error condition as well as the "current scope" as a fully qualified DAP name within the loaded dataset. This should be enough information to correct the parse error as new NcML files are created.

The parser will generate parse errors in various situations where it expects to find certain structure in the underlying dataset. Some examples:

  • A variable of the given name was not found at the current scope.
  • attribute@orgName was specified, but the attribute cannot be found at the current scope.
  • attribute@orgName was specified, but the new name is already used at the current scope.
  • remove specified a non-existent attribute name.

Additions/Changes to NcML 2.2

This section will keep track of changes to the NcML 2.2 schema. Eventually these will be rolled into a new schema.

Attribute Structures (Containers)

This module also adds functionality beyond the current NcML 2.2 schema --- it can handle nested <attribute> elements in order to make attribute structures. This is done by using the <attribute type="Structure"> form, for example:

  <attribute name="MySamples" type="Structure">
    <attribute name="Location" type="string" value="Station 1"/>
    <attribute name="Samples" type="int">1 4 6</attribute>
  </attribute>

"MySamples" describes an attribute structure with two attribute fields: a string "Location" and an array of ints called "Samples". Note that an attribute structure of this form can only contain other <attribute> elements and NOT a value.

If the container does not already exist, it will be created at the scope it is declared, which could be:

  • Global (top of dataset)
  • Within a variable’s attribute table
  • Within another attribute container

If an attribute container of the given name already exists at the lexical scope, it is traversed in order to define the scope for the nested (children) attributes it contains.
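For example (the dataset location is hypothetical), if the wrapped dataset already defines a global container named "MySamples", the following adds a new field inside the existing container rather than creating a second one:

```xml
<netcdf location="data/ncml/sample.nc">
  <!-- "MySamples" already exists, so it is traversed, not recreated -->
  <attribute name="MySamples" type="Structure">
    <!-- New child attribute added within the existing container -->
    <attribute name="Depth" type="float" value="42.5"/>
  </attribute>
</netcdf>
```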

Unspecified Variable Type Matching for Lexical Scope

We also allow the type attribute of a variable element (variable@type) to be the empty string (or unspecified) when using existing variables to define the lexical scope of an <attribute> transformation. In the schema, variable@type is (normally) required.

DAP 2 Types

Additionally, we allow DAP2 atomic types (such as UInt32, URL) in addition to the NcML types. The NcML types are mapped onto the closest DAP2 type internally.

DAP OtherXML Attribute Type

We also allow attributes to be of the new DAP type "OtherXML" for injecting arbitrary XML into an attribute as content rather than trying to form a string. This allows the parser to check well-formedness.
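A sketch of an OtherXML attribute (the attribute name and XML content are invented for illustration):

```xml
<attribute name="extra_metadata" type="OtherXML">
  <!-- Arbitrary well-formed XML becomes the attribute's content;
       the parser rejects the NcML if this XML is malformed -->
  <sampleInfo xmlns="http://example.com/ns">
    <station id="1">Pier A</station>
  </sampleInfo>
</attribute>
```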

Forward Declaration of Dimensions

Since we use a SAX parser for efficiency, we require the <dimension> elements to come before their use in a variable@shape. One way to change the schema to allow this is to force the dimension elements to be specified in a sequence after explicit and metadata choice and before all other elements.
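Concretely, a dimension must be declared before any variable whose shape refers to it:

```xml
<netcdf title="Forward declaration of dimensions">
  <!-- Declare the dimension first... -->
  <dimension name="station" length="5"/>
  <!-- ...then it may be used in variable@shape -->
  <variable name="V" type="int" shape="station">
    <values>1 2 3 4 5</values>
  </variable>
</netcdf>
```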

Aggregation Element Location and Processing Order Differences

NcML specifies that if a dataset (<netcdf> element) specifies an aggregation element, the aggregation element is always processed first, regardless of its ordering within the <netcdf> element. Our parser, since it is SAX and not DOM, modifies this behavior in that order matters in some cases:

  • Metadata (<attribute>) elements specified prior to an aggregation "shadow" the aggregation versions. This can be useful for "overriding" an attribute or variable in a union aggregation, where the first instance found takes precedence.
  • JoinNew: If the new coordinate variable’s data is to be set explicitly by specifying the new dimension’s shape (either with explicit data or with data autogenerated using the values@start and values@increment attributes), the <variable> must come after the aggregation, since the size of the dimension is unknown until the aggregation element has been processed.

Backward Compatibility Issues

Due to the way shared dimensions were implemented in the NetCDF, HDF4, and HDF5 handlers, the DAS responses did not follow the DAP2 specification. The NcML module, on the other hand, generates DAP2 compliant DAS for these datasets, which means that wrapping some datasets in NcML will generate a DAS with a different structure. This is important for the NcML author since it changes the names of attributes and variables. In order for the module to find the correct scope for adding metadata, for example, the DAP2 DAS must be used.

In general, what this means is that an empty "passthrough" NcML file should be the starting point for authoring an NcML file. This file would just specify a dataset and nothing else:

<netcdf location="/data/ncml/myNetcdf.nc"/>

The author would then request the DAS response for the NCML file and use that as the starting point for modifications to the original dataset.

More explicit examples are given below.

NetCDF

The NetCDF handler represents some NC datasets as a DAP 2 Grid, but the returned DAS is not consistent with the DAP 2 spec for the attribute hierarchy for such a Grid. The map vector attributes are placed as siblings of the grid attributes rather than within the grid lexical scope. For example, here’s the NetCDF Handler DDS for a given file:

Dataset {
    Grid {
      Array:
        Int16 cldc[time = 456][lat = 21][lon = 360];
      Maps:
        Float64 time[time = 456];
        Float32 lat[lat = 21];
        Float32 lon[lon = 360];
    } cldc;
} cldc.mean.nc;

…showing the Grid. Here’s the DAS the NetCDF handler generates…

Attributes {
    lat {
        String long_name "Latitude";
        String units "degrees_north";
        Float32 actual_range 10.00000000, -10.00000000;
    }
    lon {
        String long_name "Longitude";
        String units "degrees_east";
        Float32 actual_range 0.5000000000, 359.5000000;
    }
    time {
        String units "days since 1-1-1 00:00:0.0";
        String long_name "Time";
        String delta_t "0000-01-00 00:00:00";
        String avg_period "0000-01-00 00:00:00";
        Float64 actual_range 715511.00000000000, 729360.00000000000;
    }
    cldc {
        Float32 valid_range 0.000000000, 8.000000000;
        Float32 actual_range 0.000000000, 8.000000000;
        String units "okta";
        Int16 precision 1;
        Int16 missing_value 32766;
        Int16 _FillValue 32766;
        String long_name "Cloudiness Monthly Mean at Surface";
        String dataset "COADS 1-degree Equatorial Enhanced\\012AI";
        String var_desc "Cloudiness\\012C";
        String level_desc "Surface\\0120";
        String statistic "Mean\\012M";
        String parent_stat "Individual Obs\\012I";
        Float32 add_offset 3276.500000;
        Float32 scale_factor 0.1000000015;
    }
    NC_GLOBAL {
        String title "COADS 1-degree Equatorial Enhanced";
        String history "";
        String Conventions "COARDS";
    }
    DODS_EXTRA {
        String Unlimited_Dimension "time";
    }
}

Note the map vector attributes are in the "dataset" scope.

Here’s the DAS that the NcML Module produces from the correctly formed DDX:

Attributes {
    NC_GLOBAL {
        String title "COADS 1-degree Equatorial Enhanced";
        String history "";
        String Conventions "COARDS";
    }
    DODS_EXTRA {
        String Unlimited_Dimension "time";
    }
    cldc {
        Float32 valid_range 0.000000000, 8.000000000;
        Float32 actual_range 0.000000000, 8.000000000;
        String units "okta";
        Int16 precision 1;
        Int16 missing_value 32766;
        Int16 _FillValue 32766;
        String long_name "Cloudiness Monthly Mean at Surface";
        String dataset "COADS 1-degree Equatorial Enhanced\\012AI";
        String var_desc "Cloudiness\\012C";
        String level_desc "Surface\\0120";
        String statistic "Mean\\012M";
        String parent_stat "Individual Obs\\012I";
        Float32 add_offset 3276.500000;
        Float32 scale_factor 0.1000000015;
        cldc {
        }
        time {
            String units "days since 1-1-1 00:00:0.0";
            String long_name "Time";
            String delta_t "0000-01-00 00:00:00";
            String avg_period "0000-01-00 00:00:00";
            Float64 actual_range 715511.00000000000, 729360.00000000000;
        }
        lat {
            String long_name "Latitude";
            String units "degrees_north";
            Float32 actual_range 10.00000000, -10.00000000;
        }
        lon {
            String long_name "Longitude";
            String units "degrees_east";
            Float32 actual_range 0.5000000000, 359.5000000;
        }
    }
}

Here the Grid Structure "cldc" and its contained data array (of the same name "cldc") and map vectors have their own attribute containers as DAP 2 specifies.

What this means for the author of an NcML file adding metadata to a NetCDF dataset that returns a Grid is that they should generate a "passthrough" file and get the DAS and then specify modifications based on that structure.

Here’s an example passthrough:

<netcdf location="data/ncml/agg/cldc.mean.nc" title="This file results in a Grid">
</netcdf>

For example, to add an attribute to the map vector "lat" in the above, we’d need the following NcML:

<netcdf location="data/ncml/agg/cldc.mean.nc" title="This file results in a Grid">
  <!-- Traverse into the Grid as a Structure -->
  <variable name="cldc" type="Structure">
    <!-- Traverse into the "lat" map vector (Array) -->
    <variable name="lat">
      <attribute name="Description" type="string">I am a new attribute in the Grid map 
       vector named lat!</attribute>
    </variable>
    <variable name="lon">
      <attribute name="Description" type="string">I am a new attribute in the Grid map 
       vector named lon!</attribute>
    </variable>
  </variable>
</netcdf>

This clearly shows that the structure of the Grid must be used in the NcML: the attribute being added is technically "cldc.lat.Description" as a fully qualified name. The parser would return an error if it were attempted as "lat.Description", as the NetCDF DAS for the original file might have led one to believe.

HDF4/HDF5

Similarly to the NetCDF case, the Hyrax HDF4 Module produces DAS responses that do not respect the DAP2 specification. If an NcML file is used to "wrap" an HDF4 dataset, the correct DAP2 DAS response will be generated, however.

This is important for those writing NcML for HDF4 data since the lexical scope for attributes relies on the correct DAS form --- to handle this, the user should start with a "passthrough" NcML file (see the above NetCDF example) and use the DAS from that as the starting point for knowing the structure the NcML handler expects to see in the NcML file. Alternatively, the DDX has the proper attribute structure as well (the DAS is generated from it).
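For example, a passthrough for an HDF4 file might look like the following (the location is hypothetical); requesting the DAS of this NcML file shows the attribute structure the module expects:

```xml
<!-- Empty passthrough: wraps the dataset without changes.
     Request this file's DAS to see the DAP2-compliant structure. -->
<netcdf location="data/hdf4/sample.hdf"/>
```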

Known Bugs

There are no known bugs currently.

Planned Enhancements

Planned enhancements for future versions of the module include…

11.E.2. JoinNew Aggregation
Introduction

A joinNew aggregation joins existing datasets along a new outer Array dimension. Essentially, it adds a new index to the existing variable which points into the values in each member dataset. One useful application of this aggregation is joining multiple samples of data from different times into one virtual dataset containing all the times. We will first provide a basic introduction to the joinNew aggregation, then demonstrate the various ways to specify the member datasets of an aggregation, the values for the new dimension’s coordinate variable (map vector), and the metadata for the aggregation.

The reader is also directed to a basic tutorial of this NcML aggregation, which may be found at http://www.unidata.ucar.edu/software/thredds/current/netcdf-java/ncml/Aggregation.html#joinNew

A joinNew aggregation combines a variable with data across n datasets by creating a new outer dimension and placing the data from aggregation member i into element i of the new outer dimension of size n. By "outer dimension" we mean the slowest-varying dimension in a row-major-order flattening of the data (an example later will clarify this). For example, the array A[day][sample] has day as its outer dimension. The data samples must all have the same data syntax; specifically, the DDSs of the variables must all match. For example, if the aggregation variable is named sample and is a 10x10 Array of float32, then every member dataset in the aggregation must include a variable named sample that is also a 10x10 Array of float32. If 100 datasets were specified in the aggregation, the resulting DDS would contain a variable named sample of shape 100x10x10.

In addition, a new coordinate variable specifying data values for the new dimension will be created at the same scope as (a sibling of) the specified aggregation variable. For example, if the new dimension is called "filename" and the new dimension’s values are unspecified (the default), then an Array of type String will be created with one element for each member dataset --- the filename of the dataset. Additionally, if the aggregation variable was represented as a DAP Grid, this new dimension coordinate variable will also be added as a new Map vector inside the Grid to maintain the Grid specification.

There are multiple ways to specify the member datasets of a joinNew aggregation:

  • Explicit: Specifying a separate <netcdf> element for each dataset
  • Scan: scan a directory tree for files matching a conjunction of certain criteria:
    • Specific suffix
    • Older than a specific duration
    • Matching a specific regular expression
    • Either in a specific directory or recursively searching subdirectories

Additionally, there are multiple ways to specify the new coordinate variable’s (the new outer dimension’s associated data variable) data values:

  • Default: An Array of type String containing the filenames of the member datasets
  • Explicit Value Array: Explicit list of values of a specific data type, exactly one per dataset
  • Dynamic Array: a numeric Array variable specified using start and increment values — one value is generated automatically per dataset
  • Timestamp from Filename: An Array of String with values of ISO 8601 timestamps extracted from scanned dataset filenames using a specified Java SimpleDateFormat string. (Only works with the <scan> element!)
A Simple Self-Contained Example

First, we start with a simple, purely virtual (no external datasets) example to give you a basic idea of this aggregation. This example joins two one-dimensional Arrays of ints of length 5. The variable they describe is called V. In this example, we assume we are joining samples of some variable V where each dataset contains samples from 5 stations on a single day. We want to join the datasets so that the new outer dimension is the day, resulting in a 2x5 array of int values for V.

Here’s our NcML, with comments to describe what we are doing:

<?xml version="1.0" encoding="UTF-8"?>
<!-- A simple pure virtual joinNew aggregation resulting in type Array<int>[2][5]  -->
<netcdf title="Sample joinNew Aggregation on Pure NCML Datasets">
  <!-- joinNew forming new outer dimension "day" -->
  <aggregation type="joinNew" dimName="day">
    <!-- For variables with this name in child datasets -->
    <variableAgg name="V"/>
    <!-- Datasets are one-dimensional Array<int> with cardinality 5. -->
    <netcdf title="Sample Slice 1">
      <!-- Must forward declare the dimension size -->
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
        <values>1 3 5 7 9</values>
      </variable>
    </netcdf>
    <!-- Second slice must match shape! -->
    <netcdf title="Sample Slice 2">
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
        <values>2 4 6 8 10</values>
      </variable>
    </netcdf>
  </aggregation>
<!-- This is what the expected output aggregation will look like.
       We can use the named dimensions for the shape here since the aggregation
       comes first and the dimensions will be added to the parent dataset by now -->
  <variable name="V_expected" type="int" shape="day station">
    <!-- Row major values.  Since we create a new outer dimension, the slices are concatenated
        since the outer dimension varies the slowest in row major order.  This gives a 2x5 Array.
     We use the newline to show the dimension separation for the reader's benefit -->
    <values>
      1 3 5 7 9
      2 4 6 8 10
    </values>
  </variable>
</netcdf>

Notice that we specify the name of the aggregation variable V inside the aggregation using a <variableAgg> element --- this allows us to specify multiple variables in the datasets to join. The new dimension, however, is specified by the dimName attribute of <aggregation>. We do NOT need to specify a <dimension> element for the new dimension (in fact, it would be an error to do so); its size is calculated from the number of datasets in the aggregation.

Running this file through the module produces the following DDS:

Dataset {
    Int32 V[day = 2][station = 5];
    Int32 V_expected[day = 2][station = 5];
    String day[day = 2];
} joinNew_virtual.ncml;

Notice how the new dimension caused a coordinate variable to be created with the same name and shape as the new dimension. This array will contain the default values for the new outer dimension’s map as we shall see if we ask for the ASCII version of the DODS (data) response:

The data:
Int32 V[day = 2][station = 5] = {{1, 3, 5, 7, 9},{2, 4, 6, 8, 10}};
Int32 V_expected[day = 2][station = 5] = {{1, 3, 5, 7, 9},{2, 4, 6, 8, 10}};
String day[day = 2] = {"Virtual_Dataset_0", "Virtual_Dataset_1"};

We see that the resulting aggregation data matches what we expected, as specified by our V_expected variable. Also, notice that the values for the coordinate variable are "Virtual_Dataset_i", where i is the number of the dataset. Since the datasets did not have the location attribute set (which would have been used if present), the module generates unique names for the virtual datasets in the output.

We could also have specified the value for the dataset using the netcdf@coordValue attribute:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Sample joinNew Aggregation on Pure NCML Datasets">
    <aggregation type="joinNew" dimName="day">
    <variableAgg name="V"/>
    <netcdf title="Sample Slice 1" coordValue="100">
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
        <values>1 3 5 7 9</values>
      </variable>
    </netcdf>
    <netcdf title="Sample Slice 2" coordValue="107">
      <dimension name="station" length="5"/>
      <variable name="V" type="int" shape="station">
        <values>2 4 6 8 10</values>
      </variable>
    </netcdf>
  </aggregation>
</netcdf>

This results in the ASCII DODS of…

The data:
Int32 V[day = 2][station = 5] = {{1, 3, 5, 7, 9},{2, 4, 6, 8, 10}};
Float64 day[day = 2] = {100, 107};

Since the coordValue attributes could be parsed numerically, the coordinate variable is of type double (Float64). If they could not be parsed numerically, the variable would be of type String.

Now that the reader has an idea of the basics of the joinNew aggregation, we will work through examples for the many different use cases an NcML aggregation author may encounter.

A Simple Example Using Explicit Dataset Files

Using virtual datasets is not that common. More commonly, the aggregation author wants to specify data files for the aggregation. As an introductory example, we’ll create a simple aggregation that explicitly lists the files and gives string coordValues. Note that this is a contrived example: we use the same dataset file for each member, but change the coordValue. Also notice that we have specified that both the u and v variables be aggregated along the same new dimension, source.

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Aggregation with explicit string coordValue.">
  <aggregation type="joinNew" dimName="source">
    <variableAgg name="u"/>
    <variableAgg name="v"/>
    <!-- Same dataset a few times, but with different coordVal -->
    <netcdf title="Dataset 1" location="data/ncml/fnoc1.nc" coordValue="Station_1"/>
    <netcdf title="Dataset 2" location="data/ncml/fnoc1.nc" coordValue="Station_2"/>
    <netcdf title="Dataset 3" location="data/ncml/fnoc1.nc" coordValue="Station_3"/>
  </aggregation>
</netcdf>

…which produces the DDS:

Dataset {
    Int16 u[source = 3][time_a = 16][lat = 17][lon = 21];
    Int16 v[source = 3][time_a = 16][lat = 17][lon = 21];
    Float32 lat[lat = 17];
    Float32 lon[lon = 21];
    Float32 time[time = 16];
    String source[source = 3];
} joinNew_string_coordVal.ncml;

Since there’s so much data we only show the new coordinate variable:

String source[source = 3] = {"Station_1", "Station_2", "Station_3"};

Also notice that other coordinate variables (lat, lon, time) already existed in the datasets along with the u and v arrays. Any variable that is not aggregated over (i.e., not specified as an aggregationVar) is union aggregated (please see NCML_Module_Aggregation_Union) into the resulting dataset --- the first instance of each variable, in the order the datasets are listed, is used.

Now that we’ve seen simple cases, let’s look at more complex examples.

Examples of Explicit Dataset Listings

In this section we will give several examples of joinNew aggregation with a static, explicit list of member datasets. In particular, we will go over examples of…

  • Default values for the new coordinate variable
  • Explicitly setting values of any type on the new coordinate variable
  • Autogenerating uniform numeric values for the new coordinate variable
  • Explicitly setting String or double values using the netcdf@coordValue attribute

There are several ways to specify values for the new coordinate variable of the new outer dimension. If String or double values are sufficient, the author may set the value for each listed dataset using the netcdf@coordValue attribute for each dataset. If another type is required for the new coordinate variable, then the author has a choice of specifying the entire new coordinate variable explicitly (which must match dimensionality of the aggregated dimension) or using the start/increment autogeneration <values> element for numeric, evenly spaced samples.
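A sketch of both explicit forms, building on the earlier virtual example (the dataset locations are hypothetical). Note that, as discussed earlier, the new coordinate variable must come after the <aggregation> element:

```xml
<netcdf title="Explicit values for the new coordinate variable">
  <aggregation type="joinNew" dimName="day">
    <variableAgg name="V"/>
    <netcdf location="data/ncml/slice1.nc"/>
    <netcdf location="data/ncml/slice2.nc"/>
  </aggregation>
  <!-- Must come after the aggregation; exactly one value per dataset.
       For evenly spaced numeric values, <values start="0" increment="7"/>
       may be used instead of listing them. -->
  <variable name="day" type="int" shape="day">
    <values>0 7</values>
  </variable>
</netcdf>
```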

Please see the Join Explicit Dataset Tutorial.

Adding/Modifying Metadata on Aggregations

It is possible to add or modify metadata on existing or new variables in an aggregation. The syntax for these varies somewhat, so we give examples of the different cases. We will also give examples of providing metadata:

  • Adding/modifying metadata to the new coordinate variable
  • Adding/modifying metadata to the aggregation variable itself
  • Adding/modifying metadata to existing maps in an aggregated Grid

Please see the Metadata on Aggregations Tutorial.

Dynamic Aggregations Using Directory Scanning

A powerful way to create dynamic aggregations (rather than by listing datasets explicitly) is by specifying a data directory where aggregation member datasets are stored and some criteria for which files are to be added to the aggregation. These criteria will be combined in a conjunction (an AND operator) to handle various types of searches. The way to specify datasets in an aggregation is by using the <scan> element inside the <aggregation> element.

A key benefit of using the <scan> element is that the NcML file need not change as new datasets are added to the aggregation, say by an automated process which simply writes new data files into a specific directory. By properly specifying the NcML aggregation with a scan, the same NcML will refer to a dynamically changing aggregation, staying up to date with current data, without the need for modifications to the NcML file itself. If the filenames have a timestamp encoded in them, the use of the dateFormatMark allows for automatic creation of the new coordinate variable data values as well, as shown below.

The scan element may be used to search a directory to find files that match the following criteria:

  • Suffix : the aggregated files end in a specific suffix, indicating the file type
  • Subdirectories: any subdirectories of the given location are to be searched and all regular files tested against the criteria
  • Older Than: the aggregated files must have been modified more than some duration ago (to exclude files that may still be in the process of being written)
  • Reg Exp: the aggregated file pathnames must match a specific regular expression
  • Date Format Mark: this criterion, used in conjunction with the others, allows the specification of a pattern in the filename which encodes a timestamp. The timestamp is extracted from the filenames using the pattern and is used to create ISO 8601 date values for the new dimension’s coordinate variable.

We will give examples of each of these criteria in use in our tutorial. Again, if more than one is specified, then ALL must match for the file to be included in the aggregation.
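A sketch combining several criteria (the directory, variable name, and filename pattern are all hypothetical):

```xml
<netcdf title="Dynamic aggregation via scan criteria">
  <aggregation type="joinNew" dimName="fileTime">
    <variableAgg name="SST"/>
    <!-- All criteria must match: ".nc" suffix, recursive search,
         modified more than 10 minutes ago, and a timestamp encoded
         in the filename after the "#" mark (e.g. sst_20100115.nc) -->
    <scan location="data/ncml/sst/"
          suffix=".nc"
          subdirs="true"
          olderThan="10 min"
          dateFormatMark="sst_#yyyyMMdd"/>
  </aggregation>
</netcdf>
```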

11.E.3. JoinExisting Aggregation
Introduction

A joinExisting aggregation joins multiple granule datasets by concatenating the specified outer dimensional data from the granules into the output. This results in matrices with the same number of dimensions, but with larger outer dimension cardinality. The outer dimension sizes may vary across granules, but any inner dimensions of multi-dimensional data are still required to match.

The reader is also directed to a basic tutorial of this NcML aggregation, which may be found at http://www.unidata.ucar.edu/software/netcdf/ncml/v2.2/Aggregation.html#joinExisting. Note that version 1.1.0 of the module does not support all features of joinExisting.

Content Summary

This section describes the behavior of the initial implementation of joinExisting for version 1.2.x of the NcML Module, bundled with Hyrax 1.8. It is a limited feature set described below. Please see the Limitations section for more information.

In version 1.2.x, a joinExisting aggregation may be specified in three ways:

  • Using an explicit list of netcdf elements with the ncoords attribute correctly specified for all of them.
  • Leaving off the ncoords attribute for all of the netcdf elements.
  • Using a scan element with ncoords specified and all matching granule datasets having this dimension size.

Our example below will clarify this.

Future versions of the module will implement more of the joinExisting feature set.

Examples

Here we give an example that illustrates the functionality offered by the current version of the aggregation. This example may also be found on…

http://test.opendap.org:8090/opendap/ioos/mday_joinExist.ncml

…with the data granules located in

http://test.opendap.org:8090/opendap/coverage/mday/

Granules

Assume we have some number of granule datasets with a DDS the same as the following (modulo the dataset name):

Dataset {
    Grid {
      Array:
        Float32 PHssta[time = 1][altitude = 1][lat = 4096][lon = 8192];
      Maps:
        Float64 time[time = 1];
        Float64 altitude[altitude = 1];
        Float64 lat[lat = 4096];
        Float64 lon[lon = 8192];
    } PHssta;
} PH2006001_2006031_ssta.nc;
Explicit Listing of Granules

We see that here time is the outer dimension, which is the only dimension we may join along (it is an error to specify an inner dimension). Given some number of granules with this same shape, consider the following explicit joinExisting aggregation:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinExisting test on netcdf Grid granules">
  <aggregation type="joinExisting" dimName="time" >
    <!-- Note explicit use of ncoords specifying size of "time" -->
    <netcdf location="/coverage/mday/PH2006001_2006031_ssta.nc" ncoords="1"/>
    <netcdf location="/coverage/mday/PH2006032_2006059_ssta.nc" ncoords="1"/>
    <netcdf location="/coverage/mday/PH2006060_2006090_ssta.nc" ncoords="1"/>
  </aggregation>
</netcdf>

Here’s the same aggregation using the scan element instead of explicitly listing each file:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinExisting test on netcdf Grid granules using scan">
  <aggregation type="joinExisting" dimName="time" >
    <scan location="/coverage/mday/" suffix=".nc"/>
  </aggregation>
</netcdf>

First, note that for this version of the module the ncoords attribute should be specified on each individually listed granule; in many cases the handler will be more efficient when ncoords is used. Note also that we specify dimName: any data array whose outer dimension has this name will be subject to aggregation in the output.

Serving this from Hyrax will result in the following DDS:

Dataset {
    Grid {
      Array:
        Float32 PHssta[time = 3][altitude = 1][lat = 4096][lon = 8192];
      Maps:
        Float64 time[time = 3];
        Float64 altitude[altitude = 1];
        Float64 lat[lat = 4096];
        Float64 lon[lon = 8192];
    } PHssta;
    Float64 time[time = 3];
} mday_joinExist.ncml;

We see that the time dimension now has size 3, reflecting that we joined three granule datasets.

Also notice that the map vector for the joined dimension, time, has been duplicated as a sibling of the Grid at the top level of the dataset. This is done automatically by the aggregation, and its values are copied into the actual map of the Grid. The copy facilitates datasets which have multiple Grids that are to be joined --- the top-level map vector is used as the canonical template map, which is then copied into the maps of all the aggregated Grids. In the case of joined data of type Array, this vector would already exist as the coordinate variable for the data matrix. Since this is the source map for all aggregated Grids, any attribute (metadata) changes should be made on this top-level coordinate variable so that the metadata is shared among all the aggregated Grid map vectors.
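For instance (the attribute name and value are illustrative), metadata for the joined dimension should be set on the top-level coordinate variable, from which it is shared by every aggregated Grid's map. Per the earlier note on lexical scope, variable@type may be left empty when addressing an existing variable:

```xml
<netcdf title="Metadata on the joined dimension's coordinate variable">
  <aggregation type="joinExisting" dimName="time">
    <scan location="/coverage/mday/" suffix=".nc"/>
  </aggregation>
  <!-- Modify the canonical top-level "time" map; the change is shared
       by the map vectors of all aggregated Grids -->
  <variable name="time" type="">
    <attribute name="long_name" type="string" value="Time of composite"/>
  </variable>
</netcdf>
```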

Using the Scan Element

The collection of member datasets in a joinExisting aggregation can be specified using the NcML scan element as described in the dynamic aggregation tutorial.

NcML Dimension Cache

If the scan element is used without the ncoords extension (see below), then the first time a joinExisting aggregation is accessed (say, by requesting its DDS) the BES process will open every file in the aggregation and cache its dimension information in the NcML dimension cache. By default the cache files are written into /tmp and the total size of the cache is limited to a maximum of 2GB. These settings can be changed by modifying the ncml.conf file, typically located at /etc/bes/modules/ncml.conf:

#-----------------------------------------------------------------------#
# NcML Aggregation Dimension Cache Parameters                           #
#-----------------------------------------------------------------------#
# Directory into which the cache files will be stored.
NCML.DimensionCache.directory=/tmp
# Filename prefix to be used for the cache files
NCML.DimensionCache.prefix=ncml_dimension_cache
# This is the size of the cache in megabytes; e.g., 2,000 is a 2GB cache
NCML.DimensionCache.size=2000
# Maximum number of dimensions allowed in any particular dataset.
# If not set in this configuration, the value defaults to 100.
# NCML.DimensionCache.maxDimensions=100

The cache files are small compared to the source dataset files, typically less than 1 KB for a dataset with a few named dimensions. However, the cache files are numerous: one for each file used in a joinExisting aggregation. If you have large joinExisting aggregations, be sure that the NCML.DimensionCache.directory has space to contain the cache and that NCML.DimensionCache.size is set to an appropriately large value.

Because the first access of the aggregation triggers the population of the NcML dimension cache for that aggregation, the time for this first access can be significant; typical HTTP clients may time out before the request completes. If a client timeout occurs, the dimension cache may not be fully populated; however, subsequent requests will cause the cache population to pick up where it left off.

With only a modicum of effort, one could write a shell program that uses the BES standalone functionality to pre-populate the dimension caches for large joinExisting aggregations.
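Since the first DDS request is what populates the cache, a script that requests each aggregation's DDS ahead of time accomplishes the same thing over HTTP rather than through the BES standalone tool. Here is a minimal sketch in Python; the server URL and aggregation path in the usage example are hypothetical, and a plain curl loop in shell would work equally well:

```python
# Warm the NcML dimension caches by requesting each aggregation's DDS over HTTP.
# The server root and aggregation paths passed in are site-specific.
from urllib.parse import urljoin
from urllib.request import urlopen

def dds_urls(server_root, ncml_paths):
    """Build the DDS request URL for each NcML aggregation file."""
    return [urljoin(server_root, path) + ".dds" for path in ncml_paths]

def warm_caches(server_root, ncml_paths, timeout=3600):
    """Request each DDS with a generous timeout; the first request opens
    every granule in the aggregation and populates the dimension cache."""
    for url in dds_urls(server_root, ncml_paths):
        with urlopen(url, timeout=timeout) as response:
            response.read()  # discard the body; the request itself warms the cache
```

Typical usage would be something like `warm_caches("http://localhost:8080/opendap/", ["coverage/mday_joinExist.ncml"])`, run once after adding or changing an aggregation.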

ncoords Extension

If all of the granules are of uniform dimensional size, we may also use the syntactic sugar provided by a Hyrax-specific extension to NcML: adding the ncoords attribute to a scan element. This extension sets the ncoords value for each granule matching the scan, as if the datasets were each listed explicitly with that value of the attribute. Here's an example of using the syntactic sugar that results in exactly the same aggregation as the previous explicit one:

<?xml version="1.0" encoding="UTF-8"?>
<!-- joinExisting test on netcdf granules using scan@ncoords extension-->
<netcdf title="joinExisting test on netcdf Grid granules using scan@ncoords"
    >
  <attribute name="Description" type="string"
         value=" joinExisting test on netcdf Grid granules using scan@ncoords"/>
  <aggregation type="joinExisting"
           dimName="time" >
    <!-- Filenames have lexicographic and chronological ordering match -->
    <scan location="/coverage/mday"
      subdirs="false"
      suffix=".nc"
      ncoords="1"
      />
  </aggregation>
</netcdf>

…which we see results in the same DDS:

Dataset {
    Grid {
      Array:
        Float32 PHssta[time = 3][altitude = 1][lat = 4096][lon = 8192];
      Maps:
        Float64 time[time = 3];
        Float64 altitude[altitude = 1];
        Float64 lat[lat = 4096];
        Float64 lon[lon = 8192];
    } PHssta;
    Float64 time[time = 3];
} mday_joinExist.ncml;

The advantage of this is that the server does not have to inspect all of the member granules to determine their dimensional sizes, which allows the server to produce responses much more quickly.

Limitations

The current version implements only basic functionality. If there is extended functionality that you need, please email <support@opendap.org> to let us know!

Join Dimension Sizes Should Be Explicitly Declared

As we have seen, the most important limitation of the JoinExisting aggregation support is that the ncoords attribute should be specified for efficiency reasons. Future versions will continue to relax this requirement. The problem is that the size of the output join dimension depends on checking the DDS of every granule in the aggregation, which is computationally expensive for large aggregations.

Source of Data for Aggregated Coordinate Variable on Join Dimension

This version does not allow the join dimension's data to be declared explicitly in the NcML as the NcML tutorial page describes. This version automatically aggregates all variables whose outer dimension matches the dimName. This includes the coordinate variable (the map vector in the case of Grids) for the join dimension. This data cannot be overridden; it is always pulled from the files. Currently the TDS lists about five ways this data can be specified in addition to pulling it from the granules; we can only pull it from the granules now, which seems the most common use.

Source of Join Dimension Metadata

The metadata for the coordinate variable is pulled from the first granule dataset. Modification of coordinate variable metadata is not fully supported yet.

11.E.4. Union Aggregation
Introduction

The current trunk version of the module supports the union aggregation element of the form:

<netcdf>
  <aggregation type="union">
      <!-- some <netcdf> nodes -->
  </aggregation>
</netcdf>
Functionality

The union aggregation specifies the attributes and variables (and perhaps dimensions) for the dataset it is contained within (i.e., its parent <netcdf> node, which must be virtual, in other words, have no location specified). To do this it…

  • Processes each child netcdf element recursively, creating the final transformed dataset
  • Scans the processed child datasets in order of specification and:
    • Adds to the parent dataset any attribute, variable, or dimension that doesn’t already exist in the parent dataset
    • Skips any attribute or variable that already exists in the parent dataset
    • Skips any dimension already in the parent dataset, unless the lengths do not match, in which case it throws a parse error.

Note that the module processes each child dataset entirely as if it were a top level element, obeying all the normal processing for a dataset, but collecting the result into that netcdf node. This means that any child netcdf of an aggregation may refer to a location, have transformations applied to it, have metadata removed, or may even contain its own nested aggregation!

Which items will show up in the output? We need to discuss this in a little more detail, particularly since we have deviated slightly from the Unidata implementation.

Order of Element Processing

The NCML Module processes the nodes in a <netcdf> element in the order encountered. This means that the parent dataset of an aggregation may place attributes and variables into the union prior to the aggregation taking place, so that items with matching names in the aggregation itself will be skipped. It also implies that any changes to existing metadata within a member of the aggregation (by using an attribute element, for example) must come AFTER the actual aggregation element, or else a parse error will be thrown.

Shadowing an Aggregation Member

For example, the following example shows how to "shadow" a variable contained in an aggregation by specifying it in the parent dataset prior to the aggregation:

<netcdf>
  <variable name="Foo" type="string">
    <values>I come before the aggregation, so will appear in the output!</values>
  </variable>
  <aggregation type="union">
    <netcdf>
      <variable name="Foo" type="string">
    <values>I will be skipped since there's a Foo in the dataset prior to the aggregation.</values>
      </variable>
   </netcdf>
    <netcdf>
      <variable name="Bar" type="string">
    <values>I do not exist prior, so will be in the output!</values>
      </variable>
    </netcdf>
  </aggregation>
</netcdf>

The values make it clear what the output will be. The variable "Foo" in the first child will be skipped since the parent dataset already specified it, but the variable "Bar" in the second child dataset will show up in the output since it doesn’t already exist in either the parent or the previous child. Note that this would also work on an attribute or dimension.

Modifying the "Winner" of the Union Aggregation

The following example shows how to modify the "winning" variable in a union aggregation by specifying the attribute change AFTER the aggregation element:

<netcdf>
  <aggregation type="union">
    <netcdf>
      <variable name="Foo" type="string">
    <attribute name="Description" type="string" value="Winning Foo before we modify, should NOT be in output!"/>
    <values>I am the winning Foo!</values>
      </variable>
    </netcdf>
    <netcdf>
      <variable name="Foo" type="string">
    <attribute name="Description" type="string" value="I will be the losing Foo and should NOT be in output!"/>
    <values>I am the losing Foo!</values>
      </variable>
    </netcdf>
  </aggregation>
  <!-- Now we modify the "winner" of the previous union -->
  <variable name="Foo">
    <attribute name="Description" type="string" value="I am Foo.Description and have modified the winning Foo and deserve to be in the output!"/>
  </variable>
</netcdf>

In this case, the output dataset will have the variable Foo with a value of "I am the winning Foo!", but its metadata will have been modified by the transformation after the aggregation, so its attribute "Description" will have the value "I am Foo.Description and have modified the winning Foo and deserve to be in the output!".

If this entire netcdf element were contained within another aggregation, then other transformations might be applied after the fact as well, again in the order encountered for clarity.

Dimensions

Since DAP2 does not specify dimensions as explicit data items, a union of dimensions is only done if the child netcdf elements explicitly declare dimensions. In practice, this is of little utility since the only time dimensions are specified is to create virtual array variables. (Note: we do not load dimensions from wrapped datasets, so effectively they do not exist in them, even if the wrapped dataset was an NcML file!)

If a dimension does exist explicitly in a child dataset and a second dimension with the same name is encountered in another child dataset, the cardinalities are checked and a parse error is thrown if they do not match. This is a simple check to ensure the resulting arrays are of the correct size. Note that even if an array had a named dimension within a wrapped dataset, we do not check that these match at this time.

Here is an example of a valid use of dimension in the current module:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- Test that a correct union with dimensions in the virtual datasets will work if the dimensions match as they need to -->
  <attribute name="title" type="string" value="Testing union with dimensions"/>
  <aggregation type="union">
    <netcdf>
      <attribute name="Description" type="string" value="The first dataset"/>
      <dimension name="lat" length="5"/>
      <!-- A variable that uses the dimension, this one will be used -->
      <variable name="Grues" type="int" shape="lat">
    <attribute name="Description" type="string">I should be in the output!</attribute>
    <values>1 3 5 3 1</values>
      </variable>
    </netcdf>
    <netcdf>
      <attribute name="Description" type="string" value="The second dataset"/>
      <!-- This dimension will be skipped, but the length matches the previous as required -->
      <dimension name="lat" length="5"/>
      <!-- This dimension is new so will be used... -->
      <dimension name="station" length="3"/>
      <!-- A variable that uses it, this one will NOT be used -->
      <variable name="Grues" type="int" shape="lat">
    <attribute name="Description" type="string">!!!! I should NOT be in the output! !!!!</attribute>
    <values>-3 -5 -7 -3 -1</values>
      </variable>
      <!-- This variable uses both and will show up in output correctly -->
      <variable name="Zorks" type="int" shape="station lat">
    <attribute name="Description" type="string">I should be in the output!</attribute>
    <values>
      1  2   3   4   5
      2  4   6   8  10
      4  8  12 16 20
    </values>
      </variable>
   </netcdf>
  </aggregation>
</netcdf>

Here is an example that will produce a dimension mismatch parse error:

<netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2">
  <!-- Test that a union with dimensions in the virtual datasets will ERROR if the child set dimensions DO NOT match as they need to -->
  <attribute name="title" type="string" value="Testing union with dimensions"/>
  <aggregation type="union">
    <netcdf>
      <dimension name="lat" length="5"/>
      <!-- A variable that uses the dimension, this one will be used -->
      <variable name="Grues" type="int" shape="lat">
    <attribute name="Description" type="string">I should be in the output!</attribute>
    <values>1 3 5 3 1</values>
      </variable>
    </netcdf>
    <netcdf>
      <!-- This dimension WOULD be skipped, but does not match the representative and will cause an error on union! -->
      <dimension name="lat" length="6"/>
     <!-- This dimension is new so will be used... -->
      <dimension name="station" length="3"/>
      <!-- A variable that uses it, this one will NOT be used -->
      <variable name="Grues" type="int" shape="lat">
    <attribute name="Description" type="string">!!!! I should NOT be in the output! !!!!</attribute>
    <values>-3 -5 -7 -3 -3 -1</values>
      </variable>
      <!-- This variable uses both and will show up in output correctly -->
      <variable name="Zorks" type="int" shape="station lat">
    <attribute name="Description" type="string">I should be in the output!</attribute>
    <values>
      1  2   3   4   5  6
      2  4   6   8  10  12
      4  8  12 16 20  24
    </values>
      </variable>
   </netcdf>
  </aggregation>
</netcdf>

Note that the failure here is that the second dataset has an extra "lat" sample that the prior dataset does not. Again, these dimension checks currently occur only in a purely virtual dataset like the one shown here. Using netcdf@location will effectively "hide" all the dimensions within it at this point.

Thoughts About Future Directions for Dimension

For a future implementation, we may want to consider a DAP2 Grid map vector as a dimension and do cardinality checks on them if we have multiple Grids in a union, each of which specifies the same names for its map vectors. One argument is that this should be done if an explicit dimension element with the map vector name is specified in the parent dataset and is explicitly specified as "isShared". Although DAP2 does not have shared dimensions, this would be a basic first step toward the error checking that will have to be done for shared dimensions.

Notes About Changes from NcML 2.2 Implementation

In the Aggregation tutorial, it is mentioned that in a given <netcdf> node, the <aggregation> element is processed prior to any other nodes, which reflects an explicitly DOM-based implementation of the NcML parser. Since we are using a SAX parser for efficiency, we cannot follow this prescription. Instead, we process the elements in the order encountered. We argue that this approach, while more efficient, also allows more explicit control over which attributes and variables show up in the parent dataset of the aggregation. The examples above show the extra power gained by allowing elements to be added to the resultant dataset before or after the aggregation has been processed. In particular, it lets us shadow potential members of the aggregation.

11.E.5. JoinNew Explicit Dataset Tutorial
Default Values for the New Coordinate Variable (on a Grid)

The default for the new coordinate variable is to be of type String with the location of the dataset as the value. For example, the following NcML file…

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Simple test of joinNew Grid aggregation">
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/>
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/>
  </aggregation>
</netcdf>

…specifies an aggregation on a Grid variable dsp_band_1 sampled in four HDF4 datasets listed explicitly.

First, the data structure (DDS) is:

Dataset {
    Grid {
      Array:
        UInt32 dsp_band_1[filename = 4][lat = 1024][lon = 1024];
      Maps:
        String filename[filename = 4];
        Float64 lat[1024];
        Float64 lon[1024];
    } dsp_band_1;
    String filename[filename = 4];
} joinNew_grid.ncml;

We see the aggregated variable dsp_band_1 has the new outer dimension filename. A coordinate variable filename[filename] was created as a sibling of the aggregated variable (the top-level Grid we specified) and was also copied into the aggregated Grid as a new map vector.

The ASCII data response for just the new coordinate variable filename[filename] is:

String filename[filename = 4] = {"data/ncml/agg/grids/f97182070958.hdf",
"data/ncml/agg/grids/f97182183448.hdf",
"data/ncml/agg/grids/f97183065853.hdf",
"data/ncml/agg/grids/f97183182355.hdf"};

We see that the location we specified for each dataset, as a String, is the value of the corresponding element of the new coordinate variable.

The newly added map dsp_band_1.filename contains a copy of this data.

Explicitly Specifying the New Coordinate Variable

If the author wishes to have the new coordinate variable be of a specific data type with non-uniform values, then they must specify the new coordinate variable explicitly.

Array Virtual Dataset

Here’s an example using a contrived pure virtual dataset:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="JoinNew on Array with Explicit Map">
  <!-- joinNew and form new outer dimension "day" -->
  <aggregation type="joinNew" dimName="day">
    <variableAgg name="V"/>
    <netcdf title="Slice 1">
      <dimension name="sensors" length="3"/>
      <variable name="V" type="int" shape="sensors">
    <values>1 2 3</values>
      </variable>
    </netcdf>
    <netcdf title="Slice 2">
      <dimension name="sensors" length="3"/>
      <variable name="V" type="int" shape="sensors">
    <values>4 5 6</values>
      </variable>
    </netcdf>
  </aggregation>
  <!-- This is recognized as the definition of the new coordinate variable,
       since it has the form day[day] where day is the dimName for the aggregation.
       It MUST be specified after the aggregation, so that the dimension size of day
      has been calculated.
  -->
  <variable name="day" type="int" shape="day">
    <!-- Note: metadata may be added here as normal! -->
    <attribute name="units" type="string">Days since 01/01/2010</attribute>
    <values>1 30</values>
  </variable>
</netcdf>

The resulting DDS:

Dataset {
    Int32 V[day = 2][sensors = 3];
    Int32 day[day = 2];
} joinNew_with_explicit_map.ncml;

…and the ASCII data:

Int32 V[day = 2][sensors = 3] = {{1, 2, 3},{4, 5, 6}};
Int32 day[day = 2] = {1, 30};

Note that the values we have explicitly given are used here, as well as the specified NcML type int, which is mapped to a DAP Int32.

If metadata is desired on the new coordinate variable, it may be added just as in a normal new variable declaration. We’ll give more examples of this later.

Grid with Explicit Map

Let’s give one more example using a Grid to demonstrate the recognition of the coordinate variable as it is added to the Grid as the map vector for the new dimension:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation with explicit map">
  <aggregation type="joinNew" dimName="sample_time">
    <variableAgg name="dsp_band_1"/>
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/>
  </aggregation>
  <!-- Note: values are contrived -->
  <variable name="sample_time" shape="sample_time" type="float">
    <!-- Metadata here will also show up in the Grid map -->
    <attribute name="units" type="string">Days since 01/01/2010</attribute>
    <values>100 200 400 1000</values>
  </variable>
</netcdf>

This produces the DDS:

Dataset {
    Grid {
      Array:
        UInt32 dsp_band_1[sample_time = 4][lat = 1024][lon = 1024];
      Maps:
        Float32 sample_time[sample_time = 4];
        Float64 lat[1024];
        Float64 lon[1024];
    } dsp_band_1;
    Float32 sample_time[sample_time = 4];
} joinNew_grid_explicit_map.ncml;

You can see the explicit coordinate variable sample_time was found as the sibling of the aggregated Grid and was added as the new map vector for the Grid.

The values for the projected coordinate variables are as expected:

Float32 sample_time[sample_time = 4] = {100, 200, 400, 1000};

Errors

It is a Parse Error to…

  • Give a different number of values for the explicit coordinate variable than there are specified datasets.
  • Specify the new coordinate variable prior to the <aggregation> element since the dimension size is not yet known.
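To illustrate the second error, here is a sketch (adapted from the earlier Array example above, not a file from the test suite) in which the coordinate variable is declared before the <aggregation> element, which would trigger a Parse Error:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="INVALID: coordinate variable declared before the aggregation">
  <!-- Parse Error: the size of the "day" dimension is not yet known here -->
  <variable name="day" type="int" shape="day">
    <values>1 30</values>
  </variable>
  <aggregation type="joinNew" dimName="day">
    <variableAgg name="V"/>
    <netcdf title="Slice 1">
      <dimension name="sensors" length="3"/>
      <variable name="V" type="int" shape="sensors">
        <values>1 2 3</values>
      </variable>
    </netcdf>
  </aggregation>
</netcdf>
```

Moving the <variable name="day"> element after the closing </aggregation> tag, as in the earlier example, makes the file valid.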
Autogenerated Uniform Numeric Values

If the number of datasets might vary (for example, if a <scan> element, described later, is used), but the values are uniform, the start/increment version of the <values> element may be used to generate the values for the new coordinate variable. For example…

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="JoinNew on Array with Explicit Autogenerated Map">
  <aggregation type="joinNew" dimName="day">
    <variableAgg name="V"/>
    <netcdf title="Slice 1">
      <dimension name="sensors" length="3"/>
      <variable name="V" type="int" shape="sensors">
    <values>1 2 3</values>
      </variable>
    </netcdf>
    <netcdf title="Slice 2">
      <dimension name="sensors" length="3"/>
      <variable name="V" type="int" shape="sensors">
    <values>4 5 6</values>
      </variable>
    </netcdf>
  </aggregation>
  <!-- Explicit coordinate variable definition -->
  <variable name="day" type="int" shape="day">
    <attribute name="units" type="string" value="days since 2000-01-01 00:00"/>
    <!-- We sample once a week... -->
    <values start="1" increment="7"/>
  </variable>
</netcdf>

The DDS is the same as before, and the coordinate variable is generated as expected:

Int32 day[day = 2] = {1, 8};

Note that this form is useful for uniformly sampled datasets (or when only a numeric index is desired), where the variable need not be changed as datasets are added. It is especially useful for a <scan> element that refers to a dynamic number of files that can be described with a uniformly varying index.

Explicitly Using coordValue Attribute of <netcdf>

The netcdf@coordValue attribute may be used to specify the value for a given dataset right where the dataset is declared. This attribute will cause a coordinate variable to be automatically generated with the given value filled in for each dataset. The new coordinate variable will be of type double if the coordValues can all be parsed as numbers; otherwise it will be of type String.

String coordValue Example

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Aggregation with explicit string coordValue">
  <aggregation type="joinNew" dimName="source">
    <variableAgg name="u"/>
    <variableAgg name="v"/>
    <!-- Same dataset a few times, but with different coordVal -->
    <netcdf title="Dataset 1" location="data/ncml/fnoc1.nc" coordValue="Station_1"/>
    <netcdf title="Dataset 2" location="data/ncml/fnoc1.nc" coordValue="Station_2"/>
    <netcdf title="Dataset 3" location="data/ncml/fnoc1.nc" coordValue="Station_3"/>
  </aggregation>
</netcdf>

This results in the following DDS:

Dataset {
    Int16 u[source = 3][time_a = 16][lat = 17][lon = 21];
    Int16 v[source = 3][time_a = 16][lat = 17][lon = 21];
    Float32 lat[lat = 17];
    Float32 lon[lon = 21];
    Float32 time[time = 16];
    String source[source = 3];
} joinNew_string_coordVal.ncml;

…and ASCII data response of the projected coordinate variable is:

String source[source = 3] = {"Station_1", "Station_2", "Station_3"};

…as we specified.

Numeric (double) Use of coordValue

If the first coordValue can be successfully parsed as a double, then a coordinate variable of type double (Float64) is created, and all remaining coordValue specifications must also be parsable as doubles or a Parse Error is thrown.

Using the same example but with numbers instead:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Aggregation with numeric coordValue">
  <aggregation type="joinNew" dimName="source">
    <variableAgg name="u"/>
    <variableAgg name="v"/>
    <!-- Same dataset a few times, but with different coordVal -->
    <netcdf title="Dataset 1" location="data/ncml/fnoc1.nc" coordValue="1.2"/>
    <netcdf title="Dataset 2" location="data/ncml/fnoc1.nc" coordValue="3.4"/>
    <netcdf title="Dataset 3" location="data/ncml/fnoc1.nc" coordValue="5.6"/>
  </aggregation>
</netcdf>

This time we see that a Float64 array is created:

Dataset {
    Int16 u[source = 3][time_a = 16][lat = 17][lon = 21];
    Int16 v[source = 3][time_a = 16][lat = 17][lon = 21];
    Float32 lat[lat = 17];
    Float32 lon[lon = 21];
    Float32 time[time = 16];
    Float64 source[source = 3];
} joinNew_numeric_coordValue.ncml;

The values we specified are in the coordinate variable ASCII data:

Float64 source[source = 3] = {1.2, 3.4, 5.6};
11.E.6. Metadata on Aggregations Tutorial
Metadata Specification on the New Coordinate Variable

We can add metadata to the new coordinate variable in two ways:

  • Adding it to the <variable> element directly, in the case where the new coordinate variable and its values are defined explicitly
  • Adding the metadata to an automatically created coordinate variable by leaving the <values> element out

The first case we have already seen, but we will show it again explicitly. The second case is a little different and we’ll cover it separately.

Adding Metadata to the Explicit New Coordinate Variable

We have already seen examples of explicitly defining the new coordinate variable and giving its values. In these cases, the metadata is added to the new coordinate variable exactly like any other variable. Let’s see the example again:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation with explicit map">
  <aggregation type="joinNew" dimName="sample_time">
    <variableAgg name="dsp_band_1"/>
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/>
  </aggregation>
  <variable name="sample_time" shape="sample_time" type="float">
    <!-- Metadata here will also show up in the Grid map -->
    <attribute name="units" type="string">Days since 01/01/2010</attribute>
    <values>100 200 400 1000</values>
  </variable>
</netcdf>

We see that the units attribute for the new coordinate variable has been specified. This subset of the DAS (we don’t show the extensive global metadata) shows this:

dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        sample_time {
 --->           String units "Days since 01/01/2010";
        }
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
    sample_time {
--->        String units "Days since 01/01/2010";
    }

We show the new metadata with the "--->" marker. Note that the metadata for the coordinate variable is also copied into the new map vector of the aggregated Grid.

Metadata can be specified in this way for any case where the new coordinate variable is listed explicitly.

Adding Metadata to An Autogenerated Coordinate Variable

If we expect the coordinate variable to be automatically added, we can also specify its metadata by referring to the variable without setting its values. This is useful in the case of using netcdf@coordValue and we will also see it is very useful when using a <scan> element for dynamic aggregations.

Here’s a trivial example using the default case of the filename:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Test of adding metadata to the new map vector in a joinNew Grid aggregation">
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/>
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/>
  </aggregation>
  <!--
       Add metadata to the created new outer dimension variable after
       the aggregation is defined by using a placeholder variable
       whose values will be defined automatically by the aggregation.
  -->
  <variable type="string" name="filename">
    <attribute name="units" type="string">Filename of the dataset</attribute>
  </variable>
</netcdf>

Note here that we simply left out the <values> element since we want the values to be generated automatically by the aggregation. Note also that this is almost the same way we'd modify an existing variable's metadata. The only difference is that we need to "declare" the type of the variable here, since technically the variable specified here is a placeholder for the generated coordinate variable. So after the aggregation is specified, we are simply modifying the created variable's metadata, in this case the newly generated map vector.

Here is the DAS portion with just the aggregated Grid and the new coordinate variable:

  dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        filename {
            String units "Filename of the dataset";
        }
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
    filename {
        String units "Filename of the dataset";
    }

Here also the map vector gets a copy of the coordinate variable’s metadata.

We can also use this syntax in the case that netcdf@coordValue was used to autogenerate the coordinate variable:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation with coordValue and metadata">
  <aggregation type="joinNew" dimName="sample_time">
    <variableAgg name="dsp_band_1"/>
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf" coordValue="1"/>
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf" coordValue="10"/>
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf" coordValue="15"/>
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf" coordValue="25"/>
  </aggregation>
  <!-- Note: values are contrived -->
  <variable name="sample_time" shape="sample_time" type="double">
    <attribute name="units" type="string">Days since 01/01/2010</attribute>
  </variable>
</netcdf>

Here we see the metadata added to the new coordinate variable and associated map vector:

Attributes {
   dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        sample_time {
 --->           String units "Days since 01/01/2010";
        }
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
    sample_time {
--->        String units "Days since 01/01/2010";
    }
}

Parse Errors

Since the processing of the aggregation takes a few steps, care must be taken in specifying the coordinate variable in the cases of autogenerated variables.

In particular, it is a Parse Error…

  • To specify the shape of the autogenerated coordinate variable if <values> are not set
  • To leave out the type or to use a type that does not match the autogenerated type

The second can be somewhat tricky to remember since, for existing variables, the type can be safely left out and the variable will be "found". Since aggregations are processed fully when the <netcdf> element containing them is closed, the specified coordinate variables in these cases are placeholders for the automatically generated variables, so they must match in name and type, but must not specify a shape, since the shape (the size of the new aggregation dimension) is not known until this processing occurs.
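For example, adapting the placeholder example from earlier in this section (this sketch is illustrative, not from the test suite), declaring a shape on the placeholder would trigger the first Parse Error:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="INVALID: placeholder coordinate variable declares a shape">
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/>
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/>
  </aggregation>
  <!-- Parse Error: a placeholder with no <values> must not specify a shape,
       since the size of the "filename" dimension comes from the aggregation -->
  <variable name="filename" type="string" shape="filename">
    <attribute name="units" type="string">Filename of the dataset</attribute>
  </variable>
</netcdf>
```

Dropping the shape attribute, as in the earlier placeholder example, makes the file valid.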

Metadata Specification on the Aggregation Variable Itself

It is also possible to add or modify the attributes on the aggregation variable itself. If it is a Grid, metadata can be modified on the contained array or maps as well. Note that the aggregated variable begins with the metadata from the first dataset specified in the aggregation just like in a union aggregation.

We will use a Grid as our primary example since other datatypes are similar but simpler, so this case covers those as well.

An Aggregated Grid example

Let’s start from this example aggregation:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf>
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/>
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/>
  </aggregation>
</netcdf>

Here is the DAS for this unmodified aggregated Grid (with the global dataset metadata removed):

Attributes {
   dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        filename {
        }
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
    filename {
    }
}

We will now add attributes to all the existing parts of the Grid:

  • The Grid Structure itself
  • The Array of data within the Grid
  • Both existing map vectors (lat and lon)

We have already seen how to add data to the new coordinate variable as well.

Here’s the NcML we will use. Note we have added units data to the subparts of the Grid, and also added some metadata to the grid itself.

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Showing how to add metadata to all parts of an aggregated grid">
  <aggregation type="joinNew" dimName="filename">
    <variableAgg name="dsp_band_1"/>
    <netcdf location="data/ncml/agg/grids/f97182070958.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97182183448.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>
    <netcdf location="data/ncml/agg/grids/f97183182355.hdf"/>
  </aggregation>
  <variable name="dsp_band_1" type="Structure"> <!-- Enter the Grid level scope -->
1)  <attribute name="Info" type="String">This is metadata on the Grid itself.</attribute>
    <variable name="dsp_band_1"> <!-- Enter the scope of the Array dsp_band_1 -->
2)    <attribute name="units" type="String">Temp (packed)</attribute> <!-- Units of the array -->
    </variable> <!-- dsp_band_1.dsp_band_1 -->
    <variable name="lat"> <!-- dsp_band_1.lat map -->
3)    <attribute name="units" type="String">degrees_north</attribute>
    </variable>
    <variable name="lon"> <!-- dsp_band_1.lon map -->
4)    <attribute name="units" type="String">degrees_east</attribute>
    </variable> <!-- dsp_band_1.lon map -->
  </variable> <!-- dsp_band_1 Grid -->
  <!-- Note well: this is a new coordinate variable, so it requires the correct type.
  Also note that it falls outside of the actual Grid: we must specify it
  as a sibling coordinate variable, and it is added to the Grid as a map when the
  enclosing netcdf element is closed.
  -->
  <variable name="filename" type="String">
5)  <attribute name="Info" type="String">Filename with timestamp</attribute>
  </variable> <!-- filename -->
</netcdf>

Here we show metadata being injected in several ways, denoted by the 1) through 5) notations.

1) We are inside the scope of the top-level Grid variable, so this metadata will show up in the attribute table inside the Grid Structure.

2) This is the actual data Array of the Grid, dsp_band_1.dsp_band_1. We specify that the units are a packed temperature.

3) Here we are in the scope of a map variable, dsp_band_1.lat. We add the units specification to this map.

4) Likewise, we add units to the lon map vector.

5) Finally, we must close the actual Grid and specify the metadata for the NEW coordinate variable as a sibling of the Grid, since this variable is used as the canonical prototype to be added to all Grids aggregated on the new dimension. Note that in this case (unlike the previous cases) the type of the new coordinate variable is required: we are specifying a "placeholder" variable for the new map, which is processed only once its containing <netcdf> element is closed (i.e. once all data is available to it).

The resulting DAS (with global dataset metadata removed for clarity):

Attributes {
... global data clipped ...
  dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
1)   String Info "This is metadata on the Grid itself.";
        filename {
5)       String Info "Filename with timestamp";
        }
        dsp_band_1 {
2)        String units "Temp (packed)";
        }
        lat {
            String name "lat";
            String long_name "latitude";
3)        String units "degrees_north";
        }
        lon {
            String name "lon";
            String long_name "longitude";
4)        String units "degrees_east";
        }
    }
    filename {
5)    String Info "Filename with timestamp";
    }
}

We have annotated the DAS with numbers representing which lines in the NcML above correspond to the injected metadata.

11.E.7. Dynamic Aggregation Tutorial
Introduction

Dynamic aggregation is achieved through the use of the scan element.

The NcML-2.2 scan element schema:

<xsd:element name="scan" minOccurs="0" maxOccurs="unbounded">
  <xsd:complexType>
    <xsd:attribute name="location" type="xsd:string" use="required"/>
    <xsd:attribute name="regExp" type="xsd:string" />
    <xsd:attribute name="suffix" type="xsd:string" />
    <xsd:attribute name="subdirs" type="xsd:boolean" default="true"/>
    <xsd:attribute name="olderThan" type="xsd:string" />
    <xsd:attribute name="dateFormatMark" type="xsd:string" />
    <xsd:attribute name="enhance" type="xsd:string"/>
  </xsd:complexType>
</xsd:element>

This document discusses the use and significance of scan in creating dynamically aggregated datasets.

Location (Location Location…)

The most important attribute of the scan element is scan@location, which specifies the top-level search directory for the scan, relative to the BES data root directory specified in the BES configuration.

exclamation icon ALL locations are interpreted relative to the BES root directory and NOT relative to the location of the NcML file itself! This means that all data to be aggregated must be in a subdirectory of the BES root data directory and that these directories must be specified fully, not relative to the NcML file.

For example, if the BES root data dir is "/usr/local/share/hyrax", let ${BES_DATA_ROOT} refer to this location. If the NcML aggregation file is in "${BES_DATA_ROOT}/data/ncml/myAgg.ncml" and the aggregation member datasets are in "${BES_DATA_ROOT}/data/hdf4/myAggDatasets", then the location in the NcML file for the aggregation data directory would be…

<scan location="data/hdf4/myAggDatasets" />

…which specifies the data directory relative to the BES data root as required.

Again, for security reasons, the data is always searched under the BES data root. Trying to specify an absolute filesystem path, such as…

<scan location="/usr/local/share/data" />

…will NOT work. This directory will also be assumed to be a subdirectory of the ${BES_DATA_ROOT}, regardless of the preceding "/" character.
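
The resolution rule can be sketched in a few lines of Python (a simplified illustration, not the server's actual code; the paths are example values):

```python
import os.path

def resolve_scan_location(bes_data_root, location):
    # The BES treats every location as relative to its data root,
    # so a leading "/" is stripped rather than honored as absolute.
    return os.path.join(bes_data_root, location.lstrip("/"))

root = "/usr/local/share/hyrax"
# Resolves under the data root, as intended:
print(resolve_scan_location(root, "data/hdf4/myAggDatasets"))
# An "absolute" path is still treated as a subdirectory of the root:
print(resolve_scan_location(root, "/usr/local/share/data"))
```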

Suffix Criterion

The simplest criterion is to match only files of a certain datatype in a given directory. This is useful for filtering out text files and other files that may exist in the directory but which do not form part of the aggregation data.

Here’s a simple example:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Example of joinNew Grid aggregation using the scan element.">
 <aggregation type="joinNew" dimName="filename">
   <variableAgg name="dsp_band_1"/>
   <scan location="data/ncml/agg/grids" suffix=".hdf" />
 </aggregation>
</netcdf>

Assuming that the specified location "data/ncml/agg/grids" contains no subdirectories, this NcML will return all files in that directory that end in ".hdf" in alphanumerical order. In the case of our installed example data, there are four HDF4 files in that directory:

data/ncml/agg/grids/f97182070958.hdf
data/ncml/agg/grids/f97182183448.hdf
data/ncml/agg/grids/f97183065853.hdf
data/ncml/agg/grids/f97183182355.hdf

These will be included in alphanumerical order, so the scan element will in effect be equivalent to the following list of <netcdf> elements:

<netcdf location="data/ncml/agg/grids/f97182070958.hdf"/>
<netcdf location="data/ncml/agg/grids/f97182183448.hdf"/>
<netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>
<netcdf location="data/ncml/agg/grids/f97183182355.hdf"/>
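
The suffix criterion amounts to a filtered, sorted directory listing. A minimal Python sketch of that behavior (illustrative only; the handler's real implementation differs):

```python
import os

def scan_suffix(directory, suffix):
    """Return the files directly under 'directory' whose names end in
    'suffix', in alphanumerical (lexicographic) order, mimicking a
    non-recursive suffix-only scan."""
    matches = [os.path.join(directory, name)
               for name in os.listdir(directory)
               if name.endswith(suffix)]
    return sorted(matches)
```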

By default, scan will search subdirectories, which is why we noted above that "grids" contains no subdirectories. We discuss this in the next section.

Subdirectory Searching (The Default!)

If the author sets the scan@subdirs attribute to the value "true" (which is the default!), then the criteria will be applied recursively to any subdirectories of the scan@location base scan directory, as well as to any regular files in the base directory.

For example, continuing our previous example, but giving a higher level location:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation using the scan element.">
 <aggregation type="joinNew" dimName="filename">
   <variableAgg name="dsp_band_1"/>
   <scan location="data/ncml/agg/" suffix=".hdf" subdirs="true"/>
 </aggregation>
</netcdf>

Assuming that only the "grids" subdirectory of "data/ncml/agg" contains HDF4 files with that suffix, the same aggregation as before will be created; in other words, an aggregation isomorphic to:

<netcdf location="data/ncml/agg/grids/f97182070958.hdf"/>
<netcdf location="data/ncml/agg/grids/f97182183448.hdf"/>
<netcdf location="data/ncml/agg/grids/f97183065853.hdf"/>
<netcdf location="data/ncml/agg/grids/f97183182355.hdf"/>

The scan@subdirs attribute is most useful for turning off the default recursion. For example, if recursion is NOT desired and only files with the given suffix in the given directory itself are required, the following will do that:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation using the scan element.">
 <aggregation type="joinNew" dimName="filename">
   <variableAgg name="dsp_band_1"/>
   <scan location="data/ncml/agg/grids" suffix=".hdf" subdirs="false"/>
 </aggregation>
</netcdf>

OlderThan Criterion

The scan@olderThan attribute can be used to filter out files that are "too new". This feature is useful for excluding partial files currently being written by a daemon process, for example.

The value of the attribute is a duration specified by a number followed by a basic time unit. The time units recognized are as follows:

  • seconds: { s, sec, secs, second, seconds }
  • minutes: { m, min, mins, minute, minutes }
  • hours: { h, hour, hours }
  • days: { day, days }
  • months: { month, months }
  • years: { year, years }

The strings inside { } are all recognized as referring to the given time unit.

For example, if we are following our previous example, but we suspect a new HDF file may be written at any time and usually takes 5 minutes to do so, we might use the following NcML:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="joinNew Grid aggregation using the scan element.">
 <aggregation type="joinNew" dimName="filename">
   <variableAgg name="dsp_band_1"/>
   <scan location="data/ncml/agg/grids" suffix=".hdf" subdirs="false" olderThan="10 mins" />
 </aggregation>
</netcdf>

Assuming the file will always be written within 10 minutes, this file does what we wish. Only files whose modification date is older than the given duration, measured back from the current system time, are included.

NOTE that the modification date of the file, not the creation date, is used for the test.
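
The olderThan test boils down to comparing each file's modification time against "now minus the duration". A rough Python sketch, assuming a simplified unit table (months and years are approximated as 30 and 365 days here; the handler may compute them differently):

```python
import time

# Simplified seconds-per-unit table covering the recognized unit strings.
_UNITS = {
    "s": 1, "sec": 1, "secs": 1, "second": 1, "seconds": 1,
    "m": 60, "min": 60, "mins": 60, "minute": 60, "minutes": 60,
    "h": 3600, "hour": 3600, "hours": 3600,
    "day": 86400, "days": 86400,
    "month": 30 * 86400, "months": 30 * 86400,
    "year": 365 * 86400, "years": 365 * 86400,
}

def parse_duration(text):
    """Parse a duration such as '10 mins' into seconds."""
    number, unit = text.split()
    return float(number) * _UNITS[unit]

def is_old_enough(mtime, older_than, now=None):
    """True if a file modified at 'mtime' (epoch seconds) is older
    than the given duration, so it would pass the olderThan filter."""
    now = time.time() if now is None else now
    return now - mtime > parse_duration(older_than)
```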

Regular Expression Criterion

The scan@regExp attribute may be used for more complicated filename matching tests, for example where data for multiple variables live in the same directory but the filenames can be used to distinguish which files are desired in the aggregation. Additionally, since the full pathname including the location is used for the test, a regular expression may be combined with a recursive directory search to find files in subdirectories whose directory name itself is specified in the regular expression, not just the filename. We’ll give examples of both of these cases.

We also reiterate that this test is used in conjunction with any other tests --- the author may also include a suffix and an olderThan test if they wish. All criteria must match for the file to be included in the aggregation.

We recognize the POSIX regular expression syntax. For more information on regular expressions and the POSIX syntax, please see: http://en.wikipedia.org/wiki/Regular_expression.

Consider the following, basic examples:

  • Finding all subdirectories with a given name
  • Matching a filename starting with a certain substring

Matching a Subdirectory Name

Here’s an example where we use a subdirectory search to find ".hdf" files in all subdirectories named "grids":

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Example of joinNew Grid aggregation using the scan element with a regexp">
 <aggregation type="joinNew" dimName="filename">
   <variableAgg name="dsp_band_1"/>
   <scan
      location="data/"
      subdirs="true"
      regExp="^.*/grids/.+\.hdf$"
      />
 </aggregation>
</netcdf>

The regular expression here is "^.*/grids/.+\.hdf$". Let’s pull it apart quickly (this is not intended to be a regular expression tutorial):

The "^" matches the beginning of the string, so the test starts at the beginning of the location pathname. (Without this we could match substrings in the middle of strings, etc.)

We then match ".*" meaning 0 or more of any character.

We then match the "/grids/" string explicitly, meaning we want all pathnames that contain "/grids/" as a subdirectory.

We then match ".+" meaning 1 or more of any character.

We then match "\." meaning a literal "." character (the backslash "escapes" it).

We then match the suffix "hdf".

Finally, we match "$" meaning the end of the string.

So ultimately, this regular expression finds all filenames ending in ".hdf" that exist in some subdirectory named "grids" of the top-level location.

Following our previous example, if there were only the one "grids" subdirectory in the ${BES_DATA_ROOT} with our four familiar files, we’d get the same aggregation as before.
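
The effect of this regular expression is easy to check directly. Here is a short Python test of the pattern against some sample pathnames (Python's `re` engine accepts this particular POSIX-style pattern unchanged):

```python
import re

# The pattern from the scan@regExp attribute above.
pattern = re.compile(r"^.*/grids/.+\.hdf$")

paths = [
    "data/ncml/agg/grids/f97182070958.hdf",   # matches
    "data/ncml/agg/other/f97182070958.hdf",   # wrong subdirectory: no match
    "data/ncml/agg/grids/readme.txt",         # wrong suffix: no match
]
for p in paths:
    print(p, bool(pattern.match(p)))
```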

Matching a Partial Filename

Let’s say we have a given directory full of data files whose filename prefix specifies which variable they refer to. For example, let’s say our "grids" directory has files that start with "grad" as well as the files that start with "f" we have seen in our examples. We still want just the files starting with "f" to filter out the others. Here’s an example for that:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Example of joinNew Grid aggregation using the scan element with a regexp">
 <aggregation type="joinNew" dimName="filename">
   <variableAgg name="dsp_band_1"/>
   <scan
      location="data/"
      subdirs="true"
      regExp="^.*/grids/f.+\.hdf$"
      />
 </aggregation>
</netcdf>

Here we match all pathnames whose directory portion ends in "grids" and whose filenames start with the letter "f" and end with ".hdf", as desired.

Date Format Mark and Timestamp Extraction

This section shows how to use the scan@dateFormatMark attribute along with other search criteria in order to extract and sort datasets by a timestamp encoded in the filename. All that is required is that the timestamp be parseable by a pattern recognized by the Java language "SimpleDateFormat" class, which has also been implemented in C++ in the International Components for Unicode (ICU) library, which we use.

We base this example on the Unidata site Aggregation Tutorial. Here we have a directory with four files whose filenames contain a timestamp describable by a SimpleDateFormat (SDF) pattern. We will also use a regular expression criterion and a suffix criterion in addition to the dateFormatMark, since we have other files in the same directory and only wish to match those starting with the characters "CG" that have the suffix ".nc".

Here’s the list of files (relative to the BES data root dir):

data/ncml/agg/dated/CG2006158_120000h_usfc.nc
data/ncml/agg/dated/CG2006158_130000h_usfc.nc
data/ncml/agg/dated/CG2006158_140000h_usfc.nc
data/ncml/agg/dated/CG2006158_150000h_usfc.nc

Here’s the NcML:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf title="Test of joinNew aggregation using the scan element and dateFormatMark">
 <aggregation type="joinNew" dimName="fileTime">
   <variableAgg name="CGusfc"/>
   <scan
       location="data/ncml/agg/dated"
       suffix=".nc"
       subdirs="false"
       regExp="^.*/CG[^/]*"
       dateFormatMark="CG#yyyyDDD_HHmmss"
   />
 </aggregation>
</netcdf>

So here we joinNew on the new outer dimension fileTime. The new coordinate variable fileTime[fileTime] for this dimension will be an Array of type String that will contain the parsed ISO 8601 timestamps we will extract from the matching filenames.

We have specified that we want only NetCDF files (suffix ".nc") which match the regular expression "^.*/CG[^/]*". This means: match the start of the string, then any number of characters ending with a "/" (the path portion of the filename), then the letters "CG", then some number of characters that do not include the "/" character (which is what "[^/]*" means). Essentially, we want files whose basename (path stripped) starts with "CG" and ends with ".nc". We also do not want to recurse, but only look in the location directory "data/ncml/agg/dated" for the files.

Finally, we specify the scan@dateFormatMark pattern to describe how to parse the filename into an ISO 8601 date. The dateFormatMark is processed as follows:

  • Skip the number of characters prior to the "#" mark in the pattern while scanning the base filename (no path)
  • Interpret the next characters of the file basename using the given SimpleDateFormat string
  • Ignore any characters after the SDF portion of the filename (such as the suffix)

First, note that we do not match the characters in the dateFormatMark --- they are simply counted and skipped. So rather than "CG#" specifying the prefix before the SDF, we could have also used "XX#". This is why we must also use a regular expression to filter out files with other prefixes that we do not want in the aggregation. Note that the "#" is just a marker for the start of the SDF pattern and doesn’t count as an actual character in the matching process.

Second, we specify the dateFormatMark (DFM) as the following SDF pattern: "yyyyDDD_HHmmss". This means that we use the four digit year, then the day of the year (a three digit number), then an underscore ("_") separator, then the 24 hour time as 6 digits. Let’s take the basename of the first file as an example:

"CG2006158_120000h_usfc.nc"

We skip two characters due to the "CG#" in the DFM. Then we match the "yyyy" pattern for the year: "2006".

We then match the day of the year as "DDD" which is "158", the 158th day of the year for 2006.

We then match the underscore character "_" which is only a separator.

Next, we match the 24 hour time "HHmmss" as 12:00:00 hours:mins:secs (i.e. noon).

Finally, any characters after the DFM are ignored, here "h_usfc.nc".

We see that the four dataset files are on the same day, but sampled each hour from noon to 3 pm.

These parsed timestamps are then converted to an ISO 8601 date string which is used as the value for the coordinate variable element corresponding to that aggregation member. The first file would thus have the time value "2006-06-07T12:00:00Z", which is 7 June 2006 at noon in the GMT timezone.

The matched files are then sorted using the ISO 8601 timestamp as the sort key and added to the aggregation in this order. Since ISO 8601 is designed such that lexicographic order is isomorphic to chronological order, this orders the datasets monotonically in time from past to future. This is different from the <scan> behavior without a dateFormatMark specified, where files are ordered lexicographically (alphanumerically by full pathname) --- this order may or may not match chronological order.
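
The skip-then-parse behavior maps closely onto Python's strptime, where the SDF pattern "yyyyDDD_HHmmss" corresponds to "%Y%j_%H%M%S". This is an illustrative sketch, not the handler's ICU-based code:

```python
from datetime import datetime, timezone
import os.path

def timestamp_from_name(path, skip, sdf_len, fmt="%Y%j_%H%M%S"):
    """Skip 'skip' leading characters of the basename, parse the next
    'sdf_len' characters with strptime, and ignore the rest (the suffix)."""
    base = os.path.basename(path)
    stamp = base[skip:skip + sdf_len]
    dt = datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

files = [
    "data/ncml/agg/dated/CG2006158_130000h_usfc.nc",
    "data/ncml/agg/dated/CG2006158_120000h_usfc.nc",
]
# For "CG#yyyyDDD_HHmmss": skip 2 characters ("CG"), parse the next 14.
stamps = sorted(timestamp_from_name(f, 2, 14) for f in files)
print(stamps)
```

Sorting the ISO 8601 strings lexicographically, as above, yields the chronological order used for the aggregation.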

If we project out the ASCII dods response for the new coordinate variable, we see all of the parsed timestamps and that they are in chronological order:

String fileTime[fileTime = 4] = {"2006-06-07T12:00:00Z",
    "2006-06-07T13:00:00Z",
    "2006-06-07T14:00:00Z",
    "2006-06-07T15:00:00Z"};

We also check the resulting DDS to see that it is added as a map vector to the Grid as well:

Dataset {
    Grid {
      Array:
        Float32 CGusfc[fileTime = 4][time = 1][altitude = 1][lat = 29][lon = 26]
;
      Maps:
        String fileTime[fileTime = 4];
        Float64 time[time = 1];
        Float32 altitude[altitude = 1];
        Float32 lat[lat = 29];
        Float32 lon[lon = 26];
    } CGusfc;
    String fileTime[fileTime = 4];
} joinNew_scan_dfm.ncml;

Finally, we look at the DAS with global metadata removed:

Attributes {
  CGusfc {
        Float32 _FillValue -1.000000033e+32;
        Float32 missing_value -1.000000033e+32;
        Int32 numberOfObservations 303;
        Float32 actual_range -0.2876400054, 0.2763200104;
        fileTime {
--->            String _CoordinateAxisType "Time";
        }
        CGusfc {
        }
        time {
            String long_name "End Time";
            String standard_name "time";
            String units "seconds since 1970-01-01T00:00:00Z";
            Float64 actual_range 1149681600.0000000, 1149681600.0000000;
        }
        altitude {
            String long_name "Altitude";
            String standard_name "altitude";
            String units "m";
            Float32 actual_range 0.000000000, 0.000000000;
        }
        lat {
            String long_name "Latitude";
            String standard_name "latitude";
            String units "degrees_north";
            String point_spacing "even";
            Float32 actual_range 37.26869965, 38.02470016;
            String coordsys "geographic";
        }
        lon {
            String long_name "Longitude";
            String standard_name "longitude";
            String units "degrees_east";
            String point_spacing "even";
            Float32 actual_range 236.5800018, 237.4799957;
            String coordsys "geographic";
        }
    }
    fileTime {
--->     String _CoordinateAxisType "Time";
    }
}

We see that the aggregation has also automatically added the "_CoordinateAxisType" attribute and set it to "Time" (denoted by the "--->") as defined by the NcML 2.2 specification. The author may add other metadata to the new coordinate variable as discussed previously.

Order of Inclusion

In cases where a dateFormatMark is not specified, the member datasets are added to the aggregation in alphabetical order on the full pathname. This is important in the case of subdirectories since the path of the subdirectory is taken into account in the sort.

In cases where a dateFormatMark is specified, the extracted ISO 8601 timestamp is used as the sorting criterion, with older files being added before newer files.
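
The full-pathname ordering used when no dateFormatMark is given can be seen with plain Python sorting (an illustration of the sort key only, with made-up paths; the subdirectory name dominates the comparison regardless of the filenames):

```python
# Lexicographic sort on the full pathname: "a_dir" sorts before "b_dir",
# so its file comes first even though its filename is "later".
paths = [
    "data/agg/b_dir/CG2006158_120000h_usfc.nc",
    "data/agg/a_dir/CG2006158_130000h_usfc.nc",
]
print(sorted(paths))
```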

11.E.8. Grid Metadata Tutorial
An Example of Adding Metadata to a Grid

We will go through a basic example of adding metadata to all the possible scopes in a Grid variable:

  • The top-level Grid Structure itself
  • The data Array in the Grid
  • Each Map vector in the Grid

We will also modify the global dataset attribute container to elucidate the difference between an attribute Structure and a variable Structure.

Let’s start with a "pass-through" NcML file which wraps a dataset that Hyrax represents as a Grid. This will let us see the exact structure of the data we will want to modify (which may be slightly different from the wrapped dataset due to legacy issues with how shared dimensions are represented, etc.):

<?xml version="1.0" encoding="UTF-8"?>
<netcdf location="data/ncml/agg/grids/f97182070958.hdf" title="This file results in a Grid">
<!-- This space intentionally left blank! -->
</netcdf>

This gives the DDS:

Dataset {
    Grid {
      Array:
        UInt32 dsp_band_1[lat = 1024][lon = 1024];
      Maps:
        Float64 lat[1024];
        Float64 lon[1024];
    } dsp_band_1;
} grid_attributes_2.ncml;

and the (extensive) DAS:

Attributes {
    HDF_GLOBAL {
        UInt16 dsp_SubImageId 0;
        String dsp_SubImageName "N/A";
        Int32 dsp_ModificationDate 20040416;
        Int32 dsp_ModificationTime 160521;
        Int32 dsp_SubImageFlag 64;
        String dsp_SubImageTitle "Ingested by SCRIPP";
        Int32 dsp_StartDate 19970701;
        Float32 dsp_StartTime 70958.5;
        Int32 dsp_SizeX 1024;
        Int32 dsp_SizeY 1024;
        Int32 dsp_OffsetX 0;
        Int32 dsp_RecordLength 2048;
        Byte dsp_DataOrganization 64;
        Byte dsp_NumberOfBands 1;
        String dsp_ing_tiros_ourid "NO14****C\\217\\345P?\\253\\205\\037";
        UInt16 dsp_ing_tiros_numscn 44305;
        UInt16 dsp_ing_tiros_idsat 2560;
        UInt16 dsp_ing_tiros_iddata 768;
        UInt16 dsp_ing_tiros_year 24832;
        UInt16 dsp_ing_tiros_daysmp 46592;
        Int32 dsp_ing_tiros_milsec 1235716353;
        Int32 dsp_ing_tiros_slope 1075636998, 551287046, -426777345, -1339034123, 5871604;
        Int32 dsp_ing_tiros_intcpt 514263295, 1892553983, -371365632, 9497638, -2140793044;
        UInt16 dsp_ing_tiros_tabadr 256, 512, 768;
        UInt16 dsp_ing_tiros_cnlins 256;
        UInt16 dsp_ing_tiros_cncols 256;
        UInt16 dsp_ing_tiros_czncs 8;
        UInt16 dsp_ing_tiros_line 256;
        UInt16 dsp_ing_tiros_icol 0;
        String dsp_ing_tiros_date0 "23-MAY-10 13:54:29\\030";
        String dsp_ing_tiros_time0 "13:54:29\\030";
        UInt16 dsp_ing_tiros_label 14112, 12576, 14137;
        UInt16 dsp_ing_tiros_nxtblk 1280;
        UInt16 dsp_ing_tiros_datblk 1280;
        UInt16 dsp_ing_tiros_itape 256;
        UInt16 dsp_ing_tiros_cbias 0;
        UInt16 dsp_ing_tiros_ccoeff 0;
        Int32 dsp_ing_tiros_pastim 1235716353;
        UInt16 dsp_ing_tiros_passcn 3840;
        UInt16 dsp_ing_tiros_lostct 0;
        UInt16 dsp_ing_tiros_lost 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;
        UInt16 dsp_ing_tiros_ndrll 1280;
        UInt16 dsp_ing_tiros_ndrrec 3840, 5376, 6912, 8448, 9984, 0, 0, 0, 0, 0;
        UInt16 dsp_ing_tiros_ndrlat 46110, 44318, 42526, 40478, 38686, 0, 0, 0, 0, 0;
        UInt16 dsp_ing_tiros_ndrlon 49891, 48611, 47075, 45539, 44259, 0, 0, 0, 0, 0;
        UInt16 dsp_ing_tiros_chncnt 1280;
        UInt16 dsp_ing_tiros_chndsq 8, 8, 8, 8, 8;
        UInt16 dsp_ing_tiros_czncs2 4;
        UInt16 dsp_ing_tiros_wrdsiz 512;
        UInt16 dsp_ing_tiros_nchbas 256;
        UInt16 dsp_ing_tiros_nchlst 1280;
        Float32 dsp_ing_tiros_rpmclc 0;
        UInt16 dsp_ing_tiros_numpix 8;
        UInt16 dsp_ing_tiros_scnden 256;
        UInt16 dsp_ing_tiros_eltden 256;
        UInt16 dsp_ing_tiros_orbtno 23858;
        Int32 dsp_ing_tiros_slope2 1075636998, 551287046, -426777345, -1339034123, 5871604;
        Int32 dsp_ing_tiros_intcp2 514263295, 1892553983, -371365632, 9497638, -2140793044;
        Float32 dsp_ing_tiros_prtemp 3.0811e+10;
        Float32 dsp_ing_tiros_timerr 5.6611e-20;
        UInt16 dsp_ing_tiros_timstn 8279;
        String dsp_nav_xsatid "NO14\\005\\002";
        Byte dsp_nav_xsatty 5;
        Byte dsp_nav_xproty 2;
        Byte dsp_nav_xmapsl 0;
        Byte dsp_nav_xtmpch 4;
        Float32 dsp_nav_ximgdy 97182;
        Float32 dsp_nav_ximgtm 70954.4;
        Float32 dsp_nav_xorbit 12893;
        Float32 dsp_nav_ximgcv 71.1722, 0, 4.88181, 0, -112.11, 0, -27.9583, 0;
        Float32 dsp_nav_earth_linoff 0;
        Float32 dsp_nav_earth_pixoff 0;
        Float32 dsp_nav_earth_scnstr 1;
        Float32 dsp_nav_earth_scnstp 1024;
        Float32 dsp_nav_earth_pixstr 1;
        Float32 dsp_nav_earth_pixstp 1024;
        Float32 dsp_nav_earth_latorg 0;
        Float32 dsp_nav_earth_lonorg 0;
        Float32 dsp_nav_earth_orgrot 0;
        Float32 dsp_nav_earth_lattop 0;
        Float32 dsp_nav_earth_latbot 0;
        Float32 dsp_nav_earth_latcen 38;
        Float32 dsp_nav_earth_loncen -70;
        Float32 dsp_nav_earth_height 66.3444;
        Float32 dsp_nav_earth_width 84.2205;
        Float32 dsp_nav_earth_level 1;
        Float32 dsp_nav_earth_xspace 5.99902;
        Float32 dsp_nav_earth_yspace 5.99902;
        String dsp_nav_earth_rev " 0.1";
        Float32 dsp_nav_earth_dflag 0;
        Float32 dsp_nav_earth_toplat 71.1722;
        Float32 dsp_nav_earth_botlat 4.88181;
        Float32 dsp_nav_earth_leflon -112.11;
        Float32 dsp_nav_earth_ritlon -27.9583;
        Float32 dsp_nav_earth_numpix 1024;
        Float32 dsp_nav_earth_numras 1024;
        Float32 dsp_nav_earth_magxx 6;
        Float32 dsp_nav_earth_magyy 6;
        Int32 dsp_hgt_llnval 18;
        Int32 dsp_hgt_lltime 25744350;
        Float32 dsp_hgt_llvect 869.428, 1.14767, 868.659, 1.09635, 867.84, 1.04502, 866.979, 0.9937, 866.084, 0.942374, 865.165, 0.891045, 864.231, 0.839715, 863.292, 0.788383, 862.356, 0.737049, 861.434, 0.685714, 860.536, 0.634378, 859.67, 0.58304, 858.847, 0.531702, 858.075, 0.480362, 857.363, 0.429022, 856.718, 0.377682, 856.148, 0.326341, 855.66, 0.275, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0;
        String history "\\001PATHNLC May 23 22:40:54 2000 PATHNLC t,3,269.16,0.125,0.,0.01,271.16,308.16,,,,1,,,2,,,3,,,,,,4,,,,,,,2.,,35.,0.1,5,,,,,,,2.,,35.,0.15,55.,80.,0.005,20,,,-2,6.,t,,,,,,,,,,16,,3.5 allb=0 nlsst=1 in=/pathfdr5//97182070958.N14@INGEST@ in1=/pathfdr10/mask/oi.9727.mean out=/pathfdr4/nlc/f97182070958.FMG@0\\012\\004PATHNLC  NLSST Temp calculation date: April 10, 1996\\012\\001OISST Jan 12 17:53:43 1998 OISST  /usr3/gacsst/maketc/oi/dinp/oi.comp.bias.1997,/usr3/gacsst/maketc/oi/dout/oi.97,-3.,0.15,oi.dates.97,0\\012\\004OISST 26 97 06 22 97 06 28  7        472\\012\\001STATS Jan 12 18:27:34 1998 STATS minpix=1 maxpix=255 in=/usr3/gacsst/maketc/oi/dout//oi.9726 \\011  audit=t, callim=f, cal=f, cloud=f \\011  outm=/usr3/gacsst/etc/oi/oi.9727.mean\\012\\001OISST Jan 12 17:53:43 1998 OISST  /usr3/gacsst/maketc/oi/dinp/oi.comp.bias.1997,/usr3/gacsst/maketc/oi/dout/oi.97,-3.,0.15,oi.dates.97,0\\012\\004OISST 27 97 06 29 97 07 05  7        472\\012\\002STATS /usr3/gacsst/maketc/oi/dout//oi.9727\\012\\001OISST Jan 12 17:53:43 1998 OISST  /usr3/gacsst/maketc/oi/dinp/oi.comp.bias.1997,/usr3/gacsst/maketc/oi/dout/oi.97,-3.,0.15,oi.dates.97,0\\012\\004OISST 27 97 06 29 97 07 05  7        472\\012\\002STATS /usr3/gacsst/maketc/oi/dout//oi.9727\\012\\001OISST Jan 12 17:53:43 1998 OISST  /usr3/gacsst/maketc/oi/dinp/oi.comp.bias.1997,/usr3/gacsst/maketc/oi/dout/oi.97,-3.,0.15,oi.dates.97,0\\012\\004OISST 28 97 07 06 97 07 12  7        472\\012\\002STATS /usr3/gacsst/maketc/oi/dout//oi.9728\\012\\002PATHNLC /pathfdr10/mask/oi.9727.mean\\012\\004PATHNLC  45d coeffs used (1) =    0.759   0.947   0.110   1.460   0.000\\012\\004PATHNLC  45d coeffs used (2) =    1.320   0.952   0.071   0.882   0.000\\012\\004PATHNLC  45d coeffs used (3) =    0.000   0.000   0.000   0.000   0.000\\012\\004PATHNLC  GETOZONE I     0.0900    0.0000\\012\\001REMAP Jun  4 07:59:42 2000 REMAP in=/coral/miami/remaps/sst_8r/file_uZ.FMG 
out=/coral/miami/remaps/sst_8r/f97182070958.nwa16\\012\\004REMAP Output image pixel, line size =    6144,    6144\\012\\004REMAP Grid spacing (X,Y) = (        6.00,        6.00), Projection Code=     1\\012\\004REMAP center lon,lat,dlon,dlat =       -70.00       38.00        0.01        0.01\\012\\001merge_sb Apr 16 16:05:09 2004 merge_sb in=(file=/NOPP/carlw/atlantic/remaps/nwa16/f97182070958.nwa16, filecheck=/RAID2/sbaker/atlantic/bslines97/f97182070958.nwa16) val=0 valcheck=0 tag=0 out=(file1=/RAID2/sbaker/nwa1024d/NDC/dsp_data/f97182070958.tmp_m2)\\012\\001merge_sb Apr 16 16:05:18 2004 merge_sb in=(file=/RAID2/sbaker/nwa1024d/NDC/dsp_data/f97182070958.tmp_m2, filecheck=/RAID/sbaker/DECLOUD/landmask16.img) val=1 valcheck=2 tag=0 out=(file1=/RAID2/sbaker/nwa6144d/NDC/dsp_data/f97182070958.nwa16)\\012\\001CONVRT Apr 16 16:05:21 2004 CONVRT 1024,1024,0,0,6,6,0,0,f,f,t,16,,SUB,1 in=/RAID2/sbaker/nwa6144d/NDC/dsp_data/f97182070958.nwa16   out=/RAID2/sbaker/nwa1024d/NDC/dsp_data/f97182070958.nwa16\\012\\012@\\000\\000\\000";
    }
    dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
}

Let’s say we want to add the following attributes:

  1. Add an attribute called "ncml_location" to the HDF_GLOBAL attribute container; since the file is wrapped by our NcML, the location of the original file might not be obvious.
  2. Add the same attribute to the dsp_band_1 Grid itself, so it is easier to find and survives if the Grid is projected.
  3. Add a "units" attribute to the Array member variable dsp_band_1 of the Grid, matching the containing Grid’s "units" attribute with the value "Temp".
  4. Add a "units" attribute to the lat map vector as a String with the value "degrees_north".
  5. Add a "units" attribute to the lon map vector as a String with the value "degrees_east".

First, let’s add "ncml_location" to the HDF_GLOBAL attribute container. To do this, we need to traverse into the scope of HDF_GLOBAL, which NcML treats as a Structure:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf location="data/ncml/agg/grids/f97182070958.hdf" title="This file results in a Grid">
  <!-- Traverse into the HDF_GLOBAL attribute Structure (container) -->
  <attribute name="HDF_GLOBAL" type="Structure">
    <!-- Specify the new attribute in that scope -->
1)  <attribute name="ncml_location" type="String" value="data/ncml/agg/grids/f97182070958.hdf"/>
  </attribute>
</netcdf>

This results in the following (clipped for clarity) DAS:

Attributes {
    HDF_GLOBAL {
        UInt16 dsp_SubImageId 0;
        ... *** CLIPPED FOR CLARITY ***  ...
1)    String ncml_location "data/ncml/agg/grids/f97182070958.hdf";
    }
    dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
}

We can see at 1) that the new attribute has been added to HDF_GLOBAL as desired.

Next, we want to add the same attribute to the top-level dsp_band_1 Grid variable. Here’s the NcML:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf location="data/ncml/agg/grids/f97182070958.hdf" title="This file results in a Grid">
  <!-- Traverse into the HDF_GLOBAL attribute Structure (container) -->
 <attribute name="HDF_GLOBAL" type="Structure">
   <!-- Specify the new attribute in that scope -->
   <attribute name="ncml_location" type="String" value="data/ncml/agg/grids/f97182070958.hdf"/>
 </attribute>
 <!-- Traverse into the dsp_band_1 variable Structure (actually a Grid) -->
 <variable name="dsp_band_1" type="Structure">
   <!-- Specify the new attribute in that scope -->
2) <attribute name="ncml_location" type="String" value="data/ncml/agg/grids/f97182070958.hdf"/>
 </variable>
</netcdf>

…which gives the (clipped again) DAS:

Attributes {
    HDF_GLOBAL {
       ... *** CLIPPED FOR CLARITY *** ...
        String ncml_location "data/ncml/agg/grids/f97182070958.hdf";
    }
    dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
2)    String ncml_location "data/ncml/agg/grids/f97182070958.hdf";
        dsp_band_1 {
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
}

We have denoted the injected metadata with a 2).

As a learning exercise, suppose we made a mistake and used <attribute> rather than <variable> to refer to the dsp_band_1 attribute table:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf location="data/ncml/agg/grids/f97182070958.hdf" title="This file results in a Grid">
  <!-- Traverse into the HDF_GLOBAL attribute Structure (container) -->
 <attribute name="HDF_GLOBAL" type="Structure">
   <!-- Specify the new attribute in that scope -->
   <attribute name="ncml_location" type="String" value="data/ncml/agg/grids/f97182070958.hdf"/>
 </attribute>
 <!-- THIS IS AN ERROR! -->
 <attribute name="dsp_band_1" type="Structure">
   <!-- Specify the new attribute in that scope -->
   <attribute name="ncml_location" type="String" value="data/ncml/agg/grids/f97182070958.hdf"/>
 </attribute>
</netcdf>

Then we get a Parse Error…

<?xml version="1.0" encoding="ISO-8859-1"?>
<response xmlns="http://xml.opendap.org/ns/bes/1.0#" reqID="some_unique_value">
  <getDAS>
      <BESError><Type>3</Type>
           <Message>NCMLModule ParseError: at line 11: Cannot create a new attribute container with name=dsp_band_1 at current scope since a variable with that name already exists.  Scope=</Message>
           <Administrator>admin.email.address@your.domain.name</Administrator><Location><File>AttributeElement.cc</File><Line>277</Line></Location>
      </BESError>
   </getDAS>
</response>

…which tells us the problem: we tried to create an attribute container named dsp_band_1, but a variable with that name already exists at that scope. An attribute and a variable at the same scope may not share a name.

Next, we want to add the "units" attribute that is on the Grid itself to the actual data Array inside the Grid (say we know we will be projecting it out with a constraint and don’t want to lose this metadata). The NcML now becomes:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf location="data/ncml/agg/grids/f97182070958.hdf" title="This file results in a Grid">
  <!-- Traverse into the HDF_GLOBAL attribute Structure (container) -->
 <attribute name="HDF_GLOBAL" type="Structure">
   <!-- Specify the new attribute in that scope -->
   <attribute name="ncml_location" type="String" value="data/ncml/agg/grids/f97182070958.hdf"/>
 </attribute>
 <!-- Traverse into the dsp_band_1 variable Structure (actually a Grid) -->
 <variable name="dsp_band_1" type="Structure">
   <!-- Specify the new attribute in the Grid's attribute table -->
   <attribute name="ncml_location" type="String" value="data/ncml/agg/grids/f97182070958.hdf"/>
   <!-- While remaining in the Grid, traverse into the Array dsp_band_1: -->
   <variable name="dsp_band_1">
     <!-- And add the attribute there.  Fully qualified name of this scope is "dsp_band_1.dsp_band_1" -->
3)   <attribute name="units" type="String" value="Temp"/>
   </variable> <!-- Exit the Array variable scope, back to the Grid level -->
 </variable>
</netcdf>

Our modified DAS is now…

Attributes {
    HDF_GLOBAL {
       ... *** CLIPPED FOR CLARITY *** ...
        String ncml_location "data/ncml/agg/grids/f97182070958.hdf";
    }
    dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
        String ncml_location "data/ncml/agg/grids/f97182070958.hdf";
        dsp_band_1 {
3)        String units "Temp";
        }
        lat {
            String name "lat";
            String long_name "latitude";
        }
        lon {
            String name "lon";
            String long_name "longitude";
        }
    }
}

…where the 3) denotes the newly injected metadata on dsp_band_1.dsp_band_1.

Next, we will add the units to both of the map vectors in the next version of our NcML:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf location="data/ncml/agg/grids/f97182070958.hdf" title="This file results in a Grid">
  <!-- Traverse into the HDF_GLOBAL attribute Structure (container) -->
 <attribute name="HDF_GLOBAL" type="Structure">
   <!-- Specify the new attribute in that scope -->
   <attribute name="ncml_location" type="String" value="data/ncml/agg/grids/f97182070958.hdf"/>
 </attribute>
 <!-- Traverse into the dsp_band_1 variable Structure (actually a Grid) -->
 <variable name="dsp_band_1" type="Structure">
   <!-- Specify the new attribute in the Grid's attribute table -->
   <attribute name="ncml_location" type="String" value="data/ncml/agg/grids/f97182070958.hdf"/>
   <!-- While remaining in the Grid, traverse into the Array dsp_band_1: -->
   <variable name="dsp_band_1">
     <!-- And add the attribute there.  Fully qualified name of this scope is "dsp_band_1.dsp_band_1" -->
     <attribute name="units" type="String" value="Temp"/>
   </variable> <!-- Exit the Array variable scope, back to the Grid level -->
   <!-- Traverse into the lat map vector variable -->
   <variable name="lat">
     <!-- Add the units -->
4)   <attribute name="units" type="String" value="degrees_north"/>
   </variable>
   <!-- Traverse into the lon map vector variable -->
   <variable name="lon">
     <!-- Add the units -->
5)   <attribute name="units" type="String" value="degrees_east"/>
   </variable>
 </variable>
</netcdf>

…where we denote the changes with 4) and 5). Here’s the resulting DAS:

Attributes {
    HDF_GLOBAL {
        ... *** CLIPPED FOR CLARITY *** ...
1)      String ncml_location "data/ncml/agg/grids/f97182070958.hdf";
    }
    dsp_band_1 {
        Byte dsp_PixelType 1;
        Byte dsp_PixelSize 2;
        UInt16 dsp_Flag 0;
        UInt16 dsp_nBits 16;
        Int32 dsp_LineSize 0;
        String dsp_cal_name "Temperature";
        String units "Temp";
        UInt16 dsp_cal_eqnNumber 2;
        UInt16 dsp_cal_CoeffsLength 8;
        Float32 dsp_cal_coeffs 0.125, -4;
        Float32 scale_factor 0.125;
        Float32 add_off -4;
2)       String ncml_location "data/ncml/agg/grids/f97182070958.hdf";
        dsp_band_1 {
3)          String units "Temp";
        }
        lat {
            String name "lat";
            String long_name "latitude";
4)          String units "degrees_north";
        }
        lon {
            String name "lon";
            String long_name "longitude";
5)          String units "degrees_east";
        }
    }
}

…where we have marked all the new metadata we have injected, including the new attributes on the map vectors.

Although we only added metadata to the Grid here, the same traversal syntax works with the other forms of <attribute> to modify existing attributes, and with <remove> to delete unwanted or incorrect ones.
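For example, the following sketch (using the same sample file) modifies an existing attribute, renames one, and removes another. The new value "Kelvin" and the name "calibration_name" are purely illustrative; the orgName rename and the <remove> element follow NcML 2.2 syntax, and exact behavior may vary with the handler version:

<?xml version="1.0" encoding="UTF-8"?>
<netcdf location="data/ncml/agg/grids/f97182070958.hdf">
  <variable name="dsp_band_1" type="Structure">
    <!-- Modify an existing attribute: same name, new value -->
    <attribute name="units" type="String" value="Kelvin"/>
    <!-- Rename an existing attribute using orgName (NcML 2.2) -->
    <attribute name="calibration_name" orgName="dsp_cal_name" type="String"/>
    <!-- Remove an unwanted attribute -->
    <remove name="dsp_Flag" type="attribute"/>
  </variable>
</netcdf>

As with the additions above, the enclosing <variable> element establishes the scope in which the modification, rename, or removal applies.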

The only place where this syntax varies slightly is in adding metadata to an aggregated Grid. Please see the tutorial section on aggregating grids for more information.

Last Updated: Sep 24, 2019 at 4:35 PM EDT