As part of our work writing a STAC + Zarr Report for the Cloud Optimized Geospatial Formats Guide, our team is exploring the partially overlapping goals of STAC and Zarr and offering suggestions for how to use them together. This effort is particularly timely given several recent developments in the space, which we discuss later in this post.

Why are we excited about the interaction between Zarr and STAC now, specifically? 

Neither STAC nor Zarr is particularly new, and the discussion about their overlapping capabilities and how to use them together has been going on for at least 5 years (for instance: here, here, here, here). As ESA plans to embark on large-scale data distribution in Zarr, it feels like a good time to pull together the best practices that have emerged over the years and consider how recent developments in Zarr and STAC might lead to new approaches.

Recent Developments

Zarr-3, Virtual Zarr, GeoZarr

Let’s start with the Zarr side of things. The recent release of Zarr v3 and the corresponding implementation in zarr-python include the Sharding Codec. This means that Zarr can now store more than one chunk within a single object in S3 (this object is called a shard), letting you avoid the exploded file hierarchy of previous Zarr versions. This is a significant advancement because requiring one API call per chunk is unwieldy for very large datasets. The new release of zarr-python also improves the experience by implementing an async API, which allows requests to run concurrently.

Between Zarr v3, the advent of Icechunk, and recent development on VirtualiZarr, there has been increased interest in virtual Zarr stores. Virtual Zarr stores (traditionally stored as kerchunk reference files, but now also as Icechunk stores) let data consumers access data from legacy chunked file formats (such as NetCDF and HDF5) as if they were Zarr stores (read more about why these formats don’t natively work well in the cloud in Tom Nicholas’ excellent blog post). The takeaway is that users can have cloud-optimized workflows without all the data duplication that is implied by moving data to Zarr.

While this has been going on in the regular Zarr world, the GeoZarr working group has also been cooking up a specification for encoding geospatial data in Zarr (see Max Jones’ slides from the STAC + Zarr workshop, April 2025). Remember that regular Zarr is for any kind of array – similar to how TIFF is for any type of raster while GeoTIFF is specifically spatial.

xarray.DataTree

Meanwhile in xarray, xarray.DataTree was merged in as a core data structure. This fills an important gap because xarray is an essential part of the experience of working with multidimensional data arrays in Python. The DataTree structure lets you construct hierarchies of groups of arrays that have different dimensions and/or resolutions. If you think that sounds like Zarr, you are right. The incorporation of DataTree as a core structure within xarray means that Zarr is now officially a first-class citizen.

STAC Collection Search and Link Templates Extension

Typically “searching” in STAC means searching through the items in a catalog. But the STAC API also has an extension defining collection-level search, which allows flexible search through the collections in a catalog. Recently, collection search was added to stac-fastapi, making it easier than ever for STAC catalogs to support it. Support has also been added to pystac-client, making it easier for users to take advantage of this new functionality. Collection search is great for discovering datasets of interest – even over multiple STAC Catalogs (this is called federated search; here is an example).

Also on the STAC side there has been recent work on Qubed, which is being formalized in the Link Templates Extension. One of the goals of this work is to improve the browsability of nested structures by providing a succinct pattern for referencing a nested path. This has applications for Zarr because it allows direct access to data within subgroups.

Now that we’ve covered recent developments let’s zoom out a bit and get into why it has been challenging to figure out how to use Zarr and STAC together.

How do STAC and Zarr compare?

STAC is a specification for indexing and finding any type of data that has spatial and temporal dimensions. STAC builds on GeoJSON and is often used to store references to satellite imagery stored in COGs.

The Zarr specification is designed to store self-describing hierarchies of chunked multidimensional arrays. It is often used for storing outputs from earth system models. These datasets tend to be on well-aligned grids representing global regions. 

Both STAC and Zarr support flexible hierarchies of arbitrary metadata, and they share overlapping goals. The main thing to keep in mind is that Zarr is a data format and STAC is not. Sometimes we conflate “STAC” with “COGs stored in a STAC catalog”, but STAC can be used to catalog anything as long as it has spatiotemporal dimensions.

| STAC | Zarr (+ xarray for last two rows) |
|------|-----------------------------------|
| for data with spatial and temporal dimensions | for groups of arrays with any type of dimensions |
| supports arbitrary metadata for catalogs, collections, items, assets | supports arbitrary metadata for groups, arrays |
| storage of STAC metadata is completely decoupled from storage of data | storage of metadata is coupled to data (i.e., in the same directory, except when virtualized) |
| good for discovering datasets of interest | good for filtering within a dataset |
| searching returns STAC items or collections | filtering returns subsets of arrays (potentially composed of parts of multiple chunks) |

Although they have many similarities, Zarr does not have a catalog concept – there is nothing above the Zarr store. Data providers have a responsibility to make their data discoverable, and doing so requires a means for people to find the data and figure out what is in it. That’s where STAC comes in.

How can Zarr + STAC help Data Consumers?

Data consumers have questions like:

  • How can I discover datasets that might be applicable to my work?
  • I know I want to use a particular dataset – but where can I find it and how do I access it? 

A Zarr store sitting on S3 is not enough to answer either of these questions by itself. The data is open, but it isn’t discoverable. Users would have to already know what and where their data is.

But if the same Zarr store were indexed in a STAC Catalog, you could search using a term like “sea surface temperature” and find the Zarr store along with information about how to access it.

How can Zarr + STAC help Data Producers? 

In contrast to consumers, data producers are focused on ensuring that their data is up-to-date and minimizing the number and size of requests that are required for people to use it. They need to decide where Zarr stores should go in a STAC catalog, and what metadata should be duplicated from the Zarr dataset into STAC.

The appropriate decision given any of the above considerations will vary based on, among other things, the shape of the data being considered. There is no easy, one-size-fits-all solution.

[Figure: under the header "Unaligned", seven squares overlap and are strewn about; under the header "Aligned", a single 3D cube composed of several stacked 2D squares.]

These drawings are a simplification, since in reality you likely have 4D data or possibly many data cubes at different resolutions.

One Big Zarr (Aligned Data)

This is what people tend to think of when they think of Zarr data. Examples include the ERA5 and CMIP6 datasets, which are typically global in scale and have many subgroups. In data processing terms these are Level 3 and Level 4 data.

In this scenario, there is a collection for each dataset with a collection-level asset containing a reference to the Zarr store. There are no STAC Items at all. 

The simplest implementation of One Big Zarr is just this: a collection with no metadata and an asset pointing to the Zarr store on S3. This is already useful! It provides a single location (STAC) where users can find out how to access many datasets, without any metadata duplication!

Even so, without metadata the dataset isn’t as discoverable as it could otherwise be.

[Figure: two side-by-side graphs, "Data producer implements" and "Data consumer gets", plot usefulness (y) against effort required to maintain STAC (x), with a line trending upward labeled "more metadata in STAC". Producer bubbles: "collection-level asset pointing to store" at the origin; "collection-level link templates" higher on both axes; "collection-level cube:variables" highest in usefulness but requiring less effort than link templates. Consumer bubbles, in matching positions: "find how to access known datasets"; "access array of interest"; "discover new datasets" (highest usefulness).]

To improve discoverability, data producers can use the Datacube Extension to capture the variables contained within the Zarr store. Tom Augspurger even made a tool – xstac – for extracting STAC metadata from xarray objects. This metadata doesn’t change very often, so it is reasonable to maintain, and providing it enables users to find new datasets by searching for a particular topic (like “sea surface temperature”).
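The resulting fields follow the Datacube Extension. A hand-written sketch (illustrative values; a tool like xstac can derive these from the xarray representation) of what gets attached to the collection's top-level JSON:

```python
# Datacube Extension fields describing what is inside the Zarr store:
# the dimensions of the cube and the variables defined over them.
datacube_fields = {
    "cube:dimensions": {
        "time": {"type": "temporal", "extent": ["1979-01-01T00:00:00Z", None]},
        "lat": {"type": "spatial", "axis": "y", "extent": [-90, 90]},
        "lon": {"type": "spatial", "axis": "x", "extent": [-180, 180]},
    },
    "cube:variables": {
        "sea_surface_temperature": {
            "type": "data",
            "dimensions": ["time", "lat", "lon"],
            "unit": "K",
        },
    },
}
print(sorted(datacube_fields["cube:variables"]))
```

With `cube:variables` in place, a free-text collection search for "sea surface temperature" can surface this dataset.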

[Figure: a STAC Catalog contains a STAC Collection (title, extent: spatial/temporal, cube:variables, cube:dimensions, ...) with a STAC Asset whose href links to a Zarr store inside S3-like object storage. Each store contains an entire dataset with dims: time, lat, lon, variables.]

In this scenario, STAC metadata is purely used for search and discovery, and cannot be used for filtering.

Many Smaller Zarrs (Unaligned Data)

This setup is similar to the common “STAC + COGs” approach, but instead of having one COG per temporal and spatial extent (in satellite terminology: one scene), there is one Zarr store for the scene. This approach works well for unaligned data where each Zarr store might cover a small portion of the globe and have a different coordinate reference system than another Zarr store. This is typically what data looks like at data processing levels 1 and 2. 

In this scenario the STAC Collection represents the whole dataset (for instance Sentinel-2 L2A) and under that collection there are STAC Items representing each scene with an asset pointing to the Zarr store for that scene. Each STAC Item should specify the spatial and temporal extents of the data so that data consumers can determine which Zarr stores they need.

[Figure: two side-by-side graphs, "Data producer implements" and "Data consumer gets", plot usefulness (y) against effort required to maintain STAC (x), with a line trending upward labeled "more metadata in STAC". Producer bubbles: "item-level assets pointing to stores" at the origin; "item-level cube:variables, cube:dimensions" slightly more useful but requiring much more effort; "collection-level summaries and item_assets" more useful and requiring less effort. Consumer bubbles, in matching positions: "find how to access known datasets"; "explore full metadata"; "discover new datasets"; and, highest in usefulness at moderate effort, "find data store for region of interest".]

Once you have item-level assets and item-level extents the Zarr stores are indexed and usable, but to make the data discoverable you can add collection-level summaries and item_assets. These metadata are unlikely to change often, so they shouldn’t take much effort to maintain, and they allow users to search for datasets containing variables of interest.

Additional item-level metadata should be included if it is not already contained within the Zarr store (provenance, for example) or if it is useful for search.

[Figure: a STAC Catalog contains a STAC Collection (title, summaries, item_assets, ...) holding multiple STAC Items (title, extent, ...), each with a STAC Asset whose href links to a separate Zarr store inside S3-like object storage. Each store represents one region in space at one time (a scene/granule) with dims: time, lat, lon, variables.]

Cloud Optimized Geospatial Formats Guide 

Recently, members of our team at Element 84, in conjunction with folks at Development Seed, have been writing a STAC + Zarr Report for the Cloud Optimized Geospatial Formats Guide. This guide is an existing resource hosted by the Cloud Native Geospatial community that provides a wealth of information for anyone looking to learn about best practices for using geospatial data in the cloud. The new report goes into more detail from both the data producer and data consumer perspectives, and you can view the full document (still in development) here.

If you’d like to chat more about using STAC and Zarr in your work, our team would love to connect! You can find us any time through our contact us page. 

Acknowledgements

The STAC + Zarr Report that this blog post came out of was supported by NASA IMPACT as part of the VEDA project. In particular, I got tremendous support from my colleagues at Development Seed (Max Jones, Pete Gadomski, Henry Rodman, Emmanuel Mathot, Aimee Barciauskas, Sean Harkins), my Element 84 coworkers (Matt Hanson, Nathan Zimmerman, Sara Mack), and from Matthias Mohr. I also benefited from the pioneering work of Tom Augspurger in pushing the limits of STAC in the context of the Planetary Computer.