NOAA maintains thousands of datasets of various kinds (remote sensing, in-situ, derived etc.) across multiple domains (climate, weather, ecology, etc.) consumed by a wide variety of users (scientists, engineers, urban planners, etc.), but discovering, accessing, and using these datasets remains a significant challenge. Recent developments in cloud-native geospatial technologies and standards have made geospatial data easier to access and manage now than ever before. Now the question is: can we go a step further and also make it trivial to query and analyze this data using modern large language models (LLMs)? 

In this blog post, we present some highlights from our ongoing open-source work aimed at answering precisely these questions, as part of a NOAA-funded study. The goal is to build a natural-language interface that allows users to not only search for datasets (“show me datasets related to sea surface temperature”), but also extract arbitrary insights from them (“What are the trends in sea surface temperatures around near-coast gulf waters over the past several decades?”) and get factually-grounded answers derived from real data that are not only correct and comprehensive, but also include detailed provenance information so that users may verify them.

This is a deeply interesting problem that touches on areas such as cloud-native formats, multi-agent workflows, LLM evaluations, knowledge representation, programming language design, distributed computing, and user experience design. Here, we limit ourselves to sharing some highlights from the following aspects of our current approach:

  • Analyzing data
  • Finding the right dataset
  • Orchestrating a multi-agent workflow
  • Presenting the answers and steps in a way that users can verify

We note that the project is still ongoing and details are subject to change.

Example response for a question about sea surface temperature. The left side of the image includes a plan for answering the question, and the right side includes an interactive heat map with orange, yellow, and purple shaded in regions around the coast south of Louisiana to show sea surface temperature.
Example response for the question “Show me the sea surface temperature within 25 miles off the coast of Louisiana on June 01, 2024 and compute its average.” The output includes: a plan for answering the question, a visualization rendered on the interactive map, a text response, and detailed provenance information (shown in a subsequent figure).

Data Analysis

Climate data is usually distributed as daily or monthly NetCDF files which can number in the tens of thousands or more for long-lived datasets. The modern cloud-native geospatial (CNG) stack provides powerful tools to handle such data:

  • Virtual Zarr stores allow these disparate files to be consolidated into a single virtual dataset
  • Xarray allows these virtual datasets to be transformed
  • Dask allows these transformation to be applied to vast amounts of data in an efficient, distributed manner

What we are concerned with in our application is figuring out the right transformations to apply to these datasets to precisely answer any question that a user might ask. This is the kind of highly complex reasoning task that would have been impossible to perform with any kind of reliability in the past, but is the sort of thing that state-of-the-art LLMs have been getting increasingly better at.

How exactly we use an LLM agent to solve such problems is an interesting question. Possible approaches range from generating and executing raw Python code to merely choosing between existing human-written algorithms for different kinds of problems. The approach that we have settled on for now is a compromise between the flexibility of the former and the reliability of the latter: for any given user query, we ask the “algorithm agent” (visualized below) to construct a bespoke algorithm by composing together pre-defined human-written functions. The output is a structured JSON object that we can validate in various ways, including applying static type checking, and then translate to Python code which is then executed on a Dask cluster.

A graphic design with info about algorithm agents. It starts with a robot graphic at the top  with lines leading down to "gulf of mexico", "data array" and "June 2024". Gulf of Mexico feeds into geocode, which feeds into FilterByPolygon into ReduceArray and into PlotRaster. Data Array and June 2024 feed into SliceArray which feeds into FilterByPolygon and then ReduceArray and PlotRaster. On the right there is another box labled "Function Library" with Geocode, BinaryOp, SliceArray, ReduceArray, GroupBy, FilterByPolygon, PlotRaster, and PlotTimeSeries.
The algorithm agent constructs bespoke algorithms for user queries by using pre-defined functions as building blocks.

Data Search

Before we analyze the data, as discussed above, we need to find the right dataset. This is not a simple matter because the users’ questions will usually not contain the actual name of the datasets or the variables within them. Furthermore, there might be multiple datasets that seem relevant to any given question. This suggests a need for intelligence at the search level, which is why we encapsulate the search functionality inside a “data search agent”. This agent queries a knowledge graph containing metadata for various climate datasets represented using standard scientific ontologies.

Graphic diagram with info about Data Search Agent. Robot icon with arrows pointing to it from SPARQL tool and Fetch tool. There are arrows from SPARQL to Fetch also pointing to a knowledge graph of dataset metadata.
The data search agent finds the right datasets by querying a knowledge graph of dataset metadata.

Orchestration

We organize our multi-agent workflow using an orchestrator pattern. The orchestration agent is the only agent that communicates directly with the user and delegates tasks to other agents via special tools. Using tools as the interface for agent-to-agent communication allows us to more precisely define its structure, and thus, validate it. It also forces the orchestration agent to translate potentially ambiguous user queries into precise inputs to the tool thus limiting the spread of these ambiguities into the rest of the workflow.

An Orchestration Agent graphic diagram with a robot graphic and then a box labeled tools with planning, text response, data discovery, and data analysis inside it.
The orchestration agent delegates tasks to other agents via tools.

Presentation

When dealing with complex agentic workflows, especially ones that purport to perform scientific analysis, it is important that all the steps performed be transparent and verifiable. To this end, we surface detailed progress updates and provenance information in the UI, including a pseudocode representation of the algorithm used.

A screenshot demonstrating detailed provenance information for generated answers.
A screenshot demonstrating detailed provenance information for generated answers.

The UI surfaces detailed provenance information for the generated answers.

Big Picture and future development

There is a widespread recognition in the geospatial community of the potential of LLM-powered natural-language, conversational interfaces to accelerate data discovery and analysis, with multiple groups building demos similar to this one for various datasets. In many ways this can be seen as the logical culmination of the painstaking work on standards, data formats, and software tooling that the open source geospatial community has engaged in over many years. It is only with that groundwork in place that we can finally apply modern AI techniques effectively.

But this is also just the start. Now that we know this works, the incentive to “uplift” all datasets into analysis-ready, cloud-optimized formats is greater than ever before. EarthMover’s recently announced data marketplace is a step in this direction. Alongside this, on the AI front, an open question is how best to make these datasets discoverable for AI agents. Well-curated metadata will go a long way in enabling this. Relatedly, exposing datasets and capabilities through public MCP servers and tools will reduce siloing and enable really powerful new workflows.

If you’re interested in discussing this work, or learning more about querying and analyzing geospatial data more generally, we’d love to connect. Reach out to our team any time on our contact us page to share related work or discuss these ideas in greater depth. 

Acknowledgements

This work is a collaboration between Element 84 and TechTraverse as part of the NOAA-funded “Study to Determine Natural Language Processing Capabilities with the NCCF Open Knowledge Mesh” BAA. Project contributors include: Kendall Aubertot, Sean Malone, Jason Gilman, Evan McQuinn, and Steph Wall.