A Hierarchical Multi-Agent System for Autonomous Discovery in Geoscientific Data Archives

Dmitrii Pantiukhin, Ivan Kuznetsov, Boris Shapkin, Antonia Anna Jost, Thomas Jung, Nikolay Koldunov

TL;DR

PANGAEA-GPT is a hierarchical multi-agent framework for autonomous data discovery and analysis. It implements a centralized Supervisor-Worker topology with strict data-type-aware routing, sandboxed deterministic code execution, and self-correction via execution feedback, which lets agents diagnose and resolve runtime errors.

Abstract

The rapid accumulation of Earth science data has created a significant scalability challenge; while repositories like PANGAEA host vast collections of datasets, citation metrics indicate that a substantial portion remains underutilized, limiting data reusability. Here we present PANGAEA-GPT, a hierarchical multi-agent framework designed for autonomous data discovery and analysis. Unlike standard Large Language Model (LLM) wrappers, our architecture implements a centralized Supervisor-Worker topology with strict data-type-aware routing, sandboxed deterministic code execution, and self-correction via execution feedback, enabling agents to diagnose and resolve runtime errors. Through use-case scenarios spanning physical oceanography and ecology, we demonstrate the system's capacity to execute complex, multi-step workflows with minimal human intervention. This framework provides a methodology for querying and analyzing heterogeneous repository data through coordinated agent workflows.
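The self-correction loop mentioned in the abstract can be illustrated with a minimal sketch. It is not the PANGAEA-GPT implementation: the names `run_in_subprocess`, `self_correct`, and `stub_generator` are hypothetical, a plain Python subprocess stands in for the sandbox (it isolates state but is not a security boundary), and a hard-coded stub stands in for the LLM agent that would rewrite the code after reading the error.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional


def run_in_subprocess(code: str, timeout: float = 30.0) -> tuple[bool, str]:
    """Run code in a fresh interpreter; return (ok, stdout-or-stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "task.py"
        script.write_text(code)
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout, cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return False, "TimeoutExpired"
    if proc.returncode == 0:
        return True, proc.stdout
    return False, proc.stderr


def self_correct(
    generate: Callable[[str, Optional[str]], str],
    task: str,
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Generate code, execute it, and feed any error back to the generator."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(task, feedback)
        ok, output = run_in_subprocess(code)
        if ok:
            return output, attempt
        feedback = output  # the traceback becomes context for the next try
    raise RuntimeError(f"no working code after {max_attempts} attempts:\n{feedback}")


def stub_generator(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM agent (assumption, not from the paper)."""
    if feedback is None:
        return "print(total([1, 2, 3]))"  # deliberately broken: NameError
    if "NameError" in feedback:
        return "print(sum([1, 2, 3]))"    # 'repaired' after reading the error
    return "raise SystemExit(1)"


output, attempts = self_correct(stub_generator, "sum a list")
```

Capping the number of attempts keeps a failing task from looping indefinitely; the cap of three used here is an arbitrary choice for the sketch.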

Paper Structure (24 sections, 6 figures)


Figures (6)

  • Figure 1: Conceptual framework of the PANGAEA-GPT Multi-Agent System (MAS). The system uses a two-phase, hierarchical architecture. First, a Search Agent discovers relevant datasets from a user's natural language request. A Supervisor Agent then delegates analysis and visualization tasks to a team of specialist agents working within a secure sandbox environment. Finally, a Writer Agent synthesizes the results into a cohesive report.
  • Figure 2: Exploratory analysis of microplastic distribution in the Weddell Sea (Supplementary Note 1).
  • Figure 3: Validation scatter plot comparing in-situ temperature observations (0--500 m) from ten HAUSGARTEN moorings with co-located values from the Copernicus GLORYS12V1 reanalysis. The Oceanographer Agent retrieved daily thetao fields, performed 4D nearest-grid-point matchup ($N=135{,}678$), and calculated statistics (Bias $+0.35\,^\circ$C, RMSE $1.09\,^\circ$C, $r=0.31$). Color scale indicates observation depth.
  • Figure 4: ERA5/MOSAiC Lagrangian validation and wind regime characterization.
  • Figure 5: Bio-physical coupling generated by the Oceanographer Agent. The system augmented the PANGAEA biological dataset by retrieving and co-locating 4D hydrographic data ($\theta$/$S$) from the Copernicus Marine Service GLORYS12V1 reanalysis. The T-S diagram highlights the co-occurrence of Aglantha digitale with distinct water masses ($\log_{10}$-abundance).
  • ...and 1 more figures
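
The nearest-grid-point matchup and the statistics quoted in the Figure 3 caption (bias, RMSE, $r$) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the Oceanographer Agent's code: the function names are hypothetical, the grid is assumed regular with one coordinate vector per axis, bias is assumed to mean model minus observation, and the arrays are toy values rather than GLORYS12V1 or HAUSGARTEN data.

```python
import numpy as np


def nearest_index(axis_values, target):
    """Index of the grid coordinate closest to target along one axis."""
    return int(np.abs(np.asarray(axis_values, dtype=float) - target).argmin())


def matchup(field, time, depth, lat, lon, obs_coords):
    """Sample a (time, depth, lat, lon) field at the grid point nearest each observation."""
    return np.array([
        field[
            nearest_index(time, t),
            nearest_index(depth, z),
            nearest_index(lat, y),
            nearest_index(lon, x),
        ]
        for t, z, y, x in obs_coords
    ])


def validation_stats(model, observed):
    """Return (bias, RMSE, Pearson r); bias is model minus observation (assumed convention)."""
    model = np.asarray(model, dtype=float)
    observed = np.asarray(observed, dtype=float)
    diff = model - observed
    bias = diff.mean()
    rmse = np.sqrt((diff ** 2).mean())
    r = np.corrcoef(model, observed)[0, 1]
    return bias, rmse, r


# Toy 2x2x2x2 grid and three synthetic observations (illustrative values only).
time = [0.0, 1.0]
depth = [0.0, 100.0]
lat = [78.0, 79.0]
lon = [4.0, 5.0]
field = np.arange(16, dtype=float).reshape(2, 2, 2, 2)

obs_coords = [(0.1, 10.0, 78.2, 4.1), (0.9, 90.0, 78.9, 4.8), (0.2, 80.0, 78.1, 4.9)]
obs_values = [1.0, 14.0, 5.0]

model_values = matchup(field, time, depth, lat, lon, obs_coords)
bias, rmse, r = validation_stats(model_values, obs_values)
```

Nearest-point sampling avoids interpolating across land masks or missing cells, at the cost of a representation error of up to half a grid step along each axis.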