CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows

Hyeonjae Kim, Chenyue Li, Wen Deng, Mengxi Jin, Wen Huang, Mengqian Lu, Binhang Yuan

TL;DR

ClimateAgent addresses a bottleneck in climate analytics, where data are massive and heterogeneous, with a specialized multi-agent orchestration framework that decomposes complex questions into subtasks carried out by domain-aware agents. The Orchestrate-Agent, Plan-Agent, Data-Agents, and Coding-Agent provide dynamic API awareness, persistent context, and a self-correcting execution loop, yielding robust end-to-end climate workflows. On the Climate-Agent-Bench-85 benchmark, ClimateAgent achieves $100\%$ task completion with a report quality score of $8.32$, outperforming GitHub Copilot ($6.27$) and a GPT-5 baseline ($3.26$), which points to gains in planning, coordination, and error correction across diverse climate phenomena. The framework targets reproducible, reliable automation of climate science analysis, reducing manual effort and supporting rapid, hypothesis-driven work on atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature (SST), and tropical cyclones, with practical implications for adaptation and policy.

Abstract

Climate science demands automated workflows to transform comprehensive questions into data-driven statements across massive, heterogeneous datasets. However, generic LLM agents and static scripting pipelines lack climate-specific context and flexibility, and therefore perform poorly in practice. We present ClimateAgent, an autonomous multi-agent framework that orchestrates end-to-end climate data analytics workflows. ClimateAgent decomposes user questions into executable sub-tasks coordinated by an Orchestrate-Agent and a Plan-Agent; acquires data via specialized Data-Agents that dynamically introspect APIs to synthesize robust download scripts; and completes analysis and reporting with a Coding-Agent that generates Python code, visualizations, and a final report, using a built-in self-correction loop. To enable systematic evaluation, we introduce Climate-Agent-Bench-85, a benchmark of 85 real-world tasks spanning atmospheric rivers, drought, extreme precipitation, heat waves, sea surface temperature, and tropical cyclones. On Climate-Agent-Bench-85, ClimateAgent achieves 100% task completion and a report quality score of 8.32, outperforming GitHub Copilot (6.27) and a GPT-5 baseline (3.26). These results demonstrate that multi-agent orchestration with dynamic API awareness and self-correcting execution substantially advances reliable, end-to-end automation of climate science analytics tasks.
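The self-correction loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_snippet`, `self_correcting_run`, and `toy_repair` are hypothetical, the retry limit of 3 is an assumption, and the `repair` callable stands in for whatever model call a real Coding-Agent would make. The sketch only shows the control flow: execute generated code, capture the error text, hand it to a repair step, and retry.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_snippet(code: str) -> Optional[str]:
    """Run code in a fresh namespace; return None on success, else the traceback text."""
    try:
        exec(compile(code, "<generated>", "exec"), {"__name__": "__generated__"})
        return None
    except Exception:
        # Syntax errors raised by compile() are caught here as well.
        return traceback.format_exc()


def self_correcting_run(
    code: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,  # assumed limit, not taken from the paper
) -> Tuple[bool, str, int]:
    """Run code; on failure pass (code, error) to repair and try again.

    Returns (succeeded, final_code, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        error = run_snippet(code)
        if error is None:
            return True, code, attempt
        if attempt < max_attempts:
            code = repair(code, error)
    return False, code, max_attempts


def toy_repair(code: str, error: str) -> str:
    """Hypothetical stand-in for a model call: fixes one known typo."""
    return code.replace("pritn(", "print(")


if __name__ == "__main__":
    ok, final_code, attempts = self_correcting_run("pritn('done')", toy_repair)
    print(ok, attempts)
```

In this toy run the first attempt fails with a `NameError`, the repair step rewrites the call, and the second attempt succeeds. A production system would execute generated code in an isolated process rather than with `exec`, which is used here only to keep the sketch short.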

Paper Structure

This paper contains 45 sections, 2 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of the ClimateAgent system architecture. The workflow illustrates how user queries are decomposed and processed by specialized agents, with robust error recovery and context management, to produce comprehensive climate science reports.
  • Figure 2: Qualitative comparison of generated figures for representative tasks. Each row corresponds to a climate task: (1) Drought (DR), (2) Sea Surface Temperature (SST), (3) Extreme Precipitation (EP), (4-5) Tropical Cyclone (TC), (6) Atmospheric River (AR). Columns: (a) Baseline (GPT-5), (b) Baseline (Copilot), (c) Ours, (d) Golden answer.
  • Figure 3: Comparison of array indexing: the baseline code fails due to a shape mismatch in boolean indexing, while our system validates shapes and applies correct masking.
  • Figure 4: Comparison of ERA5 data request formatting: the baseline code fails due to an invalid date range string, while our system programmatically generates and validates correct request parameters, preventing API errors.
  • Figure 5: Comparison of syntax handling: the baseline code fails with a syntax error due to an unmatched brace, while our system's coding agent ensures only syntactically valid code is executed.
  • ...and 1 more figure
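The failure modes named in the captions of Figures 3 and 5 can be made concrete with a short sketch. The code below is an illustration only, not the code shown in the paper's figures: `apply_mask` and `is_valid_python` are hypothetical helper names, and the array dimensions (time, lat, lon) are assumed for the example. The first helper checks that a boolean mask is compatible with the data before indexing; the second checks that generated source parses before it is executed.

```python
import ast

import numpy as np


def apply_mask(data: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep cells where mask is True and set the rest to NaN, after a shape check."""
    if mask.dtype != bool:
        raise TypeError(f"mask must be boolean, got {mask.dtype}")
    try:
        # A (lat, lon) mask broadcasts across a leading time axis.
        mask_b = np.broadcast_to(mask, data.shape)
    except ValueError as exc:
        raise ValueError(
            f"mask shape {mask.shape} is incompatible with data shape {data.shape}"
        ) from exc
    return np.where(mask_b, data, np.nan)


def is_valid_python(source: str) -> bool:
    """Return True if source parses as Python, without executing it."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    data = np.arange(24, dtype=float).reshape(2, 3, 4)  # assumed (time, lat, lon)
    mask = np.zeros((3, 4), dtype=bool)
    mask[0, 0] = True
    print(apply_mask(data, mask).shape)
    print(is_valid_python("params = {'year': '2020'"))  # unmatched brace
```

Checking shapes up front turns an indexing failure deep in an analysis script into an explicit, descriptive error that a correction step can act on, and parsing with `ast.parse` rejects code with an unmatched brace before any of it runs.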