HVR-Met: A Hypothesis-Verification-Replanning Agentic System for Extreme Weather Diagnosis

Shuo Tang; Jiadong Zhang; Jian Xu; Gengxian Zhou; Qizhao Jin; Qinxuan Wang; Yi Hu; Ning Hu; Hongchang Ren; Lingli He; Jiaolan Fu; Jingtao Ding; Shiming Xiang; Chenglin Liu

TL;DR

This work proposes HVR-Met, a multi-agent meteorological diagnostic system that deeply integrates expert knowledge and supports iterative reasoning over anomalous meteorological signals during extreme weather events. It also introduces a novel benchmark focused on atomic-level subtasks.

Abstract

While deep learning-based weather forecasting paradigms have made significant strides, addressing extreme weather diagnostics remains a formidable challenge. This gap exists primarily because the diagnostic process demands sophisticated multi-step logical reasoning, dynamic tool invocation, and expert-level prior judgment. Although agents possess inherent advantages in task decomposition and autonomous execution, current architectures are still hampered by critical bottlenecks: inadequate expert knowledge integration, a lack of professional-grade iterative reasoning loops, and the absence of fine-grained validation and evaluation systems for complex workflows under extreme conditions. To this end, we propose HVR-Met, a multi-agent meteorological diagnostic system characterized by the deep integration of expert knowledge. Its central innovation is the "Hypothesis-Verification-Replanning" closed-loop mechanism, which facilitates sophisticated iterative reasoning for anomalous meteorological signals during extreme weather events. To bridge gaps within existing evaluation frameworks, we further introduce a novel benchmark focused on atomic-level subtasks. Experimental evidence demonstrates that the system excels in complex diagnostic scenarios.

Paper Structure (14 sections, 1 equation, 15 figures, 5 tables)

Introduction
Related Work
Methodology
Semi-automatic Construction of the Meteorological Knowledge Base
Multi-Agent Diagnostic Framework
Hypothesis–Verification–Replanning
A Comprehensive Benchmark for Extreme Weather Diagnosis
Experiment
Experiment Settings
Performance Comparison across Diagnostic Workflow Stages
The Results of Atomic-level Subtasks
Ablation Study
Conclusion
Appendix

Figures (15)

Figure 1: Overview of the HVR-Met Framework. Designed to emulate the professional "Weather Consultation" process, HVR-Met is a multi-agent system that automates extreme weather diagnosis through a dynamic Hypothesis–Verification–Replanning loop. The framework orchestrates seven specialized agents to collaboratively execute diagnostic tasks: the Decomposer for strategic planning, the Data Specialist and Code Executor for rigorous data retrieval and computation, the Plotter and Image Checker for standardized visualization and quality assurance, the Diagnostician for multi-modal abductive reasoning, and the Reporter for synthesizing the final diagnostic report.
Figure 2: The Verification Pipeline for Figure Generation. We evaluate the "meteorological semantic integrity" of agent-generated visualizations via two parallel tracks: (1) Ground Truth Construction (Top Branch): "Gold standard" figures are extracted from meteorological literature, and a VLM generates binary QA pairs (e.g., checking for specific anomalies like vortices) which are rigorously verified by senior forecasters. (2) Agent Evaluation (Bottom Branch): Plotting requirements extracted from the original captions prompt the HVR-Met agent to autonomously generate and execute visualization code. Finally, the generated figure is fed back into the VLM to answer the original validation questions. The final score is quantified as the percentage of semantic alignment between the agent's output and the ground truth logic, ensuring physically consistent diagnostic visualization.
Figure 3: The Evaluation Pipeline for Meteorological Index Computation. This framework quantifies the agent's numerical precision against human-verified standards. The workflow proceeds in three stages: (1) Ground Truth Construction (Top-Left): Situational questions are formulated via LLMs (GPT), while the ground-truth values (GT) are derived from raw data using expert-grade programs. (2) Agent Inference (Top-Right): The HVR-Met agent processes the question to compute a predicted index value (labeled as 'Reply'). (3) Metric Calculation (Bottom): The system evaluates accuracy by calculating the Relative Error (RE) between the Reply and GT. A prediction is accepted as correct only if the absolute relative error is strictly below 0.05.
Figure 4: Performance Evaluation by Task Type. Comparison of model accuracy on (a) Index Calculation tasks and (b) Figure Extraction tasks.
Figure 5: Guide Library Example: metpy.calc.precipitable_water.
...and 10 more figures
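
The two scoring rules described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. The function names, the choice to normalize the relative error by the absolute ground-truth value, and the handling of a zero ground truth are all assumptions made for illustration. Only the strict 0.05 threshold and the idea of scoring figures as a percentage of matching binary answers come from the captions.

```python
from typing import Sequence

# Threshold stated in the Figure 3 caption: accept only if |RE| < 0.05.
RE_TOLERANCE = 0.05


def relative_error(reply: float, gt: float) -> float:
    """Relative error of the agent's reply against the ground truth.

    Assumption: the error is normalized by |gt|; the caption does not
    spell out the formula.
    """
    if gt == 0:
        # Assumption: a zero ground truth is treated as undefined here.
        raise ValueError("relative error is undefined for a zero ground truth")
    return (reply - gt) / abs(gt)


def index_is_correct(reply: float, gt: float, tol: float = RE_TOLERANCE) -> bool:
    """Return True only if the absolute relative error is strictly below tol."""
    return abs(relative_error(reply, gt)) < tol


def figure_alignment_score(
    gt_answers: Sequence[bool], agent_answers: Sequence[bool]
) -> float:
    """Percentage of binary QA answers on the generated figure that match
    the expert-verified answers for the reference figure."""
    if not gt_answers or len(gt_answers) != len(agent_answers):
        raise ValueError("answer lists must be non-empty and of equal length")
    matches = sum(g == a for g, a in zip(gt_answers, agent_answers))
    return 100.0 * matches / len(gt_answers)
```

For example, `index_is_correct(10.4, 10.0)` accepts the reply, while a reply of `9.0` against the same ground truth is rejected. A generated figure that agrees with three of four verified answers would score 75.0 under `figure_alignment_score`.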
