AgentCaster: Reasoning-Guided Tornado Forecasting

Michael Chen

TL;DR

AgentCaster presents a contamination-free, end-to-end framework for evaluating multimodal LLMs as interactive meteorologists on real tornado forecasting tasks. It combines high-resolution HRRR-derived maps with on-demand soundings to generate SPC-style probabilistic risk polygons, evaluated with domain-specific metrics TornadoBench and TornadoHallucination over a 40-day benchmark. Results show substantial gaps between state-of-the-art LLMs and human forecasters, with widespread hallucinations and geographic misplacement, underscoring limits in current reasoning agents for critical real-world tasks. By releasing a large, reproducible dataset and code, the work aims to catalyze progress toward reliable AI assistants that can meaningfully augment human expertise in high-stakes weather forecasting.

Abstract

There is a growing need to evaluate Large Language Models (LLMs) on complex, high-impact, real-world tasks to assess their true readiness as reasoning agents. To address this gap, we introduce AgentCaster, a contamination-free framework employing multimodal LLMs end-to-end for the challenging, long-horizon task of tornado forecasting. Within AgentCaster, models interpret heterogeneous spatiotemporal data from a high-resolution convection-allowing forecast archive. We assess model performance over a 40-day period featuring diverse historical data, spanning several major tornado outbreaks and including over 500 tornado reports. Each day, models query interactively from a pool of 3,625 forecast maps and 40,125 forecast soundings for a forecast horizon of 12-36 hours. Probabilistic tornado-risk polygon predictions are verified against ground truths derived from geometric comparisons across disjoint risk bands in projected coordinate space. To quantify accuracy, we propose domain-specific TornadoBench and TornadoHallucination metrics, with TornadoBench highly challenging for both LLMs and domain expert human forecasters. Notably, human experts significantly outperform state-of-the-art models, which demonstrate a strong tendency to hallucinate and overpredict risk intensity, struggle with precise geographic placement, and exhibit poor spatiotemporal reasoning in complex, dynamically evolving systems. AgentCaster aims to advance research on improving LLM agents for challenging reasoning tasks in critical domains.
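
The comparison "across disjoint risk bands" mentioned in the abstract can be illustrated with a small sketch. It is not the paper's TornadoBench definition, which is not given in this section. The sketch assumes that risk regions are rasterized to sets of grid cells in a projected coordinate system, that nested probability contours are converted to non-overlapping bands by set difference, and that each band is scored by intersection over union. The names `Cell`, `block`, `disjoint_bands`, and `band_iou` are hypothetical.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]  # (row, col) of a grid cell in projected space (assumed)


def block(rows: range, cols: range) -> Set[Cell]:
    """Helper: the rectangular set of cells covering rows x cols."""
    return {(r, c) for r in rows for c in cols}


def disjoint_bands(nested: Dict[float, Set[Cell]]) -> Dict[float, Set[Cell]]:
    """Convert nested regions (risk >= threshold) into non-overlapping bands.

    Each band keeps only the cells that do not belong to any higher threshold.
    """
    thresholds = sorted(nested)
    bands: Dict[float, Set[Cell]] = {}
    for i, t in enumerate(thresholds):
        higher: Set[Cell] = set()
        for u in thresholds[i + 1:]:
            higher |= nested[u]
        bands[t] = nested[t] - higher
    return bands


def band_iou(pred: Dict[float, Set[Cell]],
             truth: Dict[float, Set[Cell]]) -> Dict[float, float]:
    """Intersection over union per band; bands empty on both sides are skipped."""
    scores: Dict[float, float] = {}
    for t in sorted(set(pred) | set(truth)):
        p, g = pred.get(t, set()), truth.get(t, set())
        union = p | g
        if union:
            scores[t] = len(p & g) / len(union)
    return scores


# Toy example: a predicted and an observed risk area, each with 2% and 5% contours.
pred_nested = {0.02: block(range(0, 4), range(0, 4)),
               0.05: block(range(1, 3), range(1, 3))}
truth_nested = {0.02: block(range(0, 4), range(2, 6)),
                0.05: block(range(1, 3), range(2, 4))}

scores = band_iou(disjoint_bands(pred_nested), disjoint_bands(truth_nested))
for threshold, iou in scores.items():
    print(f"{threshold:.0%} band IoU: {iou:.3f}")
```

Working with disjoint bands means a cell is credited at exactly one risk level, so a forecast that places the high-risk core correctly but draws the outer contour too wide is penalized only in the outer band.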

Paper Structure

This paper contains 36 sections, 5 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: A simplified overview of the AgentCaster framework. LLM agents act as AI meteorologists by first requesting and analyzing forecast maps, then passing specific longitudes and latitudes, which are processed to return targeted atmospheric soundings. Agents reason about severe weather dynamics and, when confident, generate probabilistic tornado risk predictions as geospatial polygons. These predictions are evaluated against ground truths derived from observed tornado reports through practically perfect forecasts (Hitchens et al., 2013) and compared with domain expert SPC forecast baselines.
  • Figure 2: Days with greater than 100 tornado reports.
  • Figure 3: Evaluation of SPC and the top-performing model on March 14, 2025. Overlapping solution regions are shaded.
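
The Figure 1 caption refers to ground truths derived from tornado reports through practically perfect forecasts. As a rough illustration of that idea, the sketch below smooths gridded reports with an isotropic Gaussian kernel, which is the commonly used formulation. It is not the authors' pipeline: the grid size, the bandwidth `sigma` (in grid cells), and the function name `practically_perfect` are placeholders chosen for this example, and the paper's actual grid spacing and smoothing settings are not stated in this section.

```python
import math
from typing import Iterable, List, Tuple


def practically_perfect(reports: Iterable[Tuple[int, int]],
                        n_rows: int,
                        n_cols: int,
                        sigma: float = 1.5) -> List[List[float]]:
    """Gaussian-smoothed report field on an n_rows x n_cols grid.

    reports: (row, col) cells containing at least one tornado report.
    sigma:   kernel bandwidth in grid cells (placeholder value, an assumption).
    """
    hit_cells = set(reports)  # a cell with several reports counts once
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    field = [[0.0] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            total = 0.0
            for rr, cc in hit_cells:
                dist_sq = (r - rr) ** 2 + (c - cc) ** 2
                total += norm * math.exp(-0.5 * dist_sq / sigma ** 2)
            field[r][c] = total
    return field


# Toy example: two nearby report cells on a 7 x 7 grid.
field = practically_perfect([(3, 3), (3, 4)], n_rows=7, n_cols=7)
for row in field:
    print(" ".join(f"{value:.3f}" for value in row))
```

Thresholding such a field at fixed probability levels would yield nested contours that can then be compared with forecast polygons; the specific thresholds used for verification are left to the paper.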