S2SServiceBench: A Multimodal Benchmark for Last-Mile S2S Climate Services
Chenyue Li, Wen Deng, Zhuotao Sun, Mengxi Jin, Hanzhe Cui, Han Li, Shentong Li, Man Kit Yu, Ming Long Lai, Yuhao Yang, Mengqian Lu, Binhang Yuan
TL;DR
Subseasonal-to-seasonal (S2S) forecasts are increasingly delivered as service products, but translating them into actionable, uncertainty-aware decisions remains a last-mile bottleneck. The authors introduce S2SServiceBench, a multimodal benchmark drawn from an operational climate-service system to evaluate whether current MLLMs and agentic workflows can generate decision-support deliverables across 10 service products, six domains, and three service levels. Tasks are framed as schema-constrained outputs (SSC for signal extraction and SRG for structured reports) and assessed with both direct prompting and a standardized agentic workflow, revealing strong product-dependent variability and persistent bottlenecks in actionable signal comprehension, uncertainty-aware handoffs, and long-horizon planning under dynamic hazards. The findings suggest that generic prompts and scaffolds are insufficient, advocating for climate-service–specific agents with domain-aware representations, guardrails, and tooling to reliably produce operationally compliant last-mile outputs. S2SServiceBench thus provides a concrete evaluation framework to drive the development of dedicated climate-service agents and improve the practical utility of S2S forecasts for resilience and sustainability planning.
Abstract
Subseasonal-to-seasonal (S2S) forecasts play an essential role in providing a decision-critical weeks-to-months planning window for climate resilience and sustainability, yet a growing bottleneck is the last-mile gap: translating scientific forecasts into trusted, actionable climate services, which requires reliable multimodal understanding and decision-facing reasoning under uncertainty. Meanwhile, multimodal large language models (MLLMs) and corresponding agentic paradigms have made rapid progress in supporting various workflows, but it remains unclear whether they can reliably generate decision-making deliverables from operational service products (e.g., actionable signal comprehension, decision-making handoff, and decision analysis and planning) under uncertainty. We introduce S2SServiceBench, a multimodal benchmark for last-mile S2S climate services, curated from an operational climate-service system to evaluate this capability. S2SServiceBench covers 10 service products with 150+ expert-selected cases in total, spanning six application domains: Agriculture, Disasters, Energy, Finance, Health, and Shipping. Each case is instantiated at three service levels, yielding around 500 tasks and 1,000+ evaluation items across climate resilience and sustainability applications. Using S2SServiceBench, we benchmark state-of-the-art MLLMs and agents and analyze performance across products and service levels. The results reveal persistent challenges in S2S service plot understanding and reasoning, namely actionable signal comprehension, operationalizing uncertainty into executable handoffs, and stable, evidence-grounded analysis and planning for dynamic hazards, and they offer actionable guidance for building future climate-service agents.
