S2SServiceBench: A Multimodal Benchmark for Last-Mile S2S Climate Services

Chenyue Li, Wen Deng, Zhuotao Sun, Mengxi Jin, Hanzhe Cui, Han Li, Shentong Li, Man Kit Yu, Ming Long Lai, Yuhao Yang, Mengqian Lu, Binhang Yuan

TL;DR

Subseasonal-to-seasonal (S2S) forecasts are increasingly delivered as service products, but translating them into actionable, uncertainty-aware decisions remains a last-mile bottleneck. The authors introduce S2SServiceBench, a multimodal benchmark drawn from an operational climate-service system to evaluate whether current MLLMs and agentic workflows can generate decision-support deliverables across 10 service products, six domains, and three service levels. Tasks are framed as schema-constrained outputs (SSC for signal extraction and SRG for structured reports) and assessed with both direct prompting and a standardized agentic workflow, revealing strong product-dependent variability and persistent bottlenecks in actionable signal comprehension, uncertainty-aware handoffs, and long-horizon planning under dynamic hazards. The findings suggest that generic prompts and scaffolds are insufficient, advocating for climate-service–specific agents with domain-aware representations, guardrails, and tooling to reliably produce operationally compliant last-mile outputs. S2SServiceBench thus provides a concrete evaluation framework to drive the development of dedicated climate-service agents and improve the practical utility of S2S forecasts for resilience and sustainability planning.

Abstract

Subseasonal-to-seasonal (S2S) forecasts provide a decision-critical weeks-to-months planning window for climate resilience and sustainability, yet a growing bottleneck is the last-mile gap: translating scientific forecasts into trusted, actionable climate services, which requires reliable multimodal understanding and decision-facing reasoning under uncertainty. Meanwhile, multimodal large language models (MLLMs) and corresponding agentic paradigms have made rapid progress in supporting various workflows, but it remains unclear whether they can reliably generate decision-making deliverables from operational service products (e.g., actionable signal comprehension, decision-making handoff, and decision analysis & planning) under uncertainty. We introduce S2SServiceBench, a multimodal benchmark for last-mile S2S climate services curated from an operational climate-service system to evaluate this capability. S2SServiceBench covers 10 service products with 150+ expert-selected cases in total, spanning six application domains: Agriculture, Disasters, Energy, Finance, Health, and Shipping. Each case is instantiated at three service levels, yielding around 500 tasks and 1,000+ evaluation items across climate resilience and sustainability applications. Using S2SServiceBench, we benchmark state-of-the-art MLLMs and agents and analyze performance across products and service levels. The results reveal persistent challenges in S2S service plot understanding and reasoning, namely actionable signal comprehension, operationalizing uncertainty into executable handoffs, and stable, evidence-grounded analysis and planning for dynamic hazards, and offer actionable guidance for building future climate-service agents.

Paper Structure (30 sections, 3 figures, 5 tables)

This paper contains 30 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Overview of the subseasonal-to-seasonal (S2S) scope. Top panel: the "last-mile" gap, in which S2S prediction providers struggle to tailor predictions and communicate confidence, while users struggle to interpret uncertainty and act on it [yang2026last]. Middle panel: this gap is partially bridged by curating S2S predictions into S2S service products, which better support users' decisions. Bottom panel: to further narrow the gap, an S2S service agent can interact with these service products and deliver actionable S2S decision support for end users.
  • Figure 2: Examples of operational S2S service products used as benchmark inputs. These products illustrate the multimodal structure common in practice: index/anomaly maps paired with categorical tiers and uncertainty cues (e.g., confidence overlays), which models must read and convert into structured decision-making deliverables in S2SServiceBench. Left: Drought Outlook (SPI-1), showing SPI continuous values and SPI categories with an ensemble-confidence overlay (hatched; confidence $\ge 0.5$) that indicates regions with strong inter-model agreement. Right: NDVI Outlook, showing the NDVI anomaly outlook and the NDVI risk outlook with categorical risk tiers (low/moderate/high) for Jan. 2026 at one-month lead.
  • Figure 3: Overview of S2SServiceBench. We curate recurring S2S service products from an operational climate-service system and package each operational instance as a case containing practitioner-facing multimodal artifacts with initialization and valid-time metadata. Each case is instantiated into three service task levels (Section \ref{sec:taxonomy_and_level}). Tasks are evaluated via (i) SSC with checkable fields and (ii) SRG with schema-constrained report-style outputs graded by a rubric (Appendix \ref{sec:oed_rubric}), yielding fine-grained evaluation items for diagnosis across products, service levels, and task types.
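
The packaging described in the Figure 3 caption (one case, three service levels, SSC fields checked individually) can be illustrated with a minimal Python sketch. It is not the benchmark's actual code. The `Case` fields, the level names in `LEVELS`, the example field names, and the exact-match rule in `score_ssc` are all assumptions made for illustration; the source only states that cases carry multimodal artifacts with initialization and valid-time metadata and that SSC tasks have checkable fields.

```python
from dataclasses import dataclass, field

# Assumed level names, borrowed from the three capabilities listed in the abstract.
LEVELS = ("signal_comprehension", "decision_handoff", "analysis_planning")


@dataclass
class Case:
    """One operational instance of a service product (hypothetical layout)."""
    product: str
    domain: str
    init_time: str
    valid_time: str
    artifacts: list[str] = field(default_factory=list)  # e.g. paths to plots


def instantiate_tasks(case: Case) -> list[dict]:
    """Create one task per service level for a single case."""
    return [{"case": case, "level": level} for level in LEVELS]


def score_ssc(predicted: dict, reference: dict) -> dict:
    """Score each reference field as one evaluation item.

    Exact match is an assumed rule; a missing field counts as incorrect.
    """
    return {name: predicted.get(name) == expected
            for name, expected in reference.items()}


if __name__ == "__main__":
    case = Case("Drought Outlook (SPI-1)", "Agriculture",
                init_time="2025-12", valid_time="2026-01")
    print(len(instantiate_tasks(case)), "tasks for this case")
    items = score_ssc({"risk_tier": "high"},
                      {"risk_tier": "high", "confident": True})
    print(items)
```

Scoring each field as a separate item is one way a single task could contribute several evaluation items, which would be consistent with the item count exceeding the task count in the abstract.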