WXImpactBench: A Disruptive Weather Impact Understanding Benchmark for Evaluating Large Language Models

Yongan Yu, Qingchen Hu, Xianda Du, Jiayin Wang, Fengran Mo, Renee Sieber

TL;DR

WXImpactBench is a benchmark for disruptive weather impact understanding. It is built from historical newspapers through a four-stage pipeline and evaluates LLMs on two tasks: multi-label classification and ranking-based QA. The dataset and evaluation framework enable systematic assessment of how well large language models grasp the social, economic, and policy consequences of weather events across historical and modern narratives. Across extensive experiments, larger models perform better but remain imperfect; long-context de-noising helps classification, and the ranking task reveals model- and prompt-based biases. The benchmark offers a practical tool for advancing climate-change adaptation systems by exposing current limitations and guiding domain-specific improvements. The work also emphasizes data quality, annotation guidelines, and ethical considerations for using historical textual corpora in AI research.

Abstract

Climate change adaptation requires understanding the impacts of disruptive weather on society, a task to which large language models (LLMs) might be applicable. However, their effectiveness is under-explored because high-quality corpora are difficult to collect and benchmarks are lacking. Climate-related events reported in regional newspapers record how communities adapted to and recovered from disasters, but processing the original corpus is non-trivial. In this study, we first develop a disruptive weather impact dataset with a well-crafted four-stage construction pipeline. Then, we propose WXImpactBench, the first benchmark for evaluating the capacity of LLMs on disruptive weather impacts. The benchmark involves two evaluation tasks: multi-label classification and ranking-based question answering. Extensive experiments on a set of LLMs provide first-hand analysis of the challenges in developing disruptive weather impact understanding and climate change adaptation systems. The constructed dataset and the code for the evaluation framework are available to help society protect against vulnerabilities from disasters.
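As a rough illustration of the two task types named above, the sketch below scores a multi-label classifier with macro-averaged F1 and a ranking task with mean reciprocal rank. These are common choices for such tasks, not necessarily the metrics the benchmark itself reports. The label names (impact_1 to impact_6), the function names, and the convention of scoring an undefined F1 as 0.0 are all assumptions made for this example.

```python
from typing import Sequence, Set

# Hypothetical placeholder labels: the text says there are six impact
# categories but does not name them here.
LABELS = [f"impact_{i}" for i in range(1, 7)]


def per_label_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]], label: str) -> float:
    """F1 for one label over paired gold/predicted label sets."""
    tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
    fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
    fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
    denom = 2 * tp + fp + fn
    # Assumption: a label that never appears in gold or pred scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: Sequence[Set[str]], pred: Sequence[Set[str]],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-label F1 across the label space."""
    return sum(per_label_f1(gold, pred, label) for label in labels) / len(labels)


def mean_reciprocal_rank(rankings: Sequence[Sequence[str]],
                         relevant: Sequence[str]) -> float:
    """Mean of 1/rank of the single relevant item in each ranked list."""
    if not rankings:
        return 0.0
    total = 0.0
    for ranked, target in zip(rankings, relevant):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(rankings)


if __name__ == "__main__":
    gold = [{"impact_1", "impact_2"}, {"impact_3"}]
    pred = [{"impact_1"}, {"impact_3", "impact_4"}]
    print(f"macro-F1: {macro_f1(gold, pred):.3f}")
    print(f"MRR: {mean_reciprocal_rank([['a', 'b', 'c'], ['b', 'a']], ['b', 'b']):.3f}")
```

Macro averaging weights every category equally, which matters when some impact types are rare; micro averaging would be a reasonable alternative if overall label-level accuracy is the priority.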

Paper Structure

This paper contains 36 sections, 3 equations, 6 figures, 16 tables.

Figures (6)

  • Figure 1: Climate-related polysemy examples in different narratives.
  • Figure 2: The data construction pipeline consists of four main stages: (1) corpus collection from historical newspapers across two periods; (2) post-OCR correction for high-quality extraction; (3) article selection with defined categorization using LDA topic modeling and expert curation; (4) annotation by domain experts using a six-category impact classification scheme for understanding disruptive weather impacts.
  • Figure 3: Example of the text obtained from initial OCR and after our post-OCR correction.
  • Figure 4: Overview of the benchmark framework with two tasks. Six disruptive weather impacts form the label space of the classification task. Red text marks disruptive weather events (e.g., snowstorm, drought, and blizzard), yellow text highlights impact descriptions (e.g., damage assessments, resource needs), and green text marks narrative descriptions (e.g., geographical locations).
  • Figure 5: Example of OCR-digitized text from the Illustrated Montreal Gazette, dated February 18, 1885.
  • ...and 1 more figure