WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie, Renee Sieber, Fengran Mo

TL;DR

WeatherArchive-Bench introduces the first large-scale benchmark for retrieval-augmented reasoning on historical weather archives, addressing the scarcity of historical-narrative data and the need to extract societal vulnerability and resilience indicators. It pairs a corpus of more than one million archival news segments with two tasks, WeatherArchive-Retrieval and WeatherArchive-Assessment, both grounded in expert validation and vulnerability/resilience frameworks. The study benchmarks a wide range of retrieval models and LLMs, finding that sparse lexical methods excel on historical vocabulary while LLMs struggle with nuanced climate reasoning, and that re-ranking can improve the relevance of top-ranked results. By providing the data, tasks, and evaluation protocol, WeatherArchive-Bench aims to spur robust climate-focused RAG systems that translate archival narratives into actionable insights for adaptation and disaster preparedness.

Abstract

Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable to climate scientists seeking to understand societal responses. However, their vast scale, noisy digitization, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system's ability to locate historically relevant passages among more than one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems for archival contexts. The constructed dataset and evaluation framework are publicly available at https://anonymous.4open.science/r/WeatherArchive-Bench/.
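To make the sparse-retrieval setting concrete, the following is a minimal sketch of BM25-style lexical scoring over a handful of passages. It is an illustration only: the source says "sparse" retrievers without naming a scorer here, so BM25 is an assumption, as are the function names `tokenize` and `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, and the three toy passages, which are invented and not taken from the benchmark corpus.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; real archives need OCR-aware cleanup).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in tokenized:
        df.update(set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical passages; "freshet" stands in for archaic weather vocabulary.
docs = [
    "a violent freshet carried away the mill bridge",
    "the council met to discuss new street lamps",
    "heavy rain flooded cellars along the river",
]
scores = bm25_scores("freshet bridge damage", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Because the score depends only on exact term overlap, a rare archaic word such as "freshet" receives a high inverse-document-frequency weight whenever it appears in the query, which is one plausible reason lexical methods cope well with historical vocabulary. The same property means the third passage, which describes flooding in modern wording, scores zero for this query.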

Paper Structure

This paper contains 59 sections, 2 equations, 5 figures, 12 tables.

Figures (5)

  • Figure 1: The construction pipeline of the retrieval task in weather archive collections. The process integrates newspaper collection, keyword frequency search, and human verification to construct a high-quality corpus of weather-related articles with relevance judgments for each query.
  • Figure 2: Construction pipeline of the WeatherArchive-Assessment task on societal vulnerability and resilience. GPT-4.1 evaluates retrieved weather articles across multiple criteria, with human verification ensuring quality before ground-truth answers are generated. The sample case shows the assessment of rainstorm impacts.
  • Figure 3: Performance comparison of LLMs on the free-form QA task across various metrics.
  • Figure 4: Word cloud of keywords in the weather archives.
  • Figure 5: Comparison of LLM free-form QA performance on the ROUGE-1 and LLM-Judge metrics.