CLLMate: A Multimodal Benchmark for Weather and Climate Events Forecasting
Haobo Li, Zhaowei Wang, Jiachen Wang, Yueya Wang, Alexis Kai Hon Lau, Huamin Qu
TL;DR
This paper introduces the Weather and Climate Event Forecasting (WCEF) task and the CLLMate dataset, which aligns ERA5 meteorological rasters with expert-validated environmental news events to forecast textual weather events and their consequences. It benchmarks 23 multimodal large language models across closed-source, open-source, and fine-tuned variants, revealing that while models can surpass random baselines, performance on consequence forecasting remains limited and highly dependent on task-specific alignment. The results underscore the value of fine-tuning and the need for domain-optimized architectures that can better bridge numerical meteorology with textual narratives. CLLMate serves as a foundational benchmark, enabling future research on integrating multimodal data for actionable, narratively grounded weather and climate forecasting. The work also highlights opportunities to expand modalities and incorporate richer knowledge representations to improve causal reasoning in environmental contexts.
Abstract
Forecasting weather and climate events is crucial for taking appropriate measures to mitigate environmental hazards and minimize losses. However, existing environmental forecasting research focuses narrowly on predicting numerical meteorological variables (e.g., temperature), neglecting the translation of these variables into actionable textual narratives of events and their consequences. To bridge this gap, we propose Weather and Climate Event Forecasting (WCEF), a new task that leverages numerical meteorological raster data and textual event data to predict weather and climate events. This task is challenging due to the difficulty of aligning multimodal data and the lack of supervised datasets. To address these challenges, we present CLLMate, the first multimodal dataset for WCEF, built from 26,156 environmental news articles aligned with ERA5 reanalysis data. We systematically benchmark 23 existing multimodal large language models (MLLMs) on CLLMate, including closed-source, open-source, and our fine-tuned models. Our experiments reveal the advantages and limitations of existing MLLMs and the value of CLLMate for the training and benchmarking of the WCEF task.
