ClimateBench-M: A Multi-Modal Climate Data Benchmark with a Simple Generative Method

Dongqi Fu, Yada Zhu, Zhining Liu, Lecheng Zheng, Xiao Lin, Zihao Li, Liri Fang, Katherine Tieu, Onkar Bhardwaj, Kommy Weldemariam, Hanghang Tong, Hendrik Hamann, Jingrui He

TL;DR

ClimateBench-M is presented as the first multi-modal climate benchmark that aligns ERA5 time series, NOAA extreme weather records, and NASA HLS imagery on a unified spatiotemporal grid. It supports three tasks: weather forecasting, thunderstorm alerting, and crop segmentation. The paper also introduces SGM (Simple Generative Model), an encoder-decoder framework with two pipelines, one for time series forecasting and one for image segmentation, trained with a causality-aware objective that draws on a variational DAG approach and neural Granger causality. In the reported experiments, SGM and its persistence-enhanced variant improve over baselines in MAE for forecasting, AUC-ROC for anomaly detection, and IoU and accuracy for crop segmentation. The work argues that integrated multi-modal benchmarks help advance robust, generalizable climate modeling, and it points to additional modalities and language-grounded representations as future directions.

Abstract

Climate science studies the structure and dynamics of Earth's climate system and seeks to understand how climate changes over time. Its data are usually stored as time series that record climate features, geolocation, time attributes, and so on. Recently, climate benchmarks have attracted considerable research attention. Beyond the most common task of weather forecasting, several pioneering benchmarks extend the modality, either toward domain-specific applications such as tropical cyclone intensity prediction and flash flood damage estimation, or toward climate statements and confidence levels expressed in natural language. To further motivate the development of artificial general intelligence for climate science, this paper first contributes a multi-modal climate benchmark, ClimateBench-M, which aligns (1) time series climate data from ERA5, (2) extreme weather event data from NOAA, and (3) satellite image data from NASA HLS at a unified spatial-temporal granularity. Second, for each data modality, we propose a simple but strong generative method that achieves competitive performance on the weather forecasting, thunderstorm alert, and crop segmentation tasks in ClimateBench-M. The data and code of ClimateBench-M are publicly available at https://github.com/iDEA-iSAIL-Lab-UIUC/ClimateBench-M.

Paper Structure

This paper contains 32 sections, 2 theorems, 23 equations, 5 figures, 5 tables.

Key Result

Lemma A.1

Let $\bm{A}^{(t)}$ be a weighted adjacency matrix (negative weights allowed). Then $\bm{A}^{(t)}$ has no loops of length $N$ if $\text{Tr}[(\bm{I} + \bm{A}^{(t)} \circ \bm{A}^{(t)})^{N}] - N = 0$.
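
The condition in Lemma A.1 can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function name `acyclicity_residual` and the two example matrices are hypothetical, and $N$ is assumed to be the number of nodes (the side length of the adjacency matrix).

```python
import numpy as np


def acyclicity_residual(A: np.ndarray) -> float:
    """Return Tr[(I + A ∘ A)^N] - N for a square weighted adjacency matrix A.

    Assumption: N is the number of nodes, i.e. the side length of A.
    """
    A = np.asarray(A, dtype=float)
    if A.ndim != 2 or A.shape[0] != A.shape[1]:
        raise ValueError("A must be a square matrix")
    n = A.shape[0]
    # The Hadamard square A ∘ A makes every weight non-negative, so negative
    # weights cannot cancel positive ones inside the trace.
    M = np.eye(n) + A * A
    return float(np.trace(np.linalg.matrix_power(M, n)) - n)


# Chain 0 -> 1 -> 2 with one negative weight: no loops.
dag = np.array([[0.0, 0.5, 0.0],
                [0.0, 0.0, -2.0],
                [0.0, 0.0, 0.0]])

# Same chain plus the edge 2 -> 0, which closes a 3-cycle.
cyclic = dag.copy()
cyclic[2, 0] = 1.0

print(acyclicity_residual(dag))     # zero for the acyclic graph
print(acyclicity_residual(cyclic))  # strictly positive once a loop exists
```

Expanding $(\bm{I} + \bm{B})^{N}$ with $\bm{B} = \bm{A}^{(t)} \circ \bm{A}^{(t)}$ gives a sum of non-negative terms $\binom{N}{k}\text{Tr}(\bm{B}^{k})$, and $\text{Tr}(\bm{B}^{k})$ is positive exactly when a closed walk of length $k$ exists, so the residual vanishes only when no such walk is present.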

Figures (5)

  • Figure 1: Left: Geographic distribution of the counties covered in ClimateBench-M (the number in each circle is the count of aggregated nearby counties). Right: A specific example for Jefferson, Alabama, U.S., at 9:00-10:00 UTC on 01/05/2017.
  • Figure 2: Example of the crop type segmentation task based on NASA HLS and USDA CDL.
  • Figure 3: The proposed Simple Generative Model (SGM). The upper part of the figure shows the time series forecasting pipeline, and the lower part shows the image segmentation pipeline. The two pipelines use different encoders and decoders.
  • Figure 4: Bayesian Network of 238 counties at the same hour on two consecutive days in the training data (i.e., May 1st and May 2nd, 2018).
  • Figure 5: Detailed Pipeline of Causality Discovery.

Theorems & Definitions (3)

  • Remark 3.1
  • Lemma A.1
  • Theorem A.2