SynWeather: Weather Observation Data Synthesis across Multiple Regions and Variables via a General Diffusion Transformer

Kaiyi Xu, Junchao Gong, Zhiwang Zhou, Zhangrui Li, Yuandong Pu, Yihao Liu, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bai

TL;DR

This work introduces SynWeather, the first standardized dataset for unified multi-region and multi-variable weather observation data synthesis, and SynWeatherDiff, a probabilistic diffusion-transformer model guided by text prompts. The model encodes all weather variables into a shared latent space via a general autoencoder and performs conditional denoising in that latent space, using a ViT-based satellite encoder and CLIP-derived prompts; this mitigates over-smoothing and enables cross-variable complementarity. Across seven synthesis tasks spanning CONUS, Europe, East Asia, and Tropical Cyclone regions, SynWeatherDiff works as a single general model and often outperforms specialized baselines, with ablations highlighting the value of diverse input channels, task-focused prompts, and cross-variable learning. The dataset and baseline model establish a framework for flexible, region- and variable-agnostic weather data synthesis under a unified evaluation protocol, with practical implications for nowcasting, data assimilation, and downstream forecasting applications. The specialized formulation $Y_{r,b} = f_{r,b}(X_r)$ and the general formulation $Y_{r,b} = g(X_r, P_{r,b})$ capture the contrast central to the study, while the diffusion objective $\mathcal{L} = \mathbb{E}_{z^t_{r,b},\epsilon,t}[\| \epsilon_\theta(z^t_{r,b},t,X_{r,b},P_{r,b}) - \epsilon \|_2^2]$ underpins the probabilistic generation framework.
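To make the diffusion objective in the TL;DR concrete, the following is a minimal sketch of a one-sample Monte Carlo estimate of the noise-prediction loss. It is not the authors' implementation: the linear beta schedule, the 1000-step horizon, the flat-list latent, and the names `alpha_bar`, `add_noise`, and `noise_prediction_loss` are all assumptions made for illustration, and `eps_theta` stands in for any denoiser that accepts the noisy latent, the timestep, the satellite condition, and the prompt.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Cumulative signal fraction at step t under a linear beta schedule.

    The schedule (1e-4 to 0.02) is an assumed, commonly used default;
    the summary does not state which schedule the paper uses.
    """
    prod = 1.0
    for i in range(t + 1):
        beta = 1e-4 + (0.02 - 1e-4) * i / (num_steps - 1)
        prod *= 1.0 - beta
    return prod


def add_noise(z0, eps, t, num_steps):
    """Forward process: z_t = sqrt(a_bar) * z0 + sqrt(1 - a_bar) * eps."""
    a = alpha_bar(t, num_steps)
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]


def noise_prediction_loss(eps_theta, z0, x_cond, prompt, num_steps=1000, rng=random):
    """Single-sample estimate of ||eps_theta(z_t, t, X, P) - eps||_2^2.

    eps_theta: hypothetical denoiser callable (z_t, t, x_cond, prompt) -> list
    z0:        clean latent of the target variable, as a flat list of floats
    x_cond:    satellite condition, passed through untouched
    prompt:    task prompt, passed through untouched
    """
    t = rng.randrange(num_steps)               # sample a timestep uniformly
    eps = [rng.gauss(0.0, 1.0) for _ in z0]    # sample Gaussian noise
    z_t = add_noise(z0, eps, t, num_steps)     # noisy latent at step t
    pred = eps_theta(z_t, t, x_cond, prompt)   # predicted noise
    return sum((p - e) ** 2 for p, e in zip(pred, eps))
```

Averaging this quantity over many sampled latents, noise draws, and timesteps approximates the expectation in $\mathcal{L}$. A denoiser that recovers the injected noise exactly drives the loss to zero, which is the behavior training pushes toward.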

Abstract

With the advancement of meteorological instruments, abundant data has become available. Current approaches typically focus on single-variable, single-region tasks and rely primarily on deterministic modeling. This limits unified synthesis across variables and regions, overlooks cross-variable complementarity, and often leads to over-smoothed results. To address these challenges, we introduce SynWeather, the first dataset designed for Unified Multi-region and Multi-variable Weather Observation Data Synthesis. SynWeather covers four representative regions (the Continental United States, Europe, East Asia, and Tropical Cyclone regions) and provides high-resolution observations of key weather variables, including Composite Radar Reflectivity, Hourly Precipitation, Visible Light, and Microwave Brightness Temperature. In addition, we introduce SynWeatherDiff, a general and probabilistic weather synthesis model built upon the Diffusion Transformer framework to address the over-smoothing problem. Experiments on the SynWeather dataset demonstrate the effectiveness of our network compared with both task-specific and general models.

Paper Structure

This paper contains 34 sections, 7 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Overview of datasets and pipelines in weather variable synthesis. Compared to existing single-region, single-variable and deterministic modeling, SynWeather enables general multi-region, multi-variable probabilistic modeling.
  • Figure 2: Overview of SynWeather. SynWeather is a comprehensive dataset that covers four distinct regions and four key weather observation variables, integrating data from six satellite sources as a condition to support seven synthesis tasks. Extensive evaluations are conducted on seven models, comprising both task-specific and general synthesis models.
  • Figure 3: An overview of our SynWeatherDiff. The target variables are projected into a unified latent space using a general autoencoder. The satellite inputs are processed through a ViT-based encoder to extract features. A task-specific text prompt is encoded using a fine-tuned CLIP text encoder. The text tokens serve as conditional information to guide Text-Guided DiT for different weather synthesis tasks.
  • Figure 4: Visual results of the weather synthesis standard tasks by our SynWeatherDiff and other models.
  • Figure 5: The effect of input channels is analyzed across six tasks, focusing on four channel groups: (1) SWIR (shortwave infrared channels), (2) WV (water vapor channels), (3) LWIR (longwave infrared channels), and (4) GAS (gas absorption channels).
  • ...and 5 more figures