SynWeather: Weather Observation Data Synthesis across Multiple Regions and Variables via a General Diffusion Transformer
Kaiyi Xu, Junchao Gong, Zhiwang Zhou, Zhangrui Li, Yuandong Pu, Yihao Liu, Ben Fei, Fenghua Ling, Wenlong Zhang, Lei Bai
TL;DR
This work introduces SynWeather, the first standardized dataset enabling unified multi-region and multi-variable weather observation data synthesis, and SynWeatherDiff, a probabilistic diffusion-transformer model guided by text prompts. The model encodes all weather variables into a shared latent space via a general autoencoder and performs conditional denoising in that latent space using a ViT-based satellite encoder and CLIP-derived prompts, mitigating over-smoothing and enabling cross-variable complementarity. Across seven synthesis tasks spanning CONUS, Europe, East Asia, and Tropical Cyclone regions, SynWeatherDiff demonstrates strong universal capabilities and often outperforms specialized baselines, with ablations highlighting the value of diverse input channels, task-focused prompts, and cross-variable learning. The dataset and baseline model establish a framework for flexible, region- and variable-agnostic weather data synthesis under a unified evaluation protocol, with practical implications for nowcasting, data assimilation, and downstream forecasting applications. The formulations $Y_{r,b} = f_{r,b}(X_r)$ and $Y_{r,b} = g(X_r, P_{r,b})$ illustrate the specialized versus general settings central to the study, while the diffusion objective $\mathcal{L} = \mathbb{E}_{z^t_{r,b},\epsilon,t}[\| \epsilon_\theta(z^t_{r,b},t,X_{r,b},P_{r,b}) - \epsilon \|_2^2]$ underpins the probabilistic generation framework.
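The noise-prediction objective above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the linear beta schedule, the tensor shapes, and the names `make_alpha_bar`, `add_noise`, `toy_denoiser`, and `diffusion_loss` are assumptions made for illustration, and `toy_denoiser` is a hypothetical stand-in for the Diffusion Transformer $\epsilon_\theta$.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Assumed DDPM-style linear beta schedule; the source does not specify one.
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, eps, t, alpha_bar):
    # Forward process: z_t = sqrt(alpha_bar_t) * z_0 + sqrt(1 - alpha_bar_t) * eps
    a = alpha_bar[t].reshape(-1, *([1] * (z0.ndim - 1)))
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps


def toy_denoiser(z_t, t, x_cond, p_cond):
    # Hypothetical placeholder for eps_theta(z_t, t, X, P): always predicts zero noise.
    return np.zeros_like(z_t)


def diffusion_loss(denoiser, z0, x_cond, p_cond, alpha_bar, rng):
    # Monte Carlo estimate of E[ || eps_theta(z_t, t, X, P) - eps ||_2^2 ].
    batch = z0.shape[0]
    t = rng.integers(0, len(alpha_bar), size=batch)  # one random timestep per sample
    eps = rng.standard_normal(z0.shape)              # target Gaussian noise
    z_t = add_noise(z0, eps, t, alpha_bar)           # noised latent
    eps_hat = denoiser(z_t, t, x_cond, p_cond)       # conditional noise prediction
    reduce_axes = tuple(range(1, z0.ndim))
    per_sample = np.sum((eps_hat - eps) ** 2, axis=reduce_axes)
    return float(per_sample.mean())
```

Here `z0` plays the role of the latent $z_{r,b}$ produced by the autoencoder, while `x_cond` and `p_cond` stand for the satellite features $X_{r,b}$ and the prompt embedding $P_{r,b}$. A real model would replace `toy_denoiser` with a trained network and minimize this loss by gradient descent.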
Abstract
With the advancement of meteorological instruments, abundant data has become available. Current approaches typically focus on single-variable, single-region tasks and rely primarily on deterministic modeling. This limits unified synthesis across variables and regions, overlooks cross-variable complementarity, and often leads to over-smoothed results. To address these challenges, we introduce SynWeather, the first dataset designed for Unified Multi-region and Multi-variable Weather Observation Data Synthesis. SynWeather covers four representative regions (the Continental United States, Europe, East Asia, and Tropical Cyclone regions) and provides high-resolution observations of key weather variables, including Composite Radar Reflectivity, Hourly Precipitation, Visible Light, and Microwave Brightness Temperature. In addition, we introduce SynWeatherDiff, a general and probabilistic weather synthesis model built upon the Diffusion Transformer framework to address the over-smoothing problem. Experiments on the SynWeather dataset demonstrate the effectiveness of our network compared with both task-specific and general models.
