
PuYun-LDM: A Latent Diffusion Model for High-Resolution Ensemble Weather Forecasts

Lianjun Wu, Shengchen Zhu, Yuxuan Liu, Liuyu Kai, Xiaoduan Feng, Duomin Wang, Wenshuo Liu, Jingxuan Zhang, Kelvin Li, Bin Wang

TL;DR

PuYun-LDM tackles the diffusability challenges of high-resolution latent diffusion models for ensemble weather forecasting with two components: temporally informed conditioning from a 3D Masked AutoEncoder (3D-MAE) and variable-aware spectral regularization via VA-MFM. The framework models the transition distribution $p(X_t|X_{t-1})$ with a conditional diffusion process, embedding temporal evolution in the latent space and balancing spectral content across meteorological variables. ERA5-based experiments show that PuYun-LDM achieves better RMSE and CRPS than ENS at short lead times and remains competitive at longer horizons, and that efficient parallel ensemble generation makes global 15-day forecasts practical on NVIDIA H200 GPUs. The work offers a principled pathway for applying latent diffusion models to atmospheric fields by integrating physics-informed temporal conditioning and variable-aware spectral regularization, addressing both diffusability and heterogeneity in multivariate weather data.
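As a reading aid, the transition model above can be unrolled into a full forecast trajectory. The factorization below is a sketch that takes the first-order form $p(X_t|X_{t-1})$ at face value; the temporal conditioning may draw on additional past states, and the symbols $X_0$ (initial state) and $T$ (number of steps) are notation assumed here rather than taken from the paper. The step count uses the 15-day horizon and 6-hour resolution stated in the abstract.

$$
p(X_{1:T} \mid X_0) = \prod_{t=1}^{T} p(X_t \mid X_{t-1}), \qquad T = \frac{15 \times 24\ \mathrm{h}}{6\ \mathrm{h}} = 60
$$

Under this reading, each ensemble member is one independent sample of the 60-step chain, which is consistent with members being generated in parallel.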

Abstract

Latent diffusion models (LDMs) suffer from limited diffusability in high-resolution (≤0.25°) ensemble weather forecasting, where diffusability characterizes how easily a latent data distribution can be modeled by a diffusion process. Unlike natural image fields, meteorological fields lack task-agnostic foundation models and explicit semantic structures, making VFM-based regularization inapplicable. Moreover, existing frequency-based approaches impose identical spectral regularization across channels under a homogeneity assumption, which leads to uneven regularization strength given the inter-variable spectral heterogeneity of multivariate meteorological data. To address these challenges, we propose a 3D Masked AutoEncoder (3D-MAE) that encodes weather-state evolution features as an additional conditioning for the diffusion model, together with a Variable-Aware Masked Frequency Modeling (VA-MFM) strategy that adaptively selects thresholds based on the spectral energy distribution of each variable. Combining these components yields PuYun-LDM, which enhances latent diffusability and outperforms ENS at short lead times while remaining comparable to ENS at longer horizons. PuYun-LDM generates a 15-day global forecast with a 6-hour temporal resolution in five minutes on a single NVIDIA H200 GPU, while ensemble forecasts can be efficiently produced in parallel.
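The contrast between a fixed spectral threshold and a per-variable one can be illustrated with a small NumPy sketch. It is not the paper's VA-MFM implementation: the planar 2D FFT, the normalization of frequency to [0, 1], the 0.9 target energy ratio, the synthetic fields, and the function names `radial_frequency`, `retained_energy`, and `adaptive_cutoff` are all assumptions made for illustration.

```python
import numpy as np

def radial_frequency(h, w):
    """Radial frequency of each 2D FFT bin, scaled so the largest bin equals 1."""
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.sqrt(fy ** 2 + fx ** 2)
    return r / r.max()

def retained_energy(field, cutoff):
    """Fraction of spectral energy kept by the low-pass mask r <= cutoff."""
    power = np.abs(np.fft.fft2(field)) ** 2
    r = radial_frequency(*field.shape)
    return float(power[r <= cutoff].sum() / power.sum())

def adaptive_cutoff(field, target=0.9):
    """Smallest cutoff whose low-pass mask keeps at least `target` of the energy."""
    power = np.abs(np.fft.fft2(field)) ** 2
    r = radial_frequency(*field.shape).ravel()
    order = np.argsort(r)  # visit bins from low to high frequency
    cumulative = np.cumsum(power.ravel()[order]) / power.sum()
    idx = min(int(np.searchsorted(cumulative, target)), r.size - 1)
    return float(r[order][idx])

# Two synthetic "variables" (assumed stand-ins): one smooth, one rough.
rng = np.random.default_rng(0)
rough = rng.standard_normal((64, 128))
r = radial_frequency(64, 128)
smooth = np.fft.ifft2(np.fft.fft2(rough) * np.exp(-(r / 0.05) ** 2)).real

for name, field in [("smooth", smooth), ("rough", rough)]:
    fixed = retained_energy(field, 0.05)
    cut = adaptive_cutoff(field, target=0.9)
    print(f"{name}: fixed cutoff keeps {fixed:.3f}, adaptive cutoff = {cut:.3f}")
```

With one shared cutoff, the smooth field keeps most of its energy while the rough field keeps very little, so the same mask regularizes the two channels very differently. Choosing the cutoff per channel from the cumulative energy equalizes the retained ratio instead.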

Paper Structure

This paper contains 24 sections, 15 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: (Top) Impact of latent dimensionality on MSL forecasting at the first lead time. Although VAE reconstruction error decreases with increasing latent dimension, DiT generation performance degrades at high dimensions, indicating reduced latent diffusability. PuYun-LDM effectively alleviates this issue. (Bottom) Comparison of frequency-based regularization during VAE pretraining. FFM applies fixed low-pass thresholds of $0.25$ to latents and $0.05$ to inputs, whereas VA-MFM uses the same latent threshold but adopts adaptive, per-channel input thresholds. FFM yields highly variable retained energy ratios across variables, resulting in imbalanced regularization. RMSE values for Z500 and MSL are divided by $50$ for visualization clarity.
  • Figure 2: Overview of the PuYun-LDM framework for ensemble weather forecasting. (a) Overall architecture. Historical weather states are encoded by a VAE and a causal 3D-MAE, providing latent representations and temporal conditioning for an auto-regressive diffusion denoiser. (b) Pretraining of 3D-MAE ($k=4$). The encoder uses causal 3D convolutions with caching to enforce temporal causality and extract temporal evolution features. (c) Pretraining of VA-MFM. Channel-wise spectral analysis adaptively selects low-pass thresholds for the target, and high-frequency latent components are masked to suppress variable-specific artifacts and improve diffusability.
  • Figure 3: RMSE, CRPS, SSR, and rank histograms comparing models on z500, t850, 10v, and msl. For a fair comparison, ENS is evaluated against its corresponding analysis, HRES-fc0, and PuYun-LDM is evaluated against ERA5.
  • Figure 4: Comparison of normalized spectral energy of encoder- and diffusion-generated latents across different frequency bands under varying latent dimensionalities, and the effect on RMSE of progressively masking high-frequency components.
  • Figure 5: Visualization of Hurricane Dorian trajectories at 18:00 UTC on September 6, 2019. The blue curves show the tracks predicted by PuYun-LDM and ENS when initialized 1 to 9 days in advance. The mean, variance, and minimum of the ensemble landfall error with respect to the observed landfall are shown in the lower-left corner of each panel. Predicted trajectories whose landfall locations deviate by more than 200 km from the observed landfall are rendered with reduced opacity for clarity.
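The ensemble metrics named in Figure 3 can be sketched in a few lines of NumPy. These are common textbook forms, not necessarily the exact estimators used in the paper: the CRPS below is the standard (non-"fair") ensemble estimator, the spread-skill ratio omits any finite-ensemble correction factor, no latitude weighting is applied, and the function names are hypothetical.

```python
import numpy as np

def crps_ensemble(ens, obs):
    """CRPS estimate E|X - y| - 0.5 * E|X - X'|, averaged over grid points.

    ens has shape (members, ...); obs has the shape of a single member.
    """
    skill = np.abs(ens - obs).mean(axis=0)
    spread = np.abs(ens[:, None] - ens[None, :]).mean(axis=(0, 1))
    return float((skill - 0.5 * spread).mean())

def spread_skill_ratio(ens, obs):
    """Ensemble spread divided by the RMSE of the ensemble mean."""
    spread = np.sqrt(ens.var(axis=0, ddof=1).mean())
    rmse = np.sqrt(((ens.mean(axis=0) - obs) ** 2).mean())
    return float(spread / rmse)

def rank_histogram(ens, obs):
    """Count how often the observation falls at each rank among the members."""
    ranks = (ens < obs).sum(axis=0).ravel()
    return np.bincount(ranks, minlength=ens.shape[0] + 1)

# Synthetic check: members and observation drawn from the same distribution.
rng = np.random.default_rng(0)
members = rng.standard_normal((20, 1000))
truth = rng.standard_normal(1000)

print("CRPS:", crps_ensemble(members, truth))
print("SSR:", spread_skill_ratio(members, truth))
print("rank histogram:", rank_histogram(members, truth))
```

For an ensemble that is statistically consistent with the observations, as in this synthetic check, the spread-skill ratio sits close to 1 and the rank histogram is roughly flat. Values well below 1, or a U-shaped histogram, indicate an under-dispersive ensemble.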