Precipitation Downscaling with Spatiotemporal Video Diffusion

Prakhar Srivastava, Ruihan Yang, Gavin Kerrigan, Gideon Dresdner, Jeremy McGibbon, Christopher Bretherton, Stephan Mandt

TL;DR

This work tackles the challenge of high-resolution precipitation downscaling by modeling the full conditional distribution of fine-scale rainfall given coarse-grid inputs. It introduces SpatioTemporal Video Diffusion (STVD), a two-stage framework that first deterministically downscales the coarse sequence with a spatio-temporal UNet and then adds stochastic, multimodal detail through a diffusion model conditioned on both the low-resolution sequence and the deterministic prediction. The approach outperforms six strong baselines across multiple metrics (MSE, CRPS, EMD, PE, SAE) on FV3GFS-derived data, and ablations demonstrate the importance of temporal context and additional climate inputs. By preserving extreme-event statistics and fine-scale spatial structure, STVD offers a practical, probabilistic path to downscaling that can support climate risk assessment and regional planning under limited computational budgets.

Abstract

In climate science and meteorology, high-resolution local precipitation (rain and snowfall) predictions are limited by the computational costs of simulation-based methods. Statistical downscaling, or super-resolution, is a common workaround where a low-resolution prediction is improved using statistical approaches. Unlike traditional computer vision tasks, weather and climate applications require capturing the accurate conditional distribution of high-resolution given low-resolution patterns to assure reliable ensemble averages and unbiased estimates of extreme events, such as heavy rain. This work extends recent video diffusion models to precipitation super-resolution, employing a deterministic downscaler followed by a temporally-conditioned diffusion model to capture noise characteristics and high-frequency patterns. We test our approach on FV3GFS output, an established large-scale global atmosphere model, and compare it against six state-of-the-art baselines. Our analysis, capturing CRPS, MSE, precipitation distributions, and qualitative aspects using California and the Himalayas as examples, establishes our method as a new standard for data-driven precipitation downscaling.
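The abstract's emphasis on conditional distributions and ensemble averages is what CRPS is designed to score. As a minimal sketch (not the authors' evaluation code), the standard ensemble estimator of CRPS can be written as below; the function name `ensemble_crps` and the array layout (ensemble members on the first axis) are assumptions made for illustration.

```python
import numpy as np

def ensemble_crps(samples, truth):
    """Ensemble estimate of CRPS, computed independently at every grid point.

    samples: array of shape (M, ...) holding M ensemble members (assumed layout).
    truth:   array of shape (...) holding the reference field.
    Uses CRPS = E|X - y| - 0.5 * E|X - X'| with expectations over the ensemble.
    """
    samples = np.asarray(samples, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # Mean absolute error of each member against the reference.
    skill = np.abs(samples - truth).mean(axis=0)
    # Mean absolute difference over all ordered pairs of members.
    spread = np.abs(samples[:, None] - samples[None, :]).mean(axis=(0, 1))
    return skill - 0.5 * spread

# Two-member toy ensemble at a single grid point.
toy_samples = np.array([[0.0], [2.0]])
toy_truth = np.array([1.0])
toy_crps = ensemble_crps(toy_samples, toy_truth)
```

With a single ensemble member the spread term vanishes and the estimate reduces to the absolute error, which is why a deterministic model can be compared against probabilistic ones under this score.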

Paper Structure (27 sections, 2 equations, 11 figures, 3 tables, 2 algorithms)

Figures (11)

  • Figure 1: Our model's training and inference pipelines: Blue blocks apply to both phases, red blocks to training only, and green blocks to inference only. It deterministically downscales a low-resolution precipitation sequence using spatio-temporal factorized attention and models residuals with conditional diffusion (with factorized attention). Here, $T$ denotes sequence length and $N$ denotes diffusion steps. The parameters ($\theta=\{\phi,\psi\}$) are optimized jointly during training. See \ref{sec:appendix_arch} for details.
  • Figure 2: A qualitative comparison between our proposed model and the top baseline for a precipitation event associated with a cold front impinging on the Northern California coast and then the Sierra mountain range (coastline marked in hazy white). Fig. \ref{fig:topo} plots the regional topography. The time interval between adjacent frames is 3 hours; the plotted region is $1000 \times 1000$ km. Our model resolves the fine-grid precipitation structure better than the considered baselines. See \ref{sec:appendix_samples} for full-page, high-quality samples from the Himalayas and the Sierra.
  • Figure 3: Tradeoff between mean square error and percentile error (see \ref{subsec:analysis}). Inference over the Himalayan region (see \ref{fig:topo,fig:comparison-him}).
  • Figure 4: Distributions of the fine-grid three-hourly average precipitation, for all gridpoints around the globe. The Swin-IR baseline overestimates large precipitation events, whereas all other baselines underestimate key extreme and rare precipitation events. Our model aligns better with the fine-grid ground truth than any other model. This is also evident from the EMD and PE metrics discussed in \ref{tab:quantitative_results} and \ref{subsec:eval}.
  • Figure 5: Precipitation over two regions (left: Himalayas; right: Northern California coast, same region as \ref{fig:comparison}), averaged across a year, for our STVD model and the ground truth. For each half, the topography of the region is shown in the corresponding top-left, whereas the predicted annual average is shown in the corresponding bottom-right. Annually averaged precipitation is an important indicator of water availability in a region. STVD successfully captures many details of the precipitation that are tied to local topography and are too fine to be resolved by the coarse-grid data.
  • ...and 6 more figures
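
The Figure 1 caption describes a deterministic downscaling stage followed by a diffusion model over the residuals. The decomposition itself can be sketched as follows, under loud assumptions: nearest-neighbour upsampling stands in for the learned spatio-temporal UNet, the scale factor of 4 and the helper names (`coarsen`, `deterministic_downscale`, `residual_target`, `reconstruct`) are hypothetical, and the true residual is passed back in place of a diffusion sample.

```python
import numpy as np

def coarsen(fine, factor):
    """Block-average a (T, H, W) sequence by `factor` along both spatial axes."""
    t, h, w = fine.shape
    blocks = fine.reshape(t, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))

def deterministic_downscale(coarse, factor):
    """Stand-in for the deterministic stage: nearest-neighbour upsampling.
    The paper uses a learned network here; this placeholder is an assumption."""
    return coarse.repeat(factor, axis=1).repeat(factor, axis=2)

def residual_target(fine, coarse, factor):
    """Residual that a stochastic second stage would be trained to model."""
    return fine - deterministic_downscale(coarse, factor)

def reconstruct(coarse, sampled_residual, factor):
    """Final prediction: deterministic estimate plus a sampled residual."""
    return deterministic_downscale(coarse, factor) + sampled_residual

rng = np.random.default_rng(0)
# Synthetic, skewed, non-negative field; not real precipitation data.
fine = rng.gamma(shape=0.5, scale=2.0, size=(4, 16, 16))
coarse = coarsen(fine, factor=4)
residual = residual_target(fine, coarse, factor=4)
# Using the true residual as the "sample" recovers the fine field exactly.
recon = reconstruct(coarse, residual, factor=4)
```

With this particular stand-in, the residual averages to zero over each coarse cell, so the stochastic stage only has to supply sub-grid detail; a learned deterministic stage need not have that property.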