Precipitation Downscaling with Spatiotemporal Video Diffusion

Prakhar Srivastava; Ruihan Yang; Gavin Kerrigan; Gideon Dresdner; Jeremy McGibbon; Christopher Bretherton; Stephan Mandt

TL;DR

This work tackles high-resolution precipitation downscaling by modeling the full conditional distribution of fine-scale rainfall given coarse-grid inputs. It introduces SpatioTemporal Video Diffusion (STVD), a two-stage framework: a spatio-temporal UNet first downscales the input deterministically, and a conditional diffusion model, conditioned on both the low-resolution sequence and the downscaled mean, then adds stochastic, multimodal detail. The approach outperforms six strong baselines across multiple metrics (MSE, CRPS, EMD, PE, SAE) on FV3GFS-derived data, and ablations demonstrate the importance of temporal context and additional climate inputs. By preserving extreme-event statistics and fine-scale spatial structure, STVD offers a practical, probabilistic path to downscaling that can support climate risk assessment and regional planning under limited computational budgets.
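
The two-stage structure described above can be sketched in a few lines of Python. This is only an illustration of how a deterministic mean and a stochastic residual combine into an ensemble of samples, not the authors' implementation. The functions `deterministic_downscale` and `sample_residual` are hypothetical placeholders (nearest-neighbour upsampling and Gaussian noise) standing in for the spatio-temporal UNet and the conditional diffusion sampler. The sketch also treats a single frame, so it ignores temporal context, and the clipping to non-negative values is an assumption of this example.

```python
import random


def deterministic_downscale(coarse, factor):
    """Hypothetical stand-in for the deterministic downscaler.

    Uses nearest-neighbour upsampling; the paper uses a spatio-temporal UNet.
    """
    fine = []
    for row in coarse:
        wide = [value for value in row for _ in range(factor)]
        fine.extend([list(wide) for _ in range(factor)])
    return fine


def sample_residual(coarse, mean, rng, scale=0.1):
    """Hypothetical stand-in for the conditional diffusion sampler.

    Draws Gaussian noise whose spread grows with the local mean. A real
    sampler would condition on both `coarse` and `mean`; here `coarse`
    is accepted only to mirror that interface.
    """
    return [[rng.gauss(0.0, scale * (1.0 + m)) for m in row] for row in mean]


def downscale_sample(coarse, factor, rng):
    """Draw one fine-grid sample as deterministic mean plus sampled residual."""
    mean = deterministic_downscale(coarse, factor)
    residual = sample_residual(coarse, mean, rng)
    # Assumption of this sketch: clip at zero, since precipitation is non-negative.
    return [
        [max(0.0, m + r) for m, r in zip(mean_row, res_row)]
        for mean_row, res_row in zip(mean, residual)
    ]


coarse = [[0.0, 2.0], [1.0, 4.0]]  # toy 2x2 coarse-grid precipitation frame
rng = random.Random(0)
# Repeated sampling gives an ensemble that represents the conditional distribution.
ensemble = [downscale_sample(coarse, factor=2, rng=rng) for _ in range(8)]
```

Because only the residual is stochastic, every ensemble member shares the same deterministic mean, and the spread across members reflects the uncertainty attributed to unresolved fine-scale structure.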

Abstract

In climate science and meteorology, high-resolution local precipitation (rain and snowfall) predictions are limited by the computational costs of simulation-based methods. Statistical downscaling, or super-resolution, is a common workaround in which a low-resolution prediction is improved using statistical approaches. Unlike traditional computer vision tasks, weather and climate applications require accurately capturing the conditional distribution of high-resolution patterns given low-resolution ones, to ensure reliable ensemble averages and unbiased estimates of extreme events such as heavy rain. This work extends recent video diffusion models to precipitation super-resolution, employing a deterministic downscaler followed by a temporally-conditioned diffusion model to capture noise characteristics and high-frequency patterns. We test our approach on output from FV3GFS, an established large-scale global atmosphere model, and compare it against six state-of-the-art baselines. Our analysis, covering CRPS, MSE, precipitation distributions, and qualitative aspects using California and the Himalayas as examples, establishes our method as a new standard for data-driven precipitation downscaling.
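
For readers unfamiliar with CRPS (continuous ranked probability score), the following sketch shows the standard empirical estimator for a scalar ensemble forecast, CRPS = E|X - y| - 0.5 E|X - X'|. The function name `crps_ensemble` and the toy numbers are assumptions of this example; the summary above does not state which estimator or aggregation over gridpoints the paper uses.

```python
def crps_ensemble(members, observation):
    """Empirical CRPS of an ensemble forecast against one scalar observation.

    Implements mean|x_i - y| - 0.5 * mean|x_i - x_j|, where the second mean
    runs over all ordered pairs of ensemble members. Lower is better.
    """
    m = len(members)
    # Average distance between ensemble members and the observation.
    term_obs = sum(abs(x - observation) for x in members) / m
    # Half the average pairwise distance between ensemble members.
    term_spread = sum(abs(a - b) for a in members for b in members) / (2.0 * m * m)
    return term_obs - term_spread


# A single-member ensemble reduces to the absolute error.
single = crps_ensemble([1.0], 3.0)
# A spread-out ensemble centred on the observation is rewarded for its spread.
centred = crps_ensemble([0.0, 2.0], 1.0)
```

The second term is what distinguishes CRPS from a plain error measure: an ensemble that collapses onto its mean gets no credit for representing uncertainty, which is why the score suits probabilistic downscaling.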

Paper Structure (27 sections, 2 equations, 11 figures, 3 tables, 2 algorithms)

Introduction
Downscaling via Spatiotemporal Video Diffusion
Problem Statement
Solution Sketch
Probabilistic Modeling of Downscaling
Deterministic Downscaling
Stochastic Residual Modeling via Diffusion
Loss Function
Network Architecture
Experiments
Dataset
Training and Testing Details
Baseline Models
Evaluation Metrics
Qualitative and Quantitative Analysis
...and 12 more sections

Figures (11)

Figure 1: Our model's training and inference pipelines. Blue blocks apply to both phases, red blocks to training only, and green blocks to inference only. The model deterministically downscales a low-resolution precipitation sequence using spatio-temporal factorized attention and models the residuals with conditional diffusion (also with factorized attention). Here, $T$ denotes the sequence length and $N$ the number of diffusion steps. The parameters $\theta=\{\phi,\psi\}$ are optimized jointly during training. See the architecture appendix for details.
Figure 2: A qualitative comparison between our proposed model and the top baseline for a precipitation event associated with a cold front impinging on the Northern California coast and then the Sierra mountain range (coastline marked in hazy white). The regional topography is plotted in the topography figure. The time interval between adjacent frames is 3 hours; the plotted region is $1000 \times 1000$ km. Our model resolves the fine-grid precipitation structure better than the considered baselines. See the samples appendix for full-page, high-quality samples from the Himalayas and the Sierra.
Figure 3: Tradeoff between mean squared error and percentile error (see the Qualitative and Quantitative Analysis section). Inference is over the Himalayan region (see the topography and Himalaya comparison figures).
Figure 4: Distributions of the fine-grid three-hourly average precipitation for all gridpoints around the globe. The Swin-IR baseline overestimates large precipitation events, whereas all other baselines underestimate key extreme and rare precipitation events. Our model aligns better with the fine-grid ground truth than any other model. This is also evident from the EMD and PE metrics discussed in the quantitative results table and the Evaluation Metrics section.
Figure 5: Precipitation over two regions (left: Himalayas; right: Northern California coast, the same region as in the qualitative comparison figure), averaged across a year, for our STVD model and the ground truth. In each half, the regional topography is shown at the top left and the predicted annual average at the bottom right. Annually averaged precipitation is an important indicator of water availability in a region. STVD successfully captures many details of the precipitation that are tied to local topography and are too fine to be resolved by the coarse-grid data.
...and 6 more figures
