FLEX: A Backbone for Diffusion-Based Modeling of Spatio-temporal Physical Systems

N. Benjamin Erichson, Vinicius Mikuni, Dongwei Lyu, Yang Gao, Omri Azencot, Soon Hoe Lim, Michael W. Mahoney

TL;DR

FLEX tackles the challenge of modeling high-dimensional spatio-temporal physical systems with diffusion, introducing a residual-space, velocity-parametrized diffusion backbone that embeds a latent Transformer within a U-Net. It achieves hierarchical conditioning via a task-specific encoder, enabling both weak and strong conditioning to balance diversity and fidelity. Theoretical analysis indicates residual-space initialization reduces the variance of the optimal velocity field, improving stability, while experiments on 2048×2048 2D turbulence demonstrate state-of-the-art super-resolution and forecasting with calibrated uncertainty and zero-shot generalization to unseen observables and boundary conditions. Overall, FLEX provides a scalable, uncertainty-aware framework that integrates global context modeling with local detail for physics-guided generative modeling of complex flows.

Abstract

We introduce FLEX (FLow EXpert), a backbone architecture for generative modeling of spatio-temporal physical systems using diffusion models. FLEX operates in the residual space rather than on raw data, a modeling choice that we motivate theoretically, showing that it reduces the variance of the velocity field in the diffusion model, which helps stabilize training. FLEX integrates a latent Transformer into a U-Net with standard convolutional ResNet layers and incorporates a redesigned skip connection scheme. This hybrid design enables the model to capture both local spatial detail and long-range dependencies in latent space. To improve spatio-temporal conditioning, FLEX uses a task-specific encoder that processes auxiliary inputs such as coarse or past snapshots. Weak conditioning is applied to the shared encoder via skip connections to promote generalization, while strong conditioning is applied to the decoder through both skip and bottleneck features to ensure reconstruction fidelity. FLEX achieves accurate predictions for super-resolution and forecasting tasks using as few as two reverse diffusion steps. It also produces calibrated uncertainty estimates through sampling. Evaluations on high-resolution 2D turbulence data show that FLEX outperforms strong baselines and generalizes to out-of-distribution settings, including unseen Reynolds numbers, physical observables (e.g., fluid flow velocity fields), and boundary conditions.
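
The residual-space idea in the abstract can be illustrated with a small NumPy sketch for the super-resolution case. It is a toy example under stated assumptions, not the authors' implementation: it assumes the residual is the high-resolution field minus an upsampled copy of the coarse input, and the helper names (`upsample_nearest`, `downsample_mean`, `residual_target`, `reconstruct`) are hypothetical. It does not reproduce the paper's theoretical result on the velocity field; it only shows that, for a smooth synthetic field, the residual has a much smaller variance than the raw data and that adding it back to the upsampled input recovers the target.

```python
import numpy as np


def upsample_nearest(coarse, factor):
    # Repeat every coarse pixel `factor` times along both axes.
    return np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)


def downsample_mean(fine, factor):
    # Average non-overlapping factor x factor blocks.
    h, w = fine.shape
    return fine.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


def residual_target(fine, coarse, factor):
    # Assumed residual definition: what the upsampled input is still missing.
    return fine - upsample_nearest(coarse, factor)


def reconstruct(coarse, residual, factor):
    # Add a (predicted) residual back onto the upsampled conditioning input.
    return upsample_nearest(coarse, factor) + residual


# Synthetic smooth field plus small-scale noise (a stand-in, not turbulence data).
rng = np.random.default_rng(0)
n, factor = 64, 4
x = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
fine = np.sin(X) * np.cos(Y) + 0.05 * rng.standard_normal((n, n))

coarse = downsample_mean(fine, factor)
residual = residual_target(fine, coarse, factor)
recon = reconstruct(coarse, residual, factor)

print("variance of raw field:", fine.var())
print("variance of residual: ", residual.var())
print("max reconstruction error:", np.abs(recon - fine).max())
```

In a diffusion setting, the generative model would be trained to sample `residual` given `coarse`, and `reconstruct` would be applied to each sample at inference time. For forecasting, an analogous assumption would be a residual between the future snapshot and the most recent past snapshot.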

Paper Structure

This paper contains 39 sections, 3 theorems, 30 equations, 13 figures, 5 tables, 2 algorithms.

Key Result

Proposition 1

Let $D_F(p \| q)$ denote the Fisher divergence between two probability density functions $p$ and $q$. We have, for $t \in [0,1]$,

Figures (13)

  • Figure 1: FLEX is a backbone for modeling spatio-temporal physical systems using diffusion models. It learns residual corrections conditioned on task-specific inputs (e.g., low-resolution or past states) and physical parameters. The architecture integrates a task-specific encoder, a common encoder, a Transformer operating in latent space, and a decoder within a U-Net-style framework. The task-specific encoder weakly conditions the common encoder via shallow skip connections, and strongly conditions the decoder through deep skip connections and an embedding of the full conditional input. Shown here: FLEX instantiated for super-resolution.
  • Figure 2: Illustration of the multi-task FLEX backbone instantiated for super-resolution and forecasting. The backbone includes two task-specific encoders and shares a common encoder, a latent Transformer, and a decoder. During inference, only the encoder corresponding to the target task is used.
  • Figure 3: Illustration of the score-based diffusion process. During training, residual samples are progressively corrupted by Gaussian noise via the forward process according to a noise schedule. During inference, samples are reconstructed by simulating the reverse-time SDE/ODE, which progressively denoises a noisy sample back toward the clean residual. In our framework, we use the velocity-based parameterization (Equation \ref{eq_velocity}) of the score (see also Equation \ref{eq_v_intermsof_score} in Appendix \ref{app_theory_velocitymodel}).
  • Figure 4: Example snapshot demonstrating FLEX’s performance on vorticity field super-resolution at $Re = 16{,}000$. (a) Low-resolution vorticity snapshot with patch boundaries. (b–c) Comparison between the ground truth (b) and FLEX’s super-resolved output (c). (d) Error map showing small boundary artifacts. (e) Spatial uncertainty map estimated from ensembles showing higher uncertainty near complex vortex interactions. (f) Vorticity spectrum comparing reconstruction to the ground truth.
  • Figure 5: (a) Pull distribution analysis shows that the model provides unbiased uncertainty estimates, while being slightly overconfident. (b) The overconfidence decreases as a function of ensemble size and number of diffusion steps, yet the pull width does not reach $\sigma=1$.
  • ...and 8 more figures
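
The pull analysis described for Figure 5 can be sketched as follows. This is a minimal example assuming the common definition of the pull, (truth minus ensemble mean) divided by ensemble standard deviation; the paper's exact estimator is not given in this summary, and the function name `pull` and the synthetic data are hypothetical. A pull distribution centered at zero indicates unbiased predictions, and a width above 1 indicates that the ensemble spread is too small (overconfidence).

```python
import numpy as np


def pull(truth, ensemble):
    # ensemble has shape (n_members, n_points); truth has shape (n_points,).
    mean = ensemble.mean(axis=0)
    std = ensemble.std(axis=0, ddof=1)  # sample standard deviation per point
    return (truth - mean) / std


# Synthetic stand-in for model samples: the ensemble is centered correctly
# but its spread (0.8) is smaller than the true spread (1.0).
rng = np.random.default_rng(0)
n_points, n_members = 10_000, 32
center = rng.standard_normal(n_points)
truth = center + 1.0 * rng.standard_normal(n_points)
ensemble = center + 0.8 * rng.standard_normal((n_members, n_points))

pulls = pull(truth, ensemble)
pull_mean = float(pulls.mean())
pull_width = float(pulls.std())

print("pull mean: ", pull_mean)   # close to 0 for an unbiased ensemble
print("pull width:", pull_width)  # above 1 for an overconfident ensemble
```

With real model output, `ensemble` would hold repeated diffusion samples for the same conditioning input, flattened over grid points.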

Theorems & Definitions (6)

  • Proposition 1
  • proof
  • Lemma 1
  • proof
  • Proposition 2
  • proof