Rethinking Forward Processes for Score-Based Data Assimilation in High Dimensions

Eunbi Yoon, Donghan Kim, Dae Wook Kim

Abstract

Data assimilation is the process of estimating the time-evolving state of a dynamical system by integrating model predictions and noisy observations. It is commonly formulated as Bayesian filtering, but classical filters often struggle with accuracy or computational feasibility in high dimensions. Recently, score-based generative models have emerged as a scalable approach for high-dimensional data assimilation, enabling accurate modeling and sampling of complex distributions. However, existing score-based filters often specify the forward process independently of the data assimilation. As a result, the measurement-update step depends on heuristic approximations of the likelihood score, which can accumulate errors and degrade performance over time. Here, we propose a measurement-aware score-based filter (MASF) that defines a measurement-aware forward process directly from the measurement equation. This construction makes the likelihood score analytically tractable: for linear measurements, we derive the exact likelihood score and combine it with a learned prior score to obtain the posterior score. Numerical experiments covering a range of settings, including high-dimensional datasets, demonstrate improved accuracy and stability over existing score-based filters.
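The posterior-score decomposition described in the abstract can be illustrated with a minimal sketch. Assuming a linear-Gaussian measurement model $y = Hx + \eta$, $\eta \sim \mathcal{N}(0, R)$ (the names `H` and `R` are illustrative, not from the paper), the likelihood score is $\nabla_x \log p(y \mid x) = H^\top R^{-1}(y - Hx)$, and Bayes' rule in score form adds it to the prior score. Note this shows only the static identity; the paper's contribution is making the analogous score tractable along the diffusion.

```python
import numpy as np

def likelihood_score(x, y, H, R):
    """Score of a linear-Gaussian likelihood y = H x + eta, eta ~ N(0, R):
    grad_x log p(y | x) = H^T R^{-1} (y - H x)."""
    return H.T @ np.linalg.solve(R, y - H @ x)

def posterior_score(x, y, H, R, prior_score):
    """Bayes' rule in score form:
    grad_x log p(x | y) = grad_x log p(x) + grad_x log p(y | x)."""
    return prior_score(x) + likelihood_score(x, y, H, R)

# Toy check with a standard-normal prior N(0, I), whose posterior score is
# available in closed form: -(I + H^T R^{-1} H) x + H^T R^{-1} y.
d, m = 4, 2
rng = np.random.default_rng(0)
H = rng.standard_normal((m, d))
R = np.eye(m)
x = rng.standard_normal(d)
y = rng.standard_normal(m)
prior = lambda z: -z  # score of N(0, I)
s = posterior_score(x, y, H, R, prior)
closed = -(np.eye(d) + H.T @ np.linalg.solve(R, H)) @ x + H.T @ np.linalg.solve(R, y)
assert np.allclose(s, closed)
```

In practice the prior score would be a learned neural network evaluated along the reverse-time process rather than a closed-form Gaussian score.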

Paper Structure

This paper contains 49 sections, 13 theorems, 98 equations, 9 figures, and 1 algorithm.

Key Result

Theorem A.1

Let $A(\cdot)$ and $\Sigma(\cdot)$ be as in eq:prescribed_moments, with $A(t)$ invertible for all $t\in[0,1)$. Consider eq:linearSDE_app with $F:[0,1)\to\mathbb{R}^{d\times d}$ and $G:[0,1)\to\mathbb{R}^{d\times d}$ satisfying the stated conditions for all $t\in[0,1)$. Assume additionally that $F$ and $G$ are locally bounded on $[0,1)$ (e.g., continuous on $[0,T]$ for every $T<1$). Then for every $T<1$, the SDE eq:li
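The conditions on $F$ and $G$ are truncated in this excerpt, but the standard moment-matching relations for a linear SDE, consistent with the lemma titles listed below (variation of constants, matching the conditional mean, Lyapunov equation for the covariance), can be sketched as follows; this is a generic reconstruction, not the paper's exact statement.

```latex
\text{Linear SDE: } \quad dX_t = F(t)\,X_t\,dt + G(t)\,dW_t, \qquad X_0 = x_0.
\\[4pt]
\text{Matching the conditional mean } \mathbb{E}[X_t \mid x_0] = A(t)\,x_0
\;\Longrightarrow\; F(t) = \dot{A}(t)\,A(t)^{-1}.
\\[4pt]
\text{The conditional covariance obeys the Lyapunov ODE }
\dot{\Sigma}(t) = F(t)\,\Sigma(t) + \Sigma(t)\,F(t)^{\top} + G(t)\,G(t)^{\top},
\\[4pt]
\text{so matching a prescribed } \Sigma(\cdot) \text{ requires }
G(t)\,G(t)^{\top} = \dot{\Sigma}(t) - F(t)\,\Sigma(t) - \Sigma(t)\,F(t)^{\top}.
```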

Figures (9)

  • Figure 1: Schematic comparison of likelihood score. (a) Existing approaches specify the forward process independently of the measurement equation, which makes the likelihood intractable. (b) Our approach aligns the forward process with the measurement equation, so the likelihood score becomes tractable.
  • Figure 2: Pipeline of the proposed method, MASF. The forward process is constructed by interpolating between the identity and the measurement operator, so that the state is progressively degraded toward the measurement. The reverse-time process samples state trajectories from the posterior.
  • Figure 3: State trajectories for the Lorenz--63 system with measurement gap $\mathbf{100}$. Each panel shows the reference trajectory and the assimilated trajectory produced by one of the considered methods: (a) EnKF, (b) SF, (c) SSLS, and (d) MASF. The title of each subplot reports the trajectory RMSE for a representative run (seed 1), followed by the mean $\pm$ standard deviation of RMSE computed over five random seeds. Overall, MASF achieves consistently lower RMSE compared to the baselines.
  • Figure 4: Performance on the Lorenz--96 system across state dimension, chaoticity, and measurement sparsity. Panels (a)--(b) vary the state dimension, (c)--(d) vary the forcing parameter, and (e)--(f) vary the measurement gap, with the remaining parameters fixed as indicated in each panel title. Across all three sweeps, MASF achieves consistently lower RMSE and shows robust performance under variations in dimension, forcing, and measurement gap. Error bars show the mean $\pm$ standard deviation of RMSE computed over five random seeds.
  • Figure 5: Performance on the Kolmogorov flow. (a) RMSE as a function of the measurement gap. Points show the mean over 5 random seeds and error bars indicate $\pm$ standard deviation across seeds. (b,c) RMSE over time for representative runs at gap$=5$ (b) and gap$=25$ (c) with seed 0. Open circles denote measurement-update steps; numbers in parentheses report the time-averaged RMSE for each method on the shown trajectory. Across gaps, MASF achieves the lowest mean RMSE compared to the baselines.
  • ...and 4 more figures
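The measurement-aware forward process described in the Figure 2 caption, interpolating between the identity and the measurement operator so the state is progressively degraded toward the measurement, can be sketched under simplifying assumptions. The linear schedule $A(t) = (1-t)I + tH$, the square masking operator $H$, and the noise schedule `sigma` below are all illustrative choices, not taken from the paper.

```python
import numpy as np

def forward_mean_operator(t, H, d):
    """Hypothetical interpolation A(t) = (1 - t) I + t H between the identity
    and a (square) measurement operator H: at t=0 the state is untouched,
    at t=1 its mean is fully degraded to H x0."""
    return (1.0 - t) * np.eye(d) + t * H

def forward_marginal_sample(x0, t, H, sigma, rng):
    """Sample x_t ~ N(A(t) x0, sigma(t)^2 I): the state is pushed toward the
    measurement while Gaussian noise is injected."""
    d = x0.shape[0]
    A = forward_mean_operator(t, H, d)
    return A @ x0 + sigma(t) * rng.standard_normal(d)

# Example: sparse observations modeled as a masking operator that keeps
# every other coordinate.
d = 6
H = np.diag([1.0, 0.0, 1.0, 0.0, 1.0, 0.0])
rng = np.random.default_rng(0)
x0 = rng.standard_normal(d)
sigma = lambda t: 0.1 * t  # assumed noise schedule, zero at t = 0
x_half = forward_marginal_sample(x0, 0.5, H, sigma, rng)
x_end = forward_mean_operator(1.0, H, d) @ x0  # noiseless endpoint equals H x0
assert np.allclose(x_end, H @ x0)
```

The reverse-time process would then run this degradation backward, starting from the measurement and using the learned posterior score to recover the full state.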

Theorems & Definitions (24)

  • Theorem A.1: Moment-matching SDE
  • Lemma A.2: Variation-of-constants formula
  • proof
  • Lemma A.3: Matching the conditional mean
  • proof
  • Proposition A.4: Lyapunov equation for the conditional covariance
  • proof
  • Lemma A.5: Matching the covariance
  • proof
  • Corollary A.6: Moment-matching with linear interpolation
  • ...and 14 more