FlowDA: Accurate, Low-Latency Weather Data Assimilation via Flow Matching

Ran Cheng, Lailai Zhu

TL;DR

FlowDA addresses the computational bottleneck of traditional data assimilation by using flow matching to perform fast, data-driven analyses conditioned on sparse observations. It combines a SetConv-based observation embedding with a fine-tuned Aurora foundation model to learn a velocity field that morphs a background state into an analysis, enabling low-latency, robust assimilation. The method outperforms baselines on single-step and long-horizon cycling tasks across varying observation densities and noise levels, with strong robustness and substantial speed advantages. This approach demonstrates a scalable, data-driven direction for weather-scale data assimilation with practical implications for ML-based forecasting pipelines.

Abstract

Data assimilation (DA) is a fundamental component of modern weather prediction, yet it remains a major computational bottleneck in machine learning (ML)-based forecasting pipelines due to reliance on traditional variational methods. Recent generative ML-based DA methods offer a promising alternative but typically require many sampling steps and suffer from error accumulation under long-horizon auto-regressive rollouts with cycling assimilation. We propose FlowDA, a low-latency weather-scale generative DA framework based on flow matching. FlowDA conditions on observations through a SetConv-based embedding and fine-tunes the Aurora foundation model to deliver accurate, efficient, and robust analyses. Experiments across observation rates decreasing from $3.9\%$ to $0.1\%$ demonstrate superior performance of FlowDA over strong baselines with similar tunable-parameter size. FlowDA further shows robustness to observational noise and stable performance in long-horizon auto-regressive cycling DA. Overall, FlowDA points to an efficient and scalable direction for data-driven DA.
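
The flow-matching idea summarized above (a velocity field that carries a background state to an analysis, integrated with forward Euler steps) can be illustrated with a toy sketch. This is not FlowDA's implementation: in the paper the velocity comes from a fine-tuned Aurora model conditioned on observations, whereas `make_toy_velocity` below is a hypothetical stand-in that simply knows the target state, and the function names, state values, and step count are assumptions chosen for illustration.

```python
from typing import Callable, List

State = List[float]
VelocityFn = Callable[[State, float], State]


def euler_integrate(background: State, velocity: VelocityFn, n_steps: int) -> State:
    """Integrate dX/dtau = velocity(X, tau) from tau=0 to tau=1 with forward Euler."""
    if n_steps < 1:
        raise ValueError("n_steps must be at least 1")
    d_tau = 1.0 / n_steps
    state = list(background)  # flow state starts at the background
    for k in range(n_steps):
        tau = k * d_tau
        u = velocity(state, tau)
        # One forward Euler update of the flow state.
        state = [x + d_tau * v for x, v in zip(state, u)]
    return state


def make_toy_velocity(target: State) -> VelocityFn:
    """Hypothetical stand-in for a learned velocity field (assumption).

    Returns the exact velocity of a straight-line path that reaches
    `target` at tau=1; a real model would predict this from observations.
    """
    def velocity(state: State, tau: float) -> State:
        return [(t - x) / (1.0 - tau) for t, x in zip(target, state)]
    return velocity


# Toy 3-point "state" (values are illustrative, not from the paper).
background = [280.0, 285.0, 290.0]
truth = [281.5, 284.0, 290.5]
analysis = euler_integrate(background, make_toy_velocity(truth), n_steps=4)
print(analysis)
```

Because the toy path is a straight line, the velocity is constant along it and even a few Euler steps land on the target. A learned field is generally not exactly straight, so the number of steps becomes a trade-off between accuracy and latency.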

Paper Structure

This paper contains 23 sections, 9 equations, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: Workflow of FlowDA inference for single-step DA initialized from a 48-hour forecast. FlowDA first applies a SetConv layer whose weights depend on the local observation density $\alpha_m$ and relative distance, serving as an analogue of the inverse observation operator $\mathcal{H}^{-1}$. A fine-tuned Aurora model then takes the flow state ${\mathbf{X}}_{t,\tau}$, conditioned on ${\mathbf{x}}^{\mathrm{o}}_t$ and ${\bm{\rho}}^{\mathrm{o}}_t$, and estimates a velocity field ${\mathbf{u}}^{\theta}_{\tau}$ that corrects the background ${\mathbf{x}}^{\mathrm{b}}_t$ into the analysis ${\mathbf{x}}^{\mathrm{a}}_t$. The analysis is produced through iterative forward Euler integration.
  • Figure 2: Workflow of the SetConv operator, which converts sparse observations $\mathbf{y}_t$ into a continuous model-space field $\mathbf{x}^{\mathrm{o}}_t$ and the corresponding observation density distribution ${\bm{\rho}}^{\mathrm{o}}_t$ via an MLP-based kernel.
  • Figure 3: Stage I and II fine-tuning protocols.
  • Figure 4: Single-step DA with ERA5-based observations perturbed by additive Gaussian noise. Row 1: background field (left) from a 48-hour free-running forecast and the corresponding background error (right). Row 2: FlowDA analysis (left) and increment (right) for $\alpha \approx 0.1\%$ and $\tilde{\sigma}_{\text{noise}}=0.0$. Row 3: FlowDA analysis (left) and increment (right) for $\alpha \approx 3.9\%$ and $\tilde{\sigma}_{\text{noise}}=0.2$.
  • Figure 5: Benchmark comparison of FlowDA against baselines in a 15-day cycling DA experiment (6-hour cycle) with four observation rates $\alpha$. For each rate, the observation locations are held fixed over the full cycle. Shown are the lead-time evolutions of RMSE for z500, t850, and T2M relative to the ERA5 ground truth.
  • ...and 1 more figure
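
The SetConv step described in the Figure 1 and Figure 2 captions (sparse observations mapped to a gridded field plus an observation-density channel) can be sketched in one dimension. This is a simplified illustration under stated assumptions: the captions describe an MLP-based kernel whose weights depend on local observation density and relative distance, while the sketch below swaps in a fixed Gaussian kernel of distance only, and it assumes the field channel is normalized by the density channel. The name `setconv_1d`, the `length_scale` and `eps` parameters, and all numeric values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def setconv_1d(
    obs_pos: Sequence[float],
    obs_val: Sequence[float],
    grid: Sequence[float],
    length_scale: float,
    eps: float = 1e-8,
) -> Tuple[List[float], List[float]]:
    """Spread sparse observations onto a grid with a Gaussian kernel.

    Returns (field, density): `density` is the summed kernel weight at each
    grid point and `field` is the kernel-weighted mean of the observations.
    The fixed Gaussian kernel is an assumption standing in for a learned one.
    """
    field: List[float] = []
    density: List[float] = []
    for g in grid:
        # Kernel weight of every observation as seen from grid point g.
        weights = [math.exp(-0.5 * ((g - p) / length_scale) ** 2) for p in obs_pos]
        rho = sum(weights)
        weighted_sum = sum(w * v for w, v in zip(weights, obs_val))
        density.append(rho)
        # eps guards against division by zero far from all observations.
        field.append(weighted_sum / (rho + eps))
    return field, density


# Two observations on an 11-point grid (illustrative values).
grid = [float(i) for i in range(11)]
field, density = setconv_1d(
    obs_pos=[2.0, 7.0], obs_val=[1.0, 3.0], grid=grid, length_scale=1.0
)
print([round(f, 3) for f in field])
print([round(d, 3) for d in density])
```

The density channel tells a downstream model how much to trust the gridded field at each point: it is close to 1 at an observation location and decays toward 0 in data-void regions, where the field value is little more than an extrapolation.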