Stable Velocity: A Variance Perspective on Flow Matching

Donglin Yang, Yongxing Zhang, Xin Yu, Liang Hou, Xin Tao, Pengfei Wan, Xiaojuan Qi, Renjie Liao

TL;DR

This work identifies a two-regime variance structure in flow matching targets: high variance near the prior and low variance near the data. It introduces Stable Velocity, a unified framework comprising StableVM (unbiased variance-reduced training), VA-REPA (adaptive, variance-aware supervision), and StableVS (finetuning-free sampling acceleration in the low-variance regime). The methods achieve consistent training speedups and over 2× sampling acceleration on large pretrained models without compromising sample quality, validated on ImageNet and multiple T2I/T2V systems. The key contribution is a principled, variance-centric design that improves both learning stability and inference efficiency in diffusion/flow-based generative models.

Abstract

While flow matching is elegant, its reliance on single-sample conditional velocities leads to high-variance training targets that destabilize optimization and slow convergence. By explicitly characterizing this variance, we identify 1) a high-variance regime near the prior, where optimization is challenging, and 2) a low-variance regime near the data distribution, where conditional and marginal velocities nearly coincide. Leveraging this insight, we propose Stable Velocity, a unified framework that improves both training and sampling. For training, we introduce Stable Velocity Matching (StableVM), an unbiased variance-reduction objective, along with Variance-Aware Representation Alignment (VA-REPA), which adaptively strengthens auxiliary supervision in the low-variance regime. For inference, we show that dynamics in the low-variance regime admit closed-form simplifications, enabling Stable Velocity Sampling (StableVS), a finetuning-free acceleration. Extensive experiments on ImageNet $256\times256$ and large pretrained text-to-image and text-to-video models, including SD3.5, Flux, Qwen-Image, and Wan2.2, demonstrate consistent improvements in training efficiency and more than $2\times$ faster sampling within the low-variance regime without degrading sample quality. Our code is available at https://github.com/linYDTHU/StableVelocity.
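
To make the "high-variance training targets" concrete, the standard decomposition of the conditional flow matching loss is sketched below. The notation follows the figure captions (${\bm{x}}_0$ for a data sample, ${\bm{v}}_t({\bm{x}}_t\mid{\bm{x}}_0)$ for the conditional velocity, ${\bm{v}}_t({\bm{x}}_t)$ for the marginal velocity). The network symbol ${\bm{v}}_\theta$ and the identification of the last term with ${\mathcal{V}}_{\text{CFM}}(t)$ are assumptions made here for illustration; the paper's exact definition may differ, for example in normalization.

$$
\mathbb{E}\,\big\|{\bm{v}}_\theta({\bm{x}}_t,t)-{\bm{v}}_t({\bm{x}}_t\mid{\bm{x}}_0)\big\|^2
=\mathbb{E}\,\big\|{\bm{v}}_\theta({\bm{x}}_t,t)-{\bm{v}}_t({\bm{x}}_t)\big\|^2
+\underbrace{\mathbb{E}\,\big\|{\bm{v}}_t({\bm{x}}_t\mid{\bm{x}}_0)-{\bm{v}}_t({\bm{x}}_t)\big\|^2}_{\text{assumed form of }{\mathcal{V}}_{\text{CFM}}(t)},
\qquad
{\bm{v}}_t({\bm{x}}_t)=\mathbb{E}\big[{\bm{v}}_t({\bm{x}}_t\mid{\bm{x}}_0)\,\big|\,{\bm{x}}_t\big].
$$

The expectations are over $({\bm{x}}_0,{\bm{x}}_t)$ at a fixed $t$. The cross term vanishes because ${\bm{v}}_t({\bm{x}}_t)$ is the conditional mean of the single-sample target. The last term does not depend on $\theta$, so it leaves the minimizer unchanged but adds noise to every stochastic gradient, which is the sense in which a single-sample target is high-variance.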

Paper Structure (36 sections, 12 theorems, 97 equations, 12 figures, 10 tables, 2 algorithms)

Key Result

Theorem 3.1

Figures (12)

  • Figure 1: Variance curves of ${\mathcal{V}}_{\text{CFM}}(t)$ with 15%–85% quantile bands. Evaluated on GMMs of varying dimensionality, CIFAR-10 images, and $256\times256$ ImageNet latents obtained by the Stable Diffusion VAE. The $y$-axis reports ${\mathcal{V}}_{\text{CFM}}(t)$ normalized by the square root of the data dimension. See the appendix on unconditional generation for details.
  • Figure 2: Illustration of CFM variance ${\mathcal{V}}_{\text{CFM}}(t)$. (a) The low-variance regime ($t\le \xi$), where the posterior $p_t({\bm{x}}_0\mid {\bm{x}}_t)$ is sharply concentrated and the conditional velocity ${\bm{v}}_t({\bm{x}}_t\mid {\bm{x}}_0)$ nearly coincides with the true velocity ${\bm{v}}_t({\bm{x}}_t)$, yielding ${\mathcal{V}}_{\text{CFM}}(t)\approx 0$. (b) The high-variance regime ($t>\xi$), where the posterior spreads over multiple reference samples, causing the conditional velocity to fluctuate and resulting in a large ${\mathcal{V}}_{\text{CFM}}(t)$.
  • Figure 3: Motivation for variance-aware representation alignment. (a) In the low-variance regime, the alignment loss remains consistently low on a pretrained model from REPA [yu2024representation], indicating a learnable and informative supervision signal. In contrast, in the high-variance regime, the loss stays high, reflecting the ill-posed nature of semantic recovery from noise. (b) Restricting representation alignment to the low-variance regime yields the best FID, while applying it only in the high-variance regime provides little improvement over the baseline. These results indicate that representation alignment should be activated adaptively rather than uniformly along the diffusion trajectory.
  • Figure 4: Ablation on VA-REPA weighting and StableVM bank capacity. Left: effect of different weighting schemes $w(t)$, showing that soft weightings outperform hard thresholding. Right: effect of memory bank capacity $K$, where $K=256$ already achieves near-optimal performance. All results are evaluated at 100k iterations. REPA baseline is shown as a dashed line.
  • Figure 5: Visual comparison across prompts on SD3.5 [esser2024scaling]. Results are generated using the Euler solver with 30 and 20 steps, and with StableVS replacing Euler in the low-variance regime, all under the same random seeds. Compared to the standard 20-step solver, StableVS yields outputs that more closely resemble the 30-step results. Zoom in for details. Additional qualitative comparisons are provided in the appendix on qualitative results.
  • ...and 7 more figures
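
The two-regime picture in Figures 1 and 2 can be reproduced qualitatively on a toy problem. The sketch below is not the authors' code: it assumes a linear path ${\bm{x}}_t=(1-t)\,{\bm{x}}_0+t\,\epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$ (so $t\to 0$ is the data end, matching the $t\le\xi$ convention in Figure 2), a one-dimensional empirical dataset, and the squared-error form of the variance. The function name `cfm_variance` and its parameters are hypothetical.

```python
import math
import random


def cfm_variance(data, t, n_samples=2000, seed=0):
    """Monte Carlo estimate of E[(v(x_t | x_0) - v(x_t))^2] for 1-D data.

    Assumption: x_t = (1 - t) * x_0 + t * eps with eps ~ N(0, 1), and the data
    distribution is uniform over the points in `data`. Requires 0 < t <= 1.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x0 = rng.choice(data)
        eps = rng.gauss(0.0, 1.0)
        xt = (1.0 - t) * x0 + t * eps

        # Posterior over data points: p(c | x_t) is proportional to
        # N(x_t; (1 - t) * c, t^2), computed with a stable softmax.
        logits = [-((xt - (1.0 - t) * c) ** 2) / (2.0 * t * t) for c in data]
        peak = max(logits)
        weights = [math.exp(l - peak) for l in logits]
        norm = sum(weights)

        # Conditional velocity for a candidate c is eps_c - c = (x_t - c) / t;
        # the marginal velocity is its posterior mean.
        v_marginal = sum(w * (xt - c) / t for w, c in zip(weights, data)) / norm
        v_conditional = (xt - x0) / t

        total += (v_conditional - v_marginal) ** 2
    return total / n_samples


if __name__ == "__main__":
    toy_data = [-2.0, 2.0]  # two well-separated modes
    for t in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"t={t:.1f}  variance={cfm_variance(toy_data, t):.4f}")
```

With two well-separated modes, the estimate is essentially zero for small $t$, where the posterior picks out a single data point, and grows as $t$ approaches the prior end, where both modes remain plausible explanations of ${\bm{x}}_t$.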

Theorems & Definitions (21)

  • Theorem 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Proposition 5.0
  • proof
  • Theorem 5.1
  • proof
  • Theorem 5.1
  • proof
  • Lemma 5.1
  • ...and 11 more