Terminal Velocity Matching

Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song

TL;DR

TVM reframes diffusion-based generation by enforcing terminal-velocity matching rather than initial-velocity flow, enabling high-quality one-/few-step sampling in a single training stage. It provides a theoretical upper bound on the 2-Wasserstein distance under Lipschitz conditions and develops practical architectural and algorithmic fixes to overcome non-Lipschitz behavior in Diffusion Transformers. Key innovations include a two-time conditioned neural network for joint displacement and velocity learning, a terminal velocity-based loss with proxy velocities, a Flash Attention JVP kernel for efficiency, and a scaled CFG-aware training strategy that stabilizes learning across guidance strengths. Empirically, TVM achieves state-of-the-art 1-NFE FID on ImageNet-256×256 and strong 4-NFE performance on both 256×256 and 512×512, with robust training under randomly sampled CFG and no curriculum required. Overall, TVM provides a principled, scalable pathway to high-quality one-/few-step generative modeling with distributional guarantees and practical efficiency.

Abstract

We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.
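The phrase "regularizes its behavior at its terminal time" can be unpacked with a short derivation. The sketch below borrows the notation of the Figure 2 caption (displacement ${\mathbf{f}}$, velocity ${\mathbf{u}}$, start time $t$, end time $s$); writing the displacement as an integral from $t$ to $s$ is an assumption about the sign and direction convention, not a formula quoted from the paper.

```latex
\begin{align}
\mathbf{x}_s &= \mathbf{x}_t + \mathbf{f}(\mathbf{x}_t, t, s), &
\mathbf{f}(\mathbf{x}_t, t, s) &= \int_t^s \mathbf{u}(\mathbf{x}_r, r)\,\mathrm{d}r, \\
\frac{\mathrm{d}}{\mathrm{d}s}\mathbf{f}(\mathbf{x}_t, t, s) &= \mathbf{u}(\mathbf{x}_s, s), &
s &\in [0, t], \\
\mathbf{f}(\mathbf{x}_t, t, t) &= \mathbf{0}, &
\frac{\mathrm{d}}{\mathrm{d}s}\mathbf{f}(\mathbf{x}_t, t, s)\Big|_{s=t} &= \mathbf{u}(\mathbf{x}_t, t).
\end{align}
```

The second line is the condition at the terminal time $s$: the derivative of the displacement with respect to its end time equals the true velocity at the point the jump lands on. The third line is the boundary case $s=t$, where the displacement vanishes and the condition reduces to ordinary Flow Matching.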

Paper Structure

This paper contains 33 sections, 3 theorems, 66 equations, 12 figures, 7 tables, 1 algorithm.

Key Result

Theorem 1

Given $t\in [0,1]$, let ${\mathbf{f}}_{t\rightarrow 0}^\theta \# p_t({\mathbf{x}}_t)$ be the distribution pushed forward from $p_t({\mathbf{x}}_t)$ via ${\mathbf{f}}_\theta({\mathbf{x}}_t, t, 0)$, and assume ${\mathbf{u}}_\theta(\cdot, s)$ is Lipschitz-continuous for all $s\in [0,t]$ with Lipschitz con[...] where $W_2(\cdot,\cdot)$ is the $2$-Wasserstein distance, $\lambda[\cdot]$ is a functional of $L(\cdot)$ [...]
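For readers unfamiliar with the quantity being bounded, here is a minimal sketch of the $2$-Wasserstein distance in the one case where it is trivial to compute: two one-dimensional empirical distributions with the same number of samples, where the optimal coupling simply pairs sorted samples. The function name `w2_empirical_1d` and the shifted-Gaussian example are illustrative assumptions; nothing here reproduces the theorem's bound or the paper's evaluation.

```python
import math
import random


def w2_empirical_1d(xs, ys):
    """2-Wasserstein distance between two equal-size 1D empirical distributions.

    In one dimension the optimal coupling pairs the sorted samples.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample counts")
    squared = [(a - b) ** 2 for a, b in zip(sorted(xs), sorted(ys))]
    return math.sqrt(sum(squared) / len(squared))


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Hypothetical "model" samples that are off from the data by a constant shift.
shifted = [x + 0.5 for x in data]

w2_shift = w2_empirical_1d(data, shifted)  # a pure shift by 0.5 is at distance 0.5
w2_same = w2_empirical_1d(data, data)      # identical samples are at distance 0
```

In high dimensions, such as images, no such closed form exists, which is one reason an upper bound expressed through a trainable loss is useful.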

Figures (12)

  • Figure 1: FID results on ImageNet-$256\times 256$.
  • Figure 2: An illustration of Terminal Velocity Matching. Left shows the ground-truth displacement map by integrating the true velocity. Right shows our model path directly jumping between points on the ground-truth path in one step. In our method, the one-step generation ${\mathbf{x}}_0$ from ${\mathbf{x}}_t$ coincides with ground-truth ${\mathbf{x}}_0$ if the terminal velocity of model $\frac{\mathrm{d}}{\mathrm{d}s}{\mathbf{f}}({\mathbf{x}}_t,t,s)$ coincides with ground-truth velocity ${\mathbf{u}}({\mathbf{x}}_s,s)$ for all $s\in[0,t]$ along the true flow path (see Eq. \ref{eq:tve}). The terminal velocity condition is jointly satisfied with the boundary case when model displacement is $0$, where matching $\frac{\mathrm{d}}{\mathrm{d}s}{\mathbf{f}}({\mathbf{x}}_t,t,s)\vert_{s=t}$ with ${\mathbf{u}}({\mathbf{x}}_t,t)$ reduces to Flow Matching.
  • Figure 3: PyTorch-style sampling code.
  • Figure 4: Activation norm of last time embedding layer. Same trends follow for all other layers.
  • Figure 5: Smoother terminal velocity error with $\beta_2=0.95$.
  • ...and 7 more figures
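The one-step jump described in the Figure 2 caption, and the few-step sampling that Figure 3 refers to, can be illustrated on a toy problem. The sketch below is not the paper's sampling code: it replaces the trained two-time network with a closed-form displacement for a hypothetical scalar ODE $\mathrm{d}x/\mathrm{d}s = A\,x$, and the names `velocity`, `displacement`, `sample`, `terminal_velocity`, and the constant `A` are all assumptions made for illustration.

```python
import math

A = 0.7  # hypothetical rate of the toy linear velocity field u(x, s) = A * x


def velocity(x, s):
    """Toy ground-truth velocity u(x, s); time-independent in this example."""
    return A * x


def displacement(x_t, t, s):
    """Closed-form displacement f(x_t, t, s) = x_s - x_t for dx/ds = A * x.

    Stands in for a trained two-time conditioned network.
    """
    return x_t * (math.exp(A * (s - t)) - 1.0)


def sample(x_start, times):
    """Jump along a decreasing time grid, one function evaluation per jump."""
    x = x_start
    for t, s in zip(times[:-1], times[1:]):
        x = x + displacement(x, t, s)
    return x


def terminal_velocity(x_t, t, s, eps=1e-6):
    """Central finite difference of f(x_t, t, s) with respect to the end time s."""
    return (displacement(x_t, t, s + eps) - displacement(x_t, t, s - eps)) / (2 * eps)


x1 = 2.0
one_step = sample(x1, [1.0, 0.0])                     # 1 function evaluation
four_step = sample(x1, [1.0, 0.75, 0.5, 0.25, 0.0])   # 4 function evaluations

# Terminal velocity condition at an intermediate end time s:
# d/ds f(x_t, t, s) should match u(x_s, s) at the landing point x_s.
s = 0.3
x_s = x1 + displacement(x1, 1.0, s)
gap = abs(terminal_velocity(x1, 1.0, s) - velocity(x_s, s))
```

Because the toy displacement is exact, the one-step and four-step results agree and `gap` is at finite-difference precision. With a learned network the two would differ, and extra steps trade compute for accuracy.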

Theorems & Definitions (6)

  • Theorem 1: Connection to the $2$-Wasserstein distance
  • Lemma 1
  • proof
  • Theorem 1: Connection to the $2$-Wasserstein distance
  • proof
  • proof