Terminal Velocity Matching
Linqi Zhou, Mathias Parger, Ayaan Haque, Jiaming Song
TL;DR
TVM reframes diffusion-based generation by enforcing terminal-velocity matching rather than initial-velocity flow, enabling high-quality one-/few-step sampling in a single training stage. It provides a theoretical upper bound on the 2-Wasserstein distance under Lipschitz conditions and develops practical architectural and algorithmic fixes to overcome non-Lipschitz behavior in Diffusion Transformers. Key innovations include a two-time conditioned neural network for joint displacement and velocity learning, a terminal velocity-based loss with proxy velocities, a Flash Attention JVP kernel for efficiency, and a scaled CFG-aware training strategy that stabilizes learning across guidance strengths. Empirically, TVM achieves state-of-the-art 1-NFE FID on ImageNet-256×256 and strong 4-NFE performance on both 256×256 and 512×512, with robust training under randomly sampled CFG and no curriculum required. Overall, TVM provides a principled, scalable pathway to high-quality one-/few-step generative modeling with distributional guarantees and practical efficiency.
Abstract
We propose Terminal Velocity Matching (TVM), a generalization of flow matching that enables high-fidelity one- and few-step generative modeling. TVM models the transition between any two diffusion timesteps and regularizes its behavior at its terminal time rather than at the initial time. We prove that TVM provides an upper bound on the $2$-Wasserstein distance between data and model distributions when the model is Lipschitz continuous. However, since Diffusion Transformers lack this property, we introduce minimal architectural changes that achieve stable, single-stage training. To make TVM efficient in practice, we develop a fused attention kernel that supports backward passes on Jacobian-Vector Products, which scale well with transformer architectures. On ImageNet-256x256, TVM achieves 3.29 FID with a single function evaluation (NFE) and 1.99 FID with 4 NFEs. It similarly achieves 4.32 1-NFE FID and 2.94 4-NFE FID on ImageNet-512x512, representing state-of-the-art performance for one/few-step models from scratch.
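To make the idea of regularizing a transition "at its terminal time" concrete, here is a minimal toy sketch in plain Python. It is not the paper's model, loss, or training procedure. It assumes a 1-D problem whose data is a single point `MU`, a straight-line path $x_t = (1 - t)\,\mu + t\,\epsilon$ with $t = 1$ as noise and $t = 0$ as data (the time convention is an assumption), and a closed-form two-time displacement $f(x_t, t, s) = x_s - x_t$. All function names are hypothetical. The sketch checks that the derivative of the displacement with respect to the terminal time $s$ equals the velocity at the landing point $x_s$, and that chaining jumps gives the same sample with 1 or 4 function evaluations when the displacement is exact.

```python
# Toy 1-D illustration (assumed setup, not the paper's model or loss).
MU = 2.0  # hypothetical point-mass "data" location


def velocity(x, t):
    """Velocity of the straight path x_t = (1 - t) * MU + t * noise, valid for t > 0."""
    return (x - MU) / t


def displacement(x_t, t, s):
    """Exact two-time displacement f(x_t, t, s) = x_s - x_t for this toy flow."""
    return (s / t - 1.0) * (x_t - MU)


def terminal_velocity(x_t, t, s, eps=1e-6):
    """Central finite difference of the displacement with respect to the terminal time s.

    A real implementation would more likely use a Jacobian-vector product here;
    finite differences keep this sketch dependency-free.
    """
    return (displacement(x_t, t, s + eps) - displacement(x_t, t, s - eps)) / (2.0 * eps)


def sample(x_start, times):
    """Chain displacement jumps along a decreasing time grid, one jump per function evaluation."""
    x = x_start
    for t, s in zip(times[:-1], times[1:]):
        x = x + displacement(x, t, s)
    return x


noise = -0.7
t, s = 0.8, 0.3
x_t = (1.0 - t) * MU + t * noise          # a point on the path at time t
x_s = x_t + displacement(x_t, t, s)       # where the jump lands at time s

# Terminal-velocity condition: d/ds f(x_t, t, s) should match velocity(x_s, s).
residual = terminal_velocity(x_t, t, s) - velocity(x_s, s)

one_step = sample(noise, [1.0, 0.0])                      # 1 function evaluation
four_step = sample(noise, [1.0, 0.75, 0.5, 0.25, 0.0])    # 4 function evaluations
```

In this toy case the displacement is known exactly, so `residual` is zero up to finite-difference error and both samplers land on `MU`. With a learned network the condition would only hold approximately, which is why the abstract speaks of regularizing the transition at its terminal time.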
