Understanding, Accelerating, and Improving MeanFlow Training
Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer
TL;DR
The paper investigates the training dynamics of MeanFlow, a diffusion/flow-based model that jointly learns an instantaneous velocity $v$ and an average velocity $u$. It shows that $v$ must form early for $u$ to be learned successfully, and that supervision with small temporal gaps $\Delta t$ aids the formation of $v$ while large gaps destabilize it. A task-affinity analysis further suggests that training should start with small-gap supervision, which in turn enables the large-gap $u$ learning needed for one-step generation. Based on these findings, the authors propose a simple training scheme with two parts: (1) accelerating $v$-learning through timestep sampling and loss weighting, and (2) progressively reweighting $\mathcal{L}_u$ to shift the focus from short- to long-interval averages. The scheme is designed to work with MeanFlow's existing components. Empirically, the enhanced training yields faster convergence and stronger few-step generation, reducing the 1-NFE ImageNet 256×256 FID from $3.43$ to $2.87$ (MeanFlow-XL) and matching the baseline's performance with $2.5\times$ shorter training or with a smaller backbone, which points to practical potential for real-time, few-step generation.
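To make the idea of progressively shifting $\mathcal{L}_u$ from short to long intervals concrete, here is a minimal sketch of one possible gap-dependent weight. It is not the authors' schedule: the function name `gap_weight`, the linear cutoff, and the decay constant `tau` are all hypothetical choices made only for illustration, and gaps are assumed to be normalized to $[0, 1]$.

```python
import math


def gap_weight(delta_t: float, progress: float, tau: float = 0.1) -> float:
    """Hypothetical weight on the average-velocity loss for a gap delta_t.

    delta_t:  temporal gap of the training sample, assumed to lie in [0, 1].
    progress: fraction of training completed, in [0, 1].
    tau:      assumed decay scale for gaps beyond the current cutoff.

    Gaps up to the cutoff get full weight; larger gaps are damped
    exponentially. The cutoff grows linearly with training progress, so
    emphasis moves from short to long intervals over time.
    """
    if not 0.0 <= delta_t <= 1.0:
        raise ValueError("delta_t must be in [0, 1]")
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    cutoff = progress  # assumption: cutoff tracks progress linearly
    if delta_t <= cutoff:
        return 1.0
    return math.exp(-(delta_t - cutoff) / tau)


if __name__ == "__main__":
    # Print the weight of a small, medium, and full-length gap at three
    # stages of training.
    for progress in (0.0, 0.5, 1.0):
        weights = [round(gap_weight(d, progress), 4) for d in (0.0, 0.5, 1.0)]
        print(progress, weights)
```

Under these assumptions, the full-length gap used for one-step generation is nearly ignored at the start of training and receives full weight only at the end; any monotone schedule with the same qualitative shape would illustrate the point equally well.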
Abstract
MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) a well-established instantaneous velocity is a prerequisite for learning the average velocity; (ii) learning of the instantaneous velocity benefits from the average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of the instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: with the same DiT-XL backbone, our method reaches an FID of 2.87 on 1-NFE ImageNet 256×256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5× shorter training time, or with a smaller DiT-L backbone.
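For readers unfamiliar with the two velocities, the relations below recall how they are usually connected. The notation is an assumption borrowed from the original MeanFlow formulation rather than from the text above: $z_t$ is the state at time $t$, $r \le t$ is the start of the interval, $\Delta t = t - r$ is the temporal gap, and $t = 1$ corresponds to noise.

```latex
% Average velocity over [r, t] as the time average of the instantaneous velocity
u(z_t, r, t) = \frac{1}{t - r} \int_{r}^{t} v(z_\tau, \tau)\, d\tau

% Small-gap limit: the average velocity reduces to the instantaneous one
\lim_{r \to t} u(z_t, r, t) = v(z_t, t)

% MeanFlow identity, obtained by differentiating (t - r)\, u(z_t, r, t) with respect to t
u(z_t, r, t) = v(z_t, t) - (t - r)\, \frac{d}{dt} u(z_t, r, t)

% One-step (1-NFE) generation uses the largest gap, r = 0 and t = 1
z_0 = z_1 - u(z_1, 0, 1)
```

Read this way, small-gap supervision stays close to the instantaneous-velocity target, while the single-step sampler depends entirely on the largest gap, which is consistent with the ordering of emphasis described in the abstract.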
