Understanding, Accelerating, and Improving MeanFlow Training

Jin-Young Kim, Hyojun Go, Lea Bogensperger, Julius Erbach, Nikolai Kalischek, Federico Tombari, Konrad Schindler, Dominik Narnhofer

TL;DR

The paper investigates the training dynamics of MeanFlow, a diffusion/flow-based model that jointly learns the instantaneous velocity $v$ and the average velocity $u$. It shows that forming $v$ early is crucial for learning $u$ successfully, and that supervision with small temporal gaps $\Delta t$ aids the formation of $v$ while large gaps destabilize it; a task-affinity analysis further suggests starting with small-gap supervision to enable the large-gap $u$-learning needed for one-step generation. The authors propose a simple training scheme that first accelerates $v$-learning through timestep sampling and loss weighting, then progressively reweights $\mathcal{L}_u$ to shift the focus from short- to long-interval averages, while remaining compatible with MeanFlow's existing components. Empirically, the enhanced training yields faster convergence and stronger few-step generation: it reduces the 1-NFE ImageNet 256×256 FID from $3.43$ to $2.87$ (MeanFlow-XL) and matches the baseline's performance with $2.5\times$ fewer epochs or with a smaller backbone, which indicates practical potential for real-time, few-step generation.

Abstract

MeanFlow promises high-quality generative modeling in few steps, by jointly learning instantaneous and average velocity fields. Yet, the underlying training dynamics remain unclear. We analyze the interaction between the two velocities and find: (i) well-established instantaneous velocity is a prerequisite for learning average velocity; (ii) learning of instantaneous velocity benefits from average velocity when the temporal gap is small, but degrades as the gap increases; and (iii) task-affinity analysis indicates that smooth learning of large-gap average velocities, essential for one-step generation, depends on the prior formation of accurate instantaneous and small-gap average velocities. Guided by these observations, we design an effective training scheme that accelerates the formation of instantaneous velocity, then shifts emphasis from short- to long-interval average velocity. Our enhanced MeanFlow training yields faster convergence and significantly better few-step generation: With the same DiT-XL backbone, our method reaches an impressive FID of 2.87 on 1-NFE ImageNet 256x256, compared to 3.43 for the conventional MeanFlow baseline. Alternatively, our method matches the performance of the MeanFlow baseline with 2.5x shorter training time, or with a smaller DiT-L backbone.
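
To make the role of the temporal gap concrete, the following minimal sketch compares the average velocity over an interval $[t - \Delta t, t]$ with the instantaneous velocity at $t$. It is not the paper's code: the 1-D trajectory $z(t) = \sin(t)$ and all function names are assumptions chosen for illustration. It only relies on the definition of average velocity as displacement divided by interval length, and shows that the two quantities approach each other as $\Delta t$ shrinks, which is one way to read the finding that small-gap supervision is closely aligned with $v$-learning.

```python
import math


def position(t):
    # Toy 1-D trajectory (an assumption for illustration): z(t) = sin(t).
    return math.sin(t)


def instantaneous_velocity(t):
    # v(t) = dz/dt for the toy trajectory above.
    return math.cos(t)


def average_velocity(r, t):
    # u(r, t) = (z(t) - z(r)) / (t - r): displacement over the interval
    # divided by the interval length.
    return (position(t) - position(r)) / (t - r)


def gap_error(t, delta_t):
    # Absolute difference between the average velocity over
    # [t - delta_t, t] and the instantaneous velocity at t.
    return abs(average_velocity(t - delta_t, t) - instantaneous_velocity(t))


t = 1.0
errors = {delta_t: gap_error(t, delta_t) for delta_t in (1.0, 0.1, 0.01)}
for delta_t, err in errors.items():
    print(f"delta_t={delta_t:<5} |u - v| = {err:.5f}")
```

In this toy setting the printed gap between $u$ and $v$ shrinks monotonically as $\Delta t$ decreases; for large $\Delta t$ the two targets differ substantially, so the corresponding learning tasks are less alike.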

Paper Structure

This paper contains 48 sections, 10 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Our novel, enhanced training strategy reaches the performance of MeanFlow-XL in $\approx$2.5$\times$ fewer training epochs, and converges to a final model with superior performance ($\approx$16% lower FID).
  • Figure 2: $v$-learning facilitates $u$-learning. (Top) 1-NFE FID during $u$-finetuning according to $v$-pretraining epochs. (Bottom) 1-NFE FID under a fixed 80-epoch budget with varying allocation between $v$-pretraining and $u$-finetuning. Both settings show that investing in $v$-learning improves $u$-learning quality.
  • Figure 3: Corruption in $v$-learning disrupts $u$-learning. 1-NFE FID when training with $\mathcal{L}_{\mathrm{MF}}$ while injecting Gaussian noise scaled by $k\!\cdot\!\|v_t(z_t|\epsilon)\|$ into the target velocity of $\mathcal{L}_v$. Even small noise ($k = 0.03$) disrupts $v$-learning and severely degrades $u$-learning performance compared to clean training ($k=0$).
  • Figure 4: Impact of $\Delta t$ of $u$-learning on $v$-learning. 32-NFE FID after 40 epochs of $u$-finetuning across different $\Delta t$ ranges, starting from either random initialization (blue) or a $v$-pretrained model (orange, 40 epochs). Small $\Delta t$ enables constructing and improving $v$, while large $\Delta t$ degrades the pretrained $v$. The green line denotes the performance of the $v$-pretrained model.
  • Figure 5: Task affinity between $v$- and $u$-learning across $\Delta t$ ranges. Small-$\Delta t$ $u$-pretraining achieves higher affinity for large $\Delta t$ compared to $v$-pretraining, providing a better regime for learning large-gap average velocity with instantaneous velocity.
  • ...and 5 more figures
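
The corruption experiment described in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the authors' implementation: the caption only states that Gaussian noise scaled by $k \cdot \|v_t(z_t|\epsilon)\|$ is injected into the target velocity of $\mathcal{L}_v$, so the per-element standard-normal noise, the plain-list representation of the target, and the function names `l2_norm` and `corrupt_velocity_target` are all assumptions.

```python
import math
import random


def l2_norm(vec):
    # Euclidean norm of a velocity vector given as a list of floats.
    return math.sqrt(sum(x * x for x in vec))


def corrupt_velocity_target(v_target, k, rng):
    # Add Gaussian noise whose scale is k times the norm of the clean target.
    # Assumption: noise is drawn independently per element from N(0, 1)
    # and then multiplied by k * ||v_target||.
    scale = k * l2_norm(v_target)
    return [x + scale * rng.gauss(0.0, 1.0) for x in v_target]


rng = random.Random(0)      # fixed seed so the sketch is reproducible
clean_target = [3.0, 4.0]   # hypothetical target velocity with norm 5

for k in (0.0, 0.03):
    noisy_target = corrupt_velocity_target(clean_target, k, rng)
    deviation = l2_norm([a - b for a, b in zip(noisy_target, clean_target)])
    print(f"k={k}: deviation from clean target = {deviation:.4f}")
```

With $k = 0$ the target is returned unchanged, which corresponds to the clean training baseline in the caption; with $k = 0.03$ the perturbation is small relative to the target's norm, which is the regime the caption reports as already harmful to $u$-learning.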