Momentum Guidance: Plug-and-Play Guidance for Flow Models

Runlong Liao, Jian Yu, Baiyu Su, Chi Zhang, Lizhang Chen, Qiang Liu

TL;DR

Momentum Guidance (MG) is a new dimension of guidance that leverages the ODE trajectory itself. It matches the effect of standard guidance without extra computation and can further improve quality when combined with CFG.

Abstract

Flow-based generative models have become a strong framework for high-quality generative modeling, yet pretrained models are rarely used in their vanilla conditional form: conditional samples without guidance often appear diffuse and lack fine-grained detail due to the smoothing effects of neural networks. Existing guidance techniques such as classifier-free guidance (CFG) improve fidelity but double the inference cost and typically reduce sample diversity. We introduce Momentum Guidance (MG), a new dimension of guidance that leverages the ODE trajectory itself. MG extrapolates the current velocity using an exponential moving average of past velocities and preserves the standard one-evaluation-per-step cost. It matches the effect of standard guidance without extra computation and can further improve quality when combined with CFG. Experiments demonstrate MG's effectiveness across benchmarks. Specifically, on ImageNet-256, MG achieves average improvements in FID of 36.68% without CFG and 25.52% with CFG across various sampling settings, attaining an FID of 1.597 at 64 sampling steps. Evaluations on large flow-based models like Stable Diffusion 3 and FLUX.1-dev further confirm consistent quality enhancements across standard metrics.
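The abstract describes MG as extrapolating the current velocity using an exponential moving average (EMA) of past velocities, at one model evaluation per step. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a plain Euler sampler running from $t=0$ (noise) to $t=1$ (data), a guided velocity of the form $\boldsymbol v_t + \alpha(\boldsymbol v_t - \boldsymbol m_t)$, and an EMA update $\boldsymbol m \leftarrow \beta \boldsymbol m + (1-\beta)\boldsymbol v_t$. The function name, the `velocity_fn` interface, the default values, and the handling of the first step are all hypothetical.

```python
import numpy as np


def sample_with_momentum_guidance(velocity_fn, x0, num_steps=16, alpha=0.5, beta=0.8):
    """Euler sampler with an EMA-based velocity extrapolation (illustrative sketch).

    velocity_fn(x, t) -> array: hypothetical interface for the pretrained velocity model.
    alpha: extrapolation strength (alpha = 0 recovers the unguided sampler).
    beta:  EMA decay for the running average of past velocities (assumed form).
    """
    x = np.asarray(x0, dtype=float)
    dt = 1.0 / num_steps
    m = None  # EMA of past velocities; undefined before the first step

    for i in range(num_steps):
        t = i * dt
        v = np.asarray(velocity_fn(x, t), dtype=float)  # the only model call in this step

        if m is None:
            # Assumption: with no history yet, take an unguided first step.
            v_guided = v
            m = v.copy()
        else:
            # Extrapolate the current velocity away from the average of past ones.
            v_guided = v + alpha * (v - m)
            # Fold the current velocity into the running average.
            m = beta * m + (1.0 - beta) * v

        x = x + dt * v_guided

    return x
```

In this sketch the correction $\alpha(\boldsymbol v_t - \boldsymbol m_t)$ vanishes whenever the velocity is constant along the trajectory, so it only acts where the trajectory bends. The loop calls `velocity_fn` exactly `num_steps` times, whereas CFG needs a second (unconditional) evaluation at every step.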

Paper Structure (32 sections, 18 equations, 17 figures, 5 tables, 2 algorithms)

Figures (17)

  • Figure 1: Comparison of Momentum Guidance (MG) with baseline class-conditioned sampling without CFG on SD3 (Esser et al., 2024). Unlike CFG, which requires an additional forward pass through an unconditional branch at every sampling step, MG introduces no extra model evaluations. The generated images show consistently improved quality and coherence, with finer local details (e.g., angel’s wings, intricate coral structures), fewer artifacts (e.g., reduced blur in motorcycle reflections), richer visual textures and color variation (e.g., waterfall and volcanic scenes), and more stable object geometry (e.g., clearer facial contours and cleaner edges). Overall, MG yields sharper, cleaner, and more visually consistent results.
  • Figure 2: Visualization of momentum guidance along the sampling trajectory. From left to right, flow time increases and the data estimates transition from blurriness to a clean image. The first row shows the baseline estimates $\hat{\boldsymbol X}_{1\mid t}^{\text{Base}}$, while the second row displays the momentum-guided estimates $\hat{\boldsymbol X}_{1\mid t}^{\text{MG}}$, which exhibit sharper structure, richer color contrast, and more coherent fine-grained details throughout the flow process. The third row visualizes the extrapolation term $(\boldsymbol v_t - \boldsymbol m_t)$, revealing how momentum introduces a corrective direction that emphasizes coarse contours at early times and amplifies high-frequency details, such as petal edges and dew droplets, near the end of the trajectory. Overall, momentum guidance produces a clearer evolution toward the final image.
  • Figure 3: Ablation over CFG scale and NFE for ImageNet-256. Top row: FID as a function of classifier-free guidance (CFG) scale under three sampling budgets $( \textit{NFE}\!=\!16,32,64 )$. Solid curves denote the best Momentum Guidance configuration for each $(\text{CFG},\text{NFE})$ pair, while the shaded bands show the performance of other MG hyperparameter settings $(\alpha,\beta)$. Across all combinations, MG consistently lowers FID compared with vanilla CFG, with especially large improvements at low NFE (e.g., NFE = 16), where both the best curves and the shaded variants exhibit sizable reductions. Bottom row: Precision–Recall (PR) trade-off curves plotted as Pareto fronts parameterized by recall (RC). Although increasing CFG generally reduces recall for the baseline, MG shifts the curve upward and to the right: at low CFG, MG improves precision while matching or even increasing recall, and at higher CFG, MG mitigates the collapse in recall that typically accompanies aggressive guidance. Overall, MG delivers a better PR–RC Pareto front across all NFE settings.
  • Figure 4: Qualitative comparison across varying CFG scales on SD3. Momentum Guidance consistently improves the generated images across a wide range of CFG scales. While the baseline exhibits fluctuations in sharpness, texture quality, and structural stability as CFG increases, our method maintains crisp details, balanced contrast, and robust scene fidelity. This demonstrates that our approach delivers reliable, high-quality outputs under both strong and weak guidance settings.
  • Figure 5: FID-10K over Momentum Guidance hyperparameters $(\alpha,\beta)$ at $\text{CFG}=1.2$. Across nearly all settings of $\alpha$ and $\beta$, Momentum Guidance improves FID relative to the $\alpha=0$ baseline, demonstrating the robustness of our method.
  • ...and 12 more figures
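
The data estimates shown in Figure 2 can be read as one-step extrapolations of the current state along the velocity. The block below is a sketch only, under two assumptions that this excerpt does not state: a linear (rectified-flow) path $\boldsymbol X_t = (1-t)\boldsymbol X_0 + t\boldsymbol X_1$ with data at $t=1$, and a guided velocity of the form $\tilde{\boldsymbol v}_t = \boldsymbol v_t + \alpha(\boldsymbol v_t - \boldsymbol m_t)$, where $\tilde{\boldsymbol v}_t$ is notation introduced here for illustration.

```latex
\begin{aligned}
\hat{\boldsymbol X}_{1\mid t}^{\text{Base}} &= \boldsymbol X_t + (1-t)\,\boldsymbol v_t, \\
\hat{\boldsymbol X}_{1\mid t}^{\text{MG}}   &= \boldsymbol X_t + (1-t)\,\tilde{\boldsymbol v}_t
  = \hat{\boldsymbol X}_{1\mid t}^{\text{Base}} + \alpha\,(1-t)\,(\boldsymbol v_t - \boldsymbol m_t).
\end{aligned}
```

Under these assumptions, the two rows of estimates in Figure 2 differ exactly by a scaled copy of the extrapolation term $(\boldsymbol v_t - \boldsymbol m_t)$ visualized in its third row.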