MAMBO-G: Magnitude-Aware Mitigation for Boosted Guidance

Shangwen Zhu, Qianyu Peng, Zhilei Shu, Yuting Hu, Zhantao Yang, Han Zhang, Zhao Pu, Andy Zheng, Xinyu Cui, Jian Zhao, Ruili Feng, Fan Cheng

TL;DR

This work tackles instability and high compute costs in classifier-free guidance (CFG) for large-scale diffusion, particularly under zero-SNR conditions and high latent dimensionality. It proposes MAMBO-G, a training-free, magnitude-aware adaptive damping strategy that uses the update-to-prediction magnitude ratio $r_t$, computed from the conditional and unconditional velocity predictions, to modulate guidance strength via $w(r_t)=1+(w_{\max}-1)\exp(-\alpha r_t)$, stabilizing early steps and enabling faster convergence. The approach is validated on text-to-image and text-to-video tasks, with speedups of up to 3x on SD3.5, 4x on Lumina, and 2x on Wan2.1, while preserving, and often improving, sample fidelity as measured by ImageReward, CLIPScore, and VBench; it is also compatible with, and complementary to, other guidance strategies. The findings highlight magnitude-aware control as a robust, efficient component for scaling diffusion-based synthesis, with practical impact on high-resolution image and video generation in production-like pipelines.
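As a minimal sketch of the damping rule $w(r_t)=1+(w_{\max}-1)\exp(-\alpha r_t)$ stated above, the snippet below evaluates the guidance scale for a given ratio. The function name and the sample values of `w_max` and `alpha` are illustrative assumptions, not settings taken from the paper.

```python
import math


def damped_guidance_scale(r_t: float, w_max: float, alpha: float) -> float:
    """Evaluate w(r_t) = 1 + (w_max - 1) * exp(-alpha * r_t).

    r_t   : non-negative magnitude ratio at the current sampling step
    w_max : guidance scale used when the ratio is zero (hypothetical value below)
    alpha : decay rate of the damping (hypothetical value below)
    """
    if r_t < 0:
        raise ValueError("r_t is a magnitude ratio and must be non-negative")
    return 1.0 + (w_max - 1.0) * math.exp(-alpha * r_t)


if __name__ == "__main__":
    # Illustrative values only: a larger ratio gives a smaller guidance scale.
    for r in (0.0, 0.5, 1.0, 2.0):
        print(r, damped_guidance_scale(r, w_max=4.0, alpha=1.0))
```

With this form, the scale equals `w_max` when the ratio is zero and decays monotonically toward 1 as the ratio grows, which is how large early-step ratios end up with weaker guidance.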

Abstract

High-fidelity text-to-image and text-to-video generation typically relies on Classifier-Free Guidance (CFG), but achieving optimal results often demands computationally expensive sampling schedules. In this work, we propose MAMBO-G, a training-free acceleration framework that significantly reduces computational cost by dynamically optimizing guidance magnitudes. We observe that standard CFG schedules are inefficient, applying disproportionately large updates in early steps that hinder convergence speed. MAMBO-G mitigates this by modulating the guidance scale based on the update-to-prediction magnitude ratio, effectively stabilizing the trajectory and enabling rapid convergence. This efficiency is particularly vital for resource-intensive tasks like video generation. Our method serves as a universal plug-and-play accelerator, achieving up to 3x speedup on Stable Diffusion v3.5 (SD3.5) and 4x on Lumina. Most notably, MAMBO-G accelerates the 14B-parameter Wan2.1 video model by 2x while preserving visual fidelity, offering a practical solution for efficient large-scale video synthesis. Our implementation follows a mainstream open-source diffusion framework and is plug-and-play with existing pipelines.
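To make the idea of ratio-modulated guidance concrete, here is a hedged, dependency-free sketch of one guided prediction. It assumes the standard CFG combination `v_uncond + w * (v_cond - v_uncond)` and assumes the ratio is the L2 norm of the guidance update divided by the L2 norm of the unconditional prediction; the abstract does not spell out either definition, and the function name, `eps`, and default values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def l2_norm(x: Sequence[float]) -> float:
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(v * v for v in x))


def damped_cfg_combine(
    v_cond: Sequence[float],
    v_uncond: Sequence[float],
    w_max: float = 4.0,   # hypothetical maximum guidance scale
    alpha: float = 1.0,   # hypothetical decay rate
    eps: float = 1e-8,    # guards against division by zero
) -> Tuple[List[float], float, float]:
    """Return (guided prediction, guidance scale used, ratio r_t)."""
    # Guidance update: difference between conditional and unconditional predictions.
    delta = [c - u for c, u in zip(v_cond, v_uncond)]
    # ASSUMPTION: ratio = ||update|| / ||unconditional prediction||.
    r_t = l2_norm(delta) / (l2_norm(v_uncond) + eps)
    # Damped scale: equals w_max at r_t = 0 and decays toward 1 for large r_t.
    w = 1.0 + (w_max - 1.0) * math.exp(-alpha * r_t)
    # Standard CFG combination with the damped scale.
    guided = [u + w * d for u, d in zip(v_uncond, delta)]
    return guided, w, r_t


if __name__ == "__main__":
    guided, w, r_t = damped_cfg_combine([1.0, 1.0], [1.0, 0.0])
    print(guided, w, r_t)
```

In a real pipeline the vectors would be the model's latent-shaped tensors, and the norms would typically be taken per sample so that the scale adapts at the instance level.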

Paper Structure

This paper contains 26 sections, 6 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Superior efficiency of MAMBO-G: Our method achieves quality comparable to $60$-NFE ($30$-step) CFG image generation with only $20$ NFE ($10$ steps), a $3.0\times$ speedup over standard CFG sampling (guidance scale = 4.0). The examples shown are not cherry-picked, and their seeds are marked. The specific prompts can be found in the prompts appendix.
  • Figure 2: Visual comparison across resolutions: These are 10-step samples from Qwen-Image (Wu et al., 2025). With the original CFG, sampling becomes more unstable as the resolution increases. MAMBO-G stabilizes the sampling process by adjusting the guidance scale at the instance level, yielding significant improvements.
  • Figure 3: Collapse of Guidance Directions at Initialization. We analyze the cosine similarity of guidance updates ($\Delta \mathbf{v}$) across different noise seeds for a fixed prompt. At $t=1.0$, similarity $\approx 1.0$, indicating a generic direction independent of specific noise. As $t$ decreases ($t < 0.8$), updates rapidly diverge and become instance-specific. This observation motivates MAMBO-G to dampen the guidance scale specifically in this high-similarity, generic regime.
  • Figure 4: Dynamics of the ratio during sampling. We monitor the evolution of the relative guidance strength $r_t$ throughout the sampling process. The ratio starts at a high peak, reflecting a strong conditional influence that can lead to early-stage instability if left unregulated. It then rapidly decays and stabilizes within a few sampling steps. This empirical trend identifies the initial phase as the critical regime where a guidance damping mechanism is most needed.
  • Figure 5: Probability density of ImageReward scores across different ratio groups. We present KDE plots comparing ImageReward scores for low-ratio versus high-ratio samples at the first sampling step. Samples with lower initial ratios score significantly higher, supporting the ratio as a robust indicator of sampling stability.
  • ...and 5 more figures