MAMBO-G: Magnitude-Aware Mitigation for Boosted Guidance
Shangwen Zhu, Qianyu Peng, Zhilei Shu, Yuting Hu, Zhantao Yang, Han Zhang, Zhao Pu, Andy Zheng, Xinyu Cui, Jian Zhao, Ruili Feng, Fan Cheng
TL;DR
This work tackles instability and high compute cost in classifier-free guidance (CFG) for large-scale diffusion models, particularly under zero-SNR conditions and high latent dimensionality. It proposes MAMBO-G, a training-free, magnitude-aware adaptive damping strategy that uses the update-to-prediction magnitude ratio $r_t$ to modulate guidance strength via $w(r_t)=1+(w_{\max}-1)\exp(-\alpha r_t)$. The guidance scale therefore starts at $w_{\max}$ when $r_t=0$ and decays toward 1 as $r_t$ grows (for $\alpha>0$), which stabilizes early steps and enables faster convergence. The approach is validated on text-to-image and text-to-video tasks, with speedups of up to 3x on SD3.5, 4x on Lumina, and 2x on Wan2.1, while preserving, and often improving, sample fidelity as measured by ImageReward, CLIPScore, and VBench; it is also compatible with, and complementary to, other guidance strategies. The findings highlight magnitude-aware control as a robust, efficient component for scaling diffusion-based synthesis, with practical impact on high-resolution image and video generation in production-like pipelines.
Abstract
High-fidelity text-to-image and text-to-video generation typically relies on Classifier-Free Guidance (CFG), but achieving optimal results often demands computationally expensive sampling schedules. In this work, we propose MAMBO-G, a training-free acceleration framework that significantly reduces computational cost by dynamically optimizing guidance magnitudes. We observe that standard CFG schedules are inefficient, applying disproportionately large updates in early steps that hinder convergence speed. MAMBO-G mitigates this by modulating the guidance scale based on the update-to-prediction magnitude ratio, effectively stabilizing the trajectory and enabling rapid convergence. This efficiency is particularly vital for resource-intensive tasks like video generation. Our method serves as a universal plug-and-play accelerator, achieving up to 3x speedup on Stable Diffusion v3.5 (SD3.5) and 4x on Lumina. Most notably, MAMBO-G accelerates the 14B-parameter Wan2.1 video model by 2x while preserving visual fidelity, offering a practical solution for efficient large-scale video synthesis. Our implementation follows a mainstream open-source diffusion framework and is plug-and-play with existing pipelines.
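The damping rule $w(r_t)=1+(w_{\max}-1)\exp(-\alpha r_t)$ stated in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that $r_t$ is the norm of the guidance update (conditional minus unconditional velocity) divided by the norm of the unconditional prediction, and that the adapted scale is applied in the standard CFG combination. The function names, the `eps` stabilizer, and the default values of `w_max` and `alpha` are hypothetical choices made for illustration only.

```python
import numpy as np


def adaptive_guidance_scale(r_t, w_max, alpha):
    """Damped guidance scale w(r_t) = 1 + (w_max - 1) * exp(-alpha * r_t)."""
    return 1.0 + (w_max - 1.0) * np.exp(-alpha * r_t)


def guided_velocity(v_cond, v_uncond, w_max=7.0, alpha=1.0, eps=1e-8):
    """Combine conditional and unconditional velocities with a damped scale.

    Assumption (not confirmed by the text above): r_t is the ratio of the
    guidance update norm to the unconditional prediction norm.
    w_max, alpha, and eps are illustrative defaults, not values from the paper.
    """
    update = v_cond - v_uncond
    r_t = np.linalg.norm(update) / (np.linalg.norm(v_uncond) + eps)
    w = adaptive_guidance_scale(r_t, w_max, alpha)
    # Standard CFG combination, with the fixed scale replaced by w(r_t).
    return v_uncond + w * update, float(r_t), float(w)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v_uncond = rng.normal(size=(4, 8, 8))
    # A large update relative to the prediction yields a scale close to 1,
    # while a small update keeps the scale close to w_max.
    for update_size in (0.01, 0.5, 5.0):
        v_cond = v_uncond + update_size * rng.normal(size=v_uncond.shape)
        _, r_t, w = guided_velocity(v_cond, v_uncond)
        print(f"update size {update_size}: r_t={r_t:.3f}, w={w:.3f}")
```

Under these assumptions, the scale equals `w_max` when the conditional and unconditional predictions agree and approaches 1 (plain conditional sampling) when the update dominates the prediction, which matches the described effect of damping disproportionately large early-step updates.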
