FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching
Zhen Zou, Feng Zhao
TL;DR
FEB-Cache addresses the heavy compute burden of Diffusion Transformers by revealing that naive feature caching amplifies exposure bias and proposing a frequency-guided, separated caching scheme for Attn and MLP combined with adaptive noise scaling. By aligning caching with the distinct frequency roles of self-attention and MLP and by stage-aware cache selection, FEB-Cache achieves substantial speedups while preserving generation quality. The approach is validated across image and video diffusion tasks, outperforming several baselines and showing robustness through ablations, while also offering insights into exposure bias–variance dynamics and compatibility with DeepCache. The work provides a practical path to faster, high-fidelity diffusion generation and a new perspective on leveraging caching to accelerate diffusion processes.
Abstract
Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. To address this issue, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning with the non-cached diffusion process without analyzing why caching damages the generation process. In this paper, we first confirm that caching greatly amplifies exposure bias, resulting in a decline in generation quality. Directly applying noise scaling to this problem is difficult, however, because the exposure bias is non-smooth. We find that this phenomenon stems from a mismatch between the frequency response characteristics of Attention and MLP and the simple caching applied to them. These two components exhibit distinct preferences for frequency signals, which motivates a caching strategy that separates Attention and MLP to fit the exposure bias more closely and reduce it. Based on this, we introduce FEB-Cache, a joint caching strategy that caches Attention and MLP according to a frequency-guided cache table and aligns with the diffusion process without exposure bias, which gives a higher performance cap. Our approach builds on a comprehensive understanding of the caching mechanism and offers a new perspective on leveraging caching to accelerate the diffusion process. Empirical results indicate that FEB-Cache improves model performance while also accelerating inference. Code is available at https://github.com/aSleepyTree/EB-Cache.
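
To make the idea of caching Attention and MLP separately concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy residual block whose attention and MLP outputs are cached independently and refreshed according to a per-step table. The names `CachedBlock`, `build_cache_table`, and `run`, and the fixed refresh intervals, are hypothetical: the abstract describes a frequency-guided table but does not specify how it is built, so plain intervals stand in for it here.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


class CachedBlock:
    """Toy residual block with independent caches for attention and MLP outputs."""

    def __init__(self, attn: Callable[[Vector], Vector],
                 mlp: Callable[[Vector], Vector]) -> None:
        self.attn = attn
        self.mlp = mlp
        self.attn_cache: Optional[Vector] = None
        self.mlp_cache: Optional[Vector] = None
        self.calls = {"attn": 0, "mlp": 0}  # counts real (non-cached) evaluations

    def forward(self, x: Vector, recompute_attn: bool, recompute_mlp: bool) -> Vector:
        # Refresh the attention output only when the table asks for it.
        if recompute_attn or self.attn_cache is None:
            self.attn_cache = self.attn(x)
            self.calls["attn"] += 1
        h = [a + b for a, b in zip(x, self.attn_cache)]
        # The MLP output has its own cache and its own refresh decision.
        if recompute_mlp or self.mlp_cache is None:
            self.mlp_cache = self.mlp(h)
            self.calls["mlp"] += 1
        return [a + b for a, b in zip(h, self.mlp_cache)]


def build_cache_table(num_steps: int, attn_interval: int,
                      mlp_interval: int) -> List[Tuple[bool, bool]]:
    """Hypothetical stand-in for a frequency-guided table: one
    (recompute_attn, recompute_mlp) pair per sampling step."""
    return [(t % attn_interval == 0, t % mlp_interval == 0)
            for t in range(num_steps)]


def run(block: CachedBlock, x: Vector, table: List[Tuple[bool, bool]]) -> Vector:
    """Apply the block once per step, following the cache table."""
    for recompute_attn, recompute_mlp in table:
        x = block.forward(x, recompute_attn, recompute_mlp)
    return x


if __name__ == "__main__":
    block = CachedBlock(attn=lambda v: [0.5 * e for e in v],
                        mlp=lambda v: [0.1 * e for e in v])
    table = build_cache_table(num_steps=8, attn_interval=4, mlp_interval=2)
    run(block, [1.0, 2.0], table)
    print(block.calls)
```

Because the two caches are decoupled, the table can refresh one component more often than the other. The intervals above are arbitrary; in the paper the choice is guided by the frequency behavior of each component and the sampling stage.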
