FEB-Cache: Frequency-Guided Exposure Bias Reduction for Enhancing Diffusion Transformer Caching

Zhen Zou, Feng Zhao

TL;DR

FEB-Cache addresses the heavy compute burden of Diffusion Transformers by revealing that naive feature caching amplifies exposure bias and proposing a frequency-guided, separated caching scheme for Attn and MLP combined with adaptive noise scaling. By aligning caching with the distinct frequency roles of self-attention and MLP and by stage-aware cache selection, FEB-Cache achieves substantial speedups while preserving generation quality. The approach is validated across image and video diffusion tasks, outperforming several baselines and showing robustness through ablations, while also offering insights into exposure bias–variance dynamics and compatibility with DeepCache. The work provides a practical path to faster, high-fidelity diffusion generation and a new perspective on leveraging caching to accelerate diffusion processes.

Abstract

Diffusion Transformer (DiT) has exhibited impressive generation capabilities but faces great challenges due to its high computational complexity. To address this issue, various methods, notably feature caching, have been introduced. However, these approaches focus on aligning with non-cached diffusion without analyzing why caching damages the generation process. In this paper, we first confirm that caching greatly amplifies exposure bias, resulting in a decline in generation quality. Directly applying noise scaling to this problem is challenging, however, because the exposure bias is non-smooth. We find that this non-smoothness stems from a mismatch between the frequency response characteristics of the exposure bias and the naive caching of Attention and MLP together. Because these two components exhibit distinct preferences for frequency signals, caching Attention and MLP separately yields a better fit to the exposure bias and allows it to be reduced. Based on this, we introduce FEB-Cache, a joint caching strategy that caches Attention and MLP according to a frequency-guided cache table and aligns with the diffusion process without exposure bias (which gives a higher performance cap). Our approach builds on a comprehensive understanding of the caching mechanism and offers a new perspective on leveraging caching to accelerate the diffusion process. Empirical results indicate that FEB-Cache improves model performance while also accelerating generation. Code is available at https://github.com/aSleepyTree/EB-Cache.
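To make the idea of separate Attention and MLP caching concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy block whose two branches are plain functions, a hypothetical `cache_table` listing one `(reuse_attn, reuse_mlp)` pair per sampling step, and a hypothetical `noise_scale` factor standing in for noise scaling. The names `CachedBlock` and `run_sampler`, the update rule, and the table values are all illustrative assumptions.

```python
import numpy as np


class CachedBlock:
    """Toy transformer-style block with separate caches for its two branches."""

    def __init__(self, attn_fn, mlp_fn):
        self.attn_fn = attn_fn
        self.mlp_fn = mlp_fn
        self.attn_cache = None
        self.mlp_cache = None
        self.calls = {"attn": 0, "mlp": 0}  # counts real (non-cached) evaluations

    def __call__(self, x, reuse_attn, reuse_mlp):
        # Recompute a branch unless the table asks for reuse and a cache exists.
        if not (reuse_attn and self.attn_cache is not None):
            self.attn_cache = self.attn_fn(x)
            self.calls["attn"] += 1
        x = x + self.attn_cache
        if not (reuse_mlp and self.mlp_cache is not None):
            self.mlp_cache = self.mlp_fn(x)
            self.calls["mlp"] += 1
        return x + self.mlp_cache


def run_sampler(block, x, cache_table, noise_scale=1.0):
    """Run one block call per step; cache_table[t] = (reuse_attn, reuse_mlp)."""
    for reuse_attn, reuse_mlp in cache_table:
        # Hypothetical update rule: the scaled block output plays the role
        # of a scaled noise prediction.
        x = x - 0.1 * noise_scale * block(x, reuse_attn, reuse_mlp)
    return x


# Hypothetical table: reuse attention on steps 1-2, reuse the MLP on step 3.
table = [(False, False), (True, False), (True, False), (False, True)]
block = CachedBlock(attn_fn=lambda h: 0.5 * h, mlp_fn=lambda h: np.tanh(h))
out = run_sampler(block, np.ones((4, 8)), table)
print(block.calls)
```

In this sketch the table decides independently, per step, which branch is recomputed, so Attention and MLP can be skipped at different stages of sampling rather than being cached as a single unit.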

Paper Structure

This paper contains 33 sections, 28 equations, 19 figures, 13 tables, 1 algorithm.

Figures (19)

  • Figure 1: (a) Caching increases the SNR of images. (b) For ease of observation, we use factors larger than 1 to amplify exposure bias, which increases the SNR of the images.
  • Figure 2: (a) $L_2$ norm of the low-frequency component of intermediate noisy images. (b) $L_2$ norm of the high-frequency component of intermediate noisy images. Note that calculations are performed in the image space decoded by the VAE (Kingma & Welling, 2013). (c) Images generated under Attn-Cache or MLP-Cache show different frequency influence.
  • Figure 3: (a) Illustration of FEB-Cache. FEB-Cache reduces exposure bias while achieving a notable speedup. The frequency-guided cache table is pre-generated on $n$ samples, as shown in (b). Details can be found in Algorithm 1.
  • Figure 4: Qualitative comparison on ImageNet 256$\times$256 with 50 and 100 NFEs.
  • Figure 5: (a) Simply scaling the noise helps the vanilla cache align with No-cache; our method aligns more closely. (b) L2C also tends to align with No-cache in terms of SNR.
  • ...and 14 more figures
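
The low- and high-frequency $L_2$ norms described for Figure 2 can be illustrated with a short sketch. The captions above do not state how the frequency split is performed, so this example assumes a circular low-pass mask in the 2D Fourier domain with a hypothetical cutoff `radius_frac`, applied to a single-channel array that stands in for a decoded image. The function name `frequency_norms` is illustrative.

```python
import numpy as np


def frequency_norms(image, radius_frac=0.25):
    """Return (low, high) L2 norms of a 2D array split by a circular FFT mask."""
    h, w = image.shape
    # Move the zero-frequency term to the centre of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= radius_frac * min(h, w)  # assumed cutoff, not from the paper
    # Keep only low frequencies and transform back to image space.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = image - low  # the residual holds the high-frequency content
    return np.linalg.norm(low), np.linalg.norm(high)


rng = np.random.default_rng(0)
smooth = np.ones((32, 32))                      # constant image: low frequency only
noisy = smooth + rng.normal(size=(32, 32))      # added noise spreads over all frequencies
low_s, high_s = frequency_norms(smooth)
low_n, high_n = frequency_norms(noisy)
print(low_s, high_s, low_n, high_n)
```

Tracking these two norms across sampling steps is one way to compare how different caching choices affect low- and high-frequency content.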