Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers

Guandong Li

TL;DR

SpectralCache is a unified, training-free caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution it achieves a 2.46x speedup with LPIPS 0.217 and SSIM 0.727, running 16% faster than TeaCache (2.12x) at comparable quality.

Abstract

Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time, depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal -- sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth -- consecutive caching decisions lead to cascading approximation errors; and (3) feature -- different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves a 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
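
The headline comparison can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the abstract; reading the "16%" and "< 1%" figures as relative differences against TeaCache is an assumption, and the helper name `relative_change` is illustrative rather than taken from the paper.

```python
# Metrics quoted in the abstract (FLUX.1-schnell, 512x512).
spectral_cache = {"speedup": 2.46, "lpips": 0.217, "ssim": 0.727}
tea_cache = {"speedup": 2.12, "lpips": 0.215, "ssim": 0.734}


def relative_change(new, ref):
    """Signed relative difference of `new` with respect to `ref`."""
    return (new - ref) / ref


# Assumption: the abstract's percentages are relative to the TeaCache values.
speed_gain = relative_change(spectral_cache["speedup"], tea_cache["speedup"])
lpips_gap = relative_change(spectral_cache["lpips"], tea_cache["lpips"])
ssim_gap = relative_change(spectral_cache["ssim"], tea_cache["ssim"])

print(f"speed gain vs TeaCache: {speed_gain:.1%}")
print(f"LPIPS gap (lower is better): {lpips_gap:.1%}")
print(f"SSIM gap (higher is better): {ssim_gap:.1%}")
```

Under that reading, the speed gain rounds to 16% and the LPIPS gap stays just under 1%, which matches the abstract. SSIM is also slightly lower than TeaCache's, by about the same margin.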

Paper Structure (61 sections, 12 equations, 6 figures, 4 tables, 1 algorithm)

Figures (6)

  • Figure 1: Per-timestep caching sensitivity on FLUX.1-schnell (20 steps). The L2 error $\mathcal{E}(t)$ exhibits an asymmetric U-shaped profile: early and late timesteps are highly sensitive, while the middle regime ($t_6$--$t_{14}$) is remarkably tolerant.
  • Figure 2: Error comparison between consecutive and randomly distributed block caching at matched cache rates on FLUX.1-schnell. Consecutive caching produces substantially higher L2 errors, confirming super-linear error accumulation through the residual stream.
  • Figure 3: Relative L2 change across 8 DCT frequency bands for a middle transformer block ($\ell=9$) on FLUX.1-schnell. Low-frequency components (bands 1--2) exhibit ${\sim}30\%$ higher temporal volatility than high-frequency components (bands 7--8), revealing spectral heterogeneity in hidden state dynamics.
  • Figure 4: SpectralCache framework. Input $\mathbf{H}_{t,0}$ is normalized to $\mathbf{M}_t$. TADS computes adaptive threshold $\tau^{\text{eff}}$. Three checks (CEB, FDC, distance) gate caching. If all pass, cached residual is reused; otherwise, full computation is performed.
  • Figure 5: TADS cosine bell schedule. The scaling factor $s(t)$ is small at the endpoints (conservative caching) and peaks at the midpoint (aggressive caching), aligning with the U-shaped sensitivity profile from Figure 1.
  • ...and 1 more figure
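
The captions for Figures 4 and 5 describe a cosine bell scaling factor and a three-way gate, and the sketch below shows one way these could fit together. The paper's actual equations are not reproduced in this listing, so every formula, constant, and name here (`tads_scale`, `relative_l1`, `should_reuse_cache`, `s_min`, `s_max`, `base_tau`, `fdc_tau`, `budget`) is a hypothetical stand-in, not the authors' implementation.

```python
import math


def tads_scale(step, num_steps, s_min=0.5, s_max=1.5):
    """Assumed cosine bell: s_min at both endpoints, s_max at the midpoint."""
    if num_steps < 2:
        return s_min
    phase = 2.0 * math.pi * step / (num_steps - 1)
    bell = 0.5 * (1.0 - math.cos(phase))  # 0 at the ends, 1 in the middle
    return s_min + (s_max - s_min) * bell


def relative_l1(current, cached):
    """Assumed distance metric: relative L1 change between two flat vectors."""
    num = sum(abs(a - b) for a, b in zip(current, cached))
    den = sum(abs(b) for b in cached)
    return num / den if den > 0 else float("inf")


def should_reuse_cache(distance, low_freq_change, spent_error, step, num_steps,
                       base_tau=0.1, fdc_tau=0.2, budget=0.3):
    """Reuse the cached residual only if all three (hypothetical) checks pass."""
    tau_eff = base_tau * tads_scale(step, num_steps)  # TADS-adapted threshold
    distance_ok = distance < tau_eff                  # distance check
    fdc_ok = low_freq_change < fdc_tau                # FDC: low bands are stable
    ceb_ok = spent_error + distance <= budget         # CEB: budget not exhausted
    return distance_ok and fdc_ok and ceb_ok


# The same input change is rejected at the first step but accepted mid-trajectory.
print(should_reuse_cache(0.08, 0.1, 0.0, step=0, num_steps=20))
print(should_reuse_cache(0.08, 0.1, 0.0, step=10, num_steps=20))
```

Multiplying a base threshold by $s(t)$ keeps the schedule a single scalar knob per timestep. Expressing the CEB check as a running sum means a long run of consecutive cache hits eventually forces a full recomputation, which is the behavior the Figure 2 caption motivates.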

Theorems & Definitions (2)

  • proof : Proof Sketch
  • proof : Proof Sketch