SeaCache: Spectral-Evolution-Aware Cache for Accelerating Diffusion Models

Jiwoo Chung, Sangeek Hyun, MinKyu Lee, Byeongju Han, Geonho Cha, Dongyoon Wee, Youngjun Hong, Jae-Pil Heo

TL;DR

SeaCache (Spectral-Evolution-Aware Cache) is a training-free cache schedule for diffusion models. It bases reuse decisions on a spectrally aligned representation that preserves content-relevant components while suppressing noise.

Abstract

Diffusion models are a strong backbone for visual generation, but their inherently sequential denoising process leads to slow inference. Previous methods accelerate sampling by caching and reusing intermediate outputs based on feature distances between adjacent timesteps. However, existing caching strategies typically rely on raw feature differences that entangle content and noise. This design overlooks spectral evolution, where low-frequency structure appears early and high-frequency detail is refined later. We introduce Spectral-Evolution-Aware Cache (SeaCache), a training-free cache schedule that bases reuse decisions on a spectrally aligned representation. Through theoretical and empirical analysis, we derive a Spectral-Evolution-Aware (SEA) filter that preserves content-relevant components while suppressing noise. Employing SEA-filtered input features to estimate redundancy leads to dynamic schedules that adapt to content while respecting the spectral priors underlying the diffusion model. Extensive experiments across diverse visual generative models and baselines show that SeaCache achieves state-of-the-art latency-quality trade-offs.

Paper Structure (29 sections, 21 equations, 13 figures, 13 tables)

Figures (13)

  • Figure 1: Conceptual illustration and motivation of the proposed caching scheme (SeaCache) compared with previous caching schemes. The lower panel shows a denoising trajectory of a cat image where coarse low-frequency structure appears at early steps and fine high-frequency details emerge at later steps, illustrating the spectral evolution of iterative generative models. SeaCache applies a Spectral-Evolution-Aware (SEA) Filter to raw diffusion features so that the distance measure better captures timestep-aware spectral residuals between timesteps.
  • Figure 2: Latency-quality trade-off in oracle experiments. We compare cache decisions based on raw output differences and SEA-filtered output differences (Sec. \ref{sub:lpf}) on FLUX \cite{flux2024, labs2025flux1kontextflowmatching} and Wan2.1 1.3B \cite{wan2025}. The refresh ratio is the fraction of timesteps that run a full denoiser evaluation instead of reusing cached features. For each criterion, PSNR is computed between the cached sample and the corresponding full-timestep (no-cache) sample, averaged over each prompt set \cite{saharia2022photorealistic, huang2024vbench}. At matched refresh ratios, the filtered criterion consistently achieves higher PSNR with respect to the full-compute trajectory, validating the effectiveness of a spectrum-aware distance for cache scheduling.
  • Figure 3: Overview of SeaCache. Given input features $I_t$ and $I_{t+1}$, SeaCache first applies FFT, multiplies by the timestep-dependent SEA filters $G_t^{\mathrm{norm}}$ and $G_{t+1}^{\mathrm{norm}}$, and then applies iFFT to obtain spectral-evolution-aware features $\mathcal{P}(G_t^{\mathrm{norm}}, I_t)$ and $\mathcal{P}(G_{t+1}^{\mathrm{norm}}, I_{t+1})$ (Sec. \ref{sub:lpf}). A spectrum-aware dynamic caching module (Sec. \ref{sub:schedule}) measures the relative distance $\widetilde{\Delta}_t$ between consecutive filtered features, accumulates it over timesteps, and either reuses the cached output or refreshes the denoiser when the threshold $\delta$ is exceeded. The underlying diffusion model remains unchanged, so SeaCache acts as a plug-and-play cache policy that replaces only the distance metric.
  • Figure 4: Visualization of timestep-dependent denoising filters. (a) Optimal linear denoising responses $G_t(f)$ across timesteps, where early steps primarily pass low frequencies and later steps gradually include higher frequencies, reflecting spectral evolution. (b) Corresponding normalized filters $G^{\mathrm{norm}}_t(f)$ with unit mean gain, which stabilize filtered feature energy across timesteps and are used as SEA filters for cache scheduling.
  • Figure 5: Relative $\ell_1$ distance across the generation process. Stepwise relative $\ell_1$ distances between consecutive timesteps for different feature choices, averaged over ten samples for each model. Input denotes distances on the timestep-modulated input features $I_t$. Output denotes distances on the last-block outputs $O_t$. SEA(Input) and SEA(Output) apply the SEA filter to the input and output features, respectively. Poly(Input) is the polynomial-fitted input distance, which is designed to approximate output differences from input features. SEA-filtered inputs closely track SEA-filtered outputs across timesteps, whereas the other input-based distances show weaker alignment.
  • ...and 8 more figures
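
The pipeline described in the Figure 3 caption (FFT, multiply by a timestep-dependent filter, iFFT, accumulate a relative distance, refresh when a threshold is exceeded) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the toy low-pass filter `toy_sea_filter`, its cutoff schedule, and the exact normalization of the relative $\ell_1$ distance are all hypothetical; the paper derives its own filter $G_t^{\mathrm{norm}}$, which is not reproduced on this page. Only the unit-mean-gain normalization (Figure 4b) and the accumulate-and-threshold rule with threshold $\delta$ are taken from the captions.

```python
import numpy as np


def toy_sea_filter(shape, t, num_steps):
    """Hypothetical stand-in for G_t^norm: a radial low-pass gain with unit mean.

    Assumption: t runs from num_steps - 1 (noisiest) down to 0 (cleanest), and
    the cutoff widens as sampling progresses. This is NOT the paper's filter.
    """
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    progress = 1.0 - t / max(num_steps - 1, 1)
    cutoff = 0.05 + 0.45 * progress  # illustrative schedule only
    gain = 1.0 / (1.0 + (radius / cutoff) ** 2)
    return gain / gain.mean()  # normalize to unit mean gain


def apply_filter(gain, feat):
    """P(G, I): FFT, multiply by the gain, inverse FFT."""
    return np.fft.ifft2(np.fft.fft2(feat) * gain).real


def relative_l1(cur, prev, eps=1e-8):
    """Relative l1 distance; this particular normalization is an assumption."""
    return np.abs(cur - prev).sum() / (np.abs(prev).sum() + eps)


def cache_schedule(inputs, delta):
    """Return one bool per step: True = run the denoiser, False = reuse cache.

    `inputs` is a list of 2D feature maps ordered from the first (noisiest)
    sampling step to the last.
    """
    num_steps = len(inputs)
    decisions = [True]  # the first step always needs a full evaluation
    accumulated = 0.0
    prev = apply_filter(
        toy_sea_filter(inputs[0].shape, num_steps - 1, num_steps), inputs[0]
    )
    for i in range(1, num_steps):
        t = num_steps - 1 - i
        cur = apply_filter(toy_sea_filter(inputs[i].shape, t, num_steps), inputs[i])
        accumulated += relative_l1(cur, prev)
        if accumulated > delta:
            decisions.append(True)  # refresh and reset the accumulator
            accumulated = 0.0
        else:
            decisions.append(False)  # reuse the cached output
        prev = cur
    return decisions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = [rng.standard_normal((16, 16)) for _ in range(8)]
    decisions = cache_schedule(feats, delta=0.5)
    print("refresh ratio:", sum(decisions) / len(decisions))
```

The refresh ratio printed at the end corresponds to the quantity defined in the Figure 2 caption: the fraction of steps that run a full denoiser evaluation. Raising `delta` lowers that ratio (more reuse), and lowering it pushes the schedule toward full computation.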