Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon

TL;DR

This work proposes the spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. The authors show theoretically that the approach has more favorable long-horizon behavior and yields an error bound that does not compound with the step size.

Abstract

Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79$\times$ speedup on FLUX.1 and 4.67$\times$ speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
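
The fit-then-forecast idea described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `fit_chebyshev_ridge` and `forecast`, the mapping of time from $[0,1]$ to $[-1,1]$, the toy feature trajectory, and the choice of which steps are cached are all hypothetical.

```python
import numpy as np
from numpy.polynomial.chebyshev import chebvander


def fit_chebyshev_ridge(taus, feats, degree, lam):
    """Fit ridge-regularised Chebyshev coefficients to cached features.

    taus:  (K,) times in [0, 1] at which the network was actually evaluated
    feats: (K, D) flattened features cached at those times
    Returns an array of shape (degree + 1, D), one coefficient row per basis.
    """
    x = 2.0 * np.asarray(taus) - 1.0            # assumed map [0, 1] -> [-1, 1]
    phi = chebvander(x, degree)                 # (K, degree + 1) basis matrix
    gram = phi.T @ phi + lam * np.eye(degree + 1)
    return np.linalg.solve(gram, phi.T @ feats)  # closed-form ridge solution


def forecast(coeffs, taus_future):
    """Evaluate the fitted expansion at (possibly several) future times."""
    x = 2.0 * np.asarray(taus_future) - 1.0
    return chebvander(x, coeffs.shape[0] - 1) @ coeffs


# Hypothetical toy "feature trajectory": 3 channels varying smoothly in time.
def toy_features(tau):
    tau = np.asarray(tau)
    return np.stack([np.sin(3 * tau), np.cos(2 * tau), tau ** 2], axis=-1)


cached_taus = np.linspace(0.0, 0.6, 7)   # steps with a real network call
future_taus = np.array([0.7, 0.8])       # steps to skip and forecast instead
coeffs = fit_chebyshev_ridge(cached_taus, toy_features(cached_taus),
                             degree=4, lam=1e-6)
pred = forecast(coeffs, future_taus)
print("max abs forecast error:", np.abs(pred - toy_features(future_taus)).max())
```

A single fit serves several future steps at once, which is the "multiple future diffusion steps" aspect mentioned above. The regularization weight `lam` and the polynomial degree correspond in spirit to the $\lambda$ and $M$ ablated in Figures 4 and 5, although the values used here are arbitrary.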

Paper Structure

This paper contains 23 sections, 3 theorems, 33 equations, 11 figures, 8 tables, 1 algorithm.

Key Result

Theorem 3.1

Fix an expansion point $\tau_k\in[0,1]$ and a target $\tau_j=\tau_k+(j-k)\delta_t$. Consider the smoothness class [...]. Let $T_P[f](\tau_j)$ denote the ideal order-$P$ Taylor predictor of $f(\tau_j)$ centered at $\tau_k$ using the exact derivatives $f^{(p)}(\tau_k)$, $p\le P$. Then [...]

Figures (11)

  • Figure 1: Qualitative comparison on text-to-image generation using FLUX.1. Spectrum aligns consistently with the 50-step reference while accelerating it by a factor of 4.79$\times$. Other baselines show noticeable degradation in color and prompt consistency.
  • Figure 2: Qualitative comparisons on HunyuanVideo. Spectrum achieves higher sample fidelity while delivering more speedup.
  • Figure 3: Qualitative comparison on text-to-video generation using Wan2.1-14B. Spectrum aligns consistently with the high-quality 50-step reference using only 14 network evaluations, while TaylorSeer is slower and exhibits noticeable artifacts in characters and background.
  • Figure 4: Ablation study on the regularization weight $\lambda$.
  • Figure 5: Ablation on the degree of Chebyshev polynomials $M$.
  • ...and 6 more figures

Theorems & Definitions (5)

  • Theorem 3.1: Worst-case error for local order-$P$ Taylor
  • Theorem 3.2: Universality of Chebyshev Polynomials
  • Theorem 3.3: Error Bound of Spectrum
  • proof
  • proof
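
For context on Theorem 3.1, the order-$P$ Taylor predictor it refers to and the classical Lagrange remainder can be written out as below. This is the standard textbook form for a scalar-valued, $(P+1)$-times differentiable $f$, given only as background; it is not the theorem's own statement, and the derivative bound $B$ is a symbol introduced here for illustration.

```latex
T_P[f](\tau_j) = \sum_{p=0}^{P} \frac{f^{(p)}(\tau_k)}{p!}\,\bigl((j-k)\,\delta_t\bigr)^{p},
\qquad
f(\tau_j) - T_P[f](\tau_j) = \frac{f^{(P+1)}(\xi)}{(P+1)!}\,\bigl((j-k)\,\delta_t\bigr)^{P+1}
\quad \text{for some } \xi \text{ between } \tau_k \text{ and } \tau_j .

% If |f^{(P+1)}| \le B on that interval (B is an assumed bound), then
\bigl| f(\tau_j) - T_P[f](\tau_j) \bigr| \le \frac{B\,\bigl|(j-k)\,\delta_t\bigr|^{P+1}}{(P+1)!} .
```

The bound grows polynomially in the horizon $|(j-k)\delta_t|$, which matches the abstract's observation that local approximations degrade with large skips.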