From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang

TL;DR

TaylorSeer reframes diffusion-model acceleration from naive feature reuse to forecasting future features along a stable temporal trajectory. By storing higher-order finite differences and applying multi-order Taylor expansions, it predicts intermediate representations with controllable error, enabling near-training-free, high-ratio speedups. The method yields substantial gains across image and video synthesis (up to ~5×) while maintaining or improving quality compared to prior caching approaches. This cache-then-forecast paradigm offers a robust, scalable route to real-time diffusion-based generation on DiT, FLUX, and HunyuanVideo.

Abstract

Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To address this, feature caching has been proposed to accelerate diffusion models by caching features from previous timesteps and reusing them in subsequent timesteps. However, when timesteps are separated by large intervals, feature similarity in diffusion models decreases substantially, so the error introduced by feature caching grows sharply and significantly harms generation quality. To solve this problem, we propose TaylorSeer, which first shows that features of diffusion models at future timesteps can be predicted from their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer uses finite differences to approximate the higher-order derivatives of features and predicts features at future timesteps with a Taylor series expansion. Extensive experiments demonstrate its effectiveness in both image and video synthesis, especially at high acceleration ratios. For instance, it achieves an almost lossless acceleration of 4.99$\times$ on FLUX and 5.00$\times$ on HunyuanVideo without additional training. On DiT, it achieves an FID $3.41$ lower than the previous SOTA at $4.53\times$ acceleration. Our code has been released on GitHub: https://github.com/Shenyi-Z/TaylorSeer
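
As a rough illustration of the cache-then-forecast idea described in the abstract, the sketch below keeps backward finite differences of a feature at fully computed steps and extrapolates with a truncated Taylor series. The function names (`update_differences`, `forecast`), the `interval` parameter, and the toy trajectory are assumptions made for illustration; they are not taken from the released code, and a real model would cache per-layer feature tensors rather than a single number.

```python
import math
import numpy as np


def update_differences(cache, feature, order):
    """Return refreshed backward differences after a full computation.

    cache[i] is the i-th order backward difference at the previous fully
    computed step (cache[0] is the feature itself); pass [] on the first call.
    """
    new = [np.asarray(feature, dtype=float)]
    for i in range(order):
        if i >= len(cache):  # not enough history yet for this order
            break
        new.append(new[i] - cache[i])
    return new


def forecast(cache, k, interval):
    """Forecast the feature k steps after the last full computation.

    The i-th derivative is approximated by cache[i] / interval**i, so the
    truncated Taylor series is sum_i cache[i] * (k / interval)**i / i!.
    """
    return sum(d * (k / interval) ** i / math.factorial(i)
               for i, d in enumerate(cache))


# Toy trajectory (hypothetical): a feature that evolves as s**2,
# fully computed every 2 steps at s = 0, 2, 4 and forecast at s = 5.
cache = []
for s in (0, 2, 4):
    cache = update_differences(cache, np.array([float(s) ** 2]), order=2)
prediction = forecast(cache, k=1, interval=2)
reuse = cache[0]  # order-0 behaviour: reuse the last computed feature
```

With `order=0` the forecast reduces to plain reuse of the cached feature. On this toy quadratic, the second-order forecast lands closer to the true value at `s = 5` than reuse does, although it is not exact because the derivatives are only approximated by finite differences.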

Paper Structure

This paper contains 36 sections, 12 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: PCA projections of features and their derivatives in diffusion models. Both the features (a) and derivatives of features (b) in diffusion models at different timesteps form stable trajectories, indicating that it is possible to predict features of diffusion models at future timesteps based on features from previous timesteps.
  • Figure 2: Comparison between previous caching methods and TaylorSeer. TaylorSeer shows significantly better performance at high-acceleration ratios.
  • Figure 3: An overview of TaylorSeer. (a) TaylorSeer (Order=0): naïve feature caching, which directly reuses computed features across timesteps. (b) TaylorSeer (Order=1): linear prediction, which estimates feature trajectories using first-order finite differences. (c) TaylorSeer (Order=2): extends linear prediction to second-order finite differences for more accurate modeling of nonlinear feature trajectories. (d) TaylorSeer (Order=M): further extends to order M for improved accuracy without sacrificing efficiency.
  • Figure 4: Detailed visualization results for different acceleration methods on FLUX.1-dev. Other methods exhibit issues such as text failure, color distortion, and missing details, whereas TaylorSeer achieves the best quality and acceleration.
  • Figure 5: Visualization of different acceleration methods on HunyuanVideo. While achieving higher acceleration ratios, other methods exhibit issues such as text errors, missing content, and motion detail loss. In contrast, our method demonstrates superior performance, maintaining high-quality generation without these problems.
  • ...and 9 more figures