ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion

Xurui Peng, Chenqian Yan, Hong Liu, Rui Ma, Fangmin Chen, Xing Wang, Zhihua Wu, Songwei Liu, Mingbao Lin

TL;DR

Diffusion models incur high inference cost due to iterative sampling. The paper analyzes cache-based acceleration and decomposes cache-induced degradation into feature shift and step amplification errors. It introduces ERTACache, combining offline residual profiling, trajectory-aware timestep adjustment, and explicit residual rectification to enable accurate, aggressive cache reuse. Empirical results on image and video benchmarks demonstrate consistent speedups up to about 2x with preserved or improved fidelity across several state-of-the-art diffusion backbones, including Wan2.1. The work provides a practical, theory-grounded path to efficient diffusion sampling.

Abstract

Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at https://github.com/bytedance/ERTACache.
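The caching idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a scalar stand-in `model` for the diffusion backbone, a plain Euler sampler, and a hypothetical `profile` routine that marks reusable steps from calibration runs and stores a per-step mean residual. That stored residual is a simple stand-in for the paper's closed-form residual linearization, and the trajectory-aware timestep adjustment is not modeled at all. All function names, the threshold value, and the reuse criterion are illustrative assumptions.

```python
import math

def model(x, t):
    """Hypothetical scalar 'network'; stands in for an expensive backbone call."""
    return -x + math.sin(3.0 * t)

def schedule(n):
    """Uniform timesteps from t=1 down to t=0 (n Euler steps)."""
    return [1.0 - i / n for i in range(n + 1)]

def sample_full(x0, n):
    """Plain Euler sampling; returns the final state and every model output."""
    ts, x, outs = schedule(n), x0, []
    for i in range(n):
        v = model(x, ts[i])
        outs.append(v)
        x += (ts[i + 1] - ts[i]) * v
    return x, outs

def profile(calib_x0s, n, threshold):
    """Offline pass over calibration inputs (assumed procedure, not the paper's).

    Marks step i reusable while the accumulated relative change of the mean
    output stays below `threshold`, and stores for each reusable step the mean
    residual between the true output and the output that would be cached.
    """
    runs = [sample_full(x0, n)[1] for x0 in calib_x0s]
    mean = [sum(r[i] for r in runs) / len(runs) for i in range(n)]
    reuse, resid = [False] * n, [0.0] * n
    acc, last = 0.0, 0  # accumulated change, index of last computed step
    for i in range(1, n):
        acc += abs(mean[i] - mean[i - 1]) / (abs(mean[i - 1]) + 1e-8)
        if acc < threshold:
            reuse[i] = True
            resid[i] = mean[i] - mean[last]
        else:
            acc, last = 0.0, i  # recompute here and restart accumulation
    return reuse, resid

def sample_cached(x0, n, reuse, resid, rectify=True):
    """Euler sampling that skips the model on reusable steps.

    With rectify=True the profiled residual is added to the cached output;
    with rectify=False the cached output is reused as-is (naive caching).
    Returns the final state and the number of model calls actually made.
    """
    ts, x, cached, calls = schedule(n), x0, 0.0, 0
    for i in range(n):
        if reuse[i]:
            v = cached + (resid[i] if rectify else 0.0)
        else:
            v = cached = model(x, ts[i])
            calls += 1
        x += (ts[i + 1] - ts[i]) * v
    return x, calls

if __name__ == "__main__":
    reuse, resid = profile([0.8, 1.0, 1.2], n=50, threshold=0.15)
    x_ref, _ = sample_full(1.1, 50)
    x_fast, calls = sample_cached(1.1, 50, reuse, resid)
    print(f"model calls: {calls}/50, |error| = {abs(x_fast - x_ref):.2e}")
```

Running `sample_cached` with `rectify=False` on the same inputs gives the naive-reuse baseline, so the two results can be compared against `sample_full` to see how much of the gap the stored residual closes in this toy setting. Raising `threshold` trades more skipped model calls for a larger deviation from the uncached trajectory.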

Paper Structure

This paper contains 37 sections, 33 equations, 6 figures, 4 tables, 2 algorithms.

Figures (6)

  • Figure 1: Framework of our proposed ERTACache.
  • Figure 2: (a) The ground-truth $\ell_1$ distance (blue) between real cached and computed features shows minor variation across timesteps. In contrast, Tea-Cache's predicted $\ell_1$ distance (orange) remains consistent across prompts but diverges significantly from ground-truth in later steps, indicating growing prediction error over time. (b) ODE trajectories with and without timestep adjustment.
  • Figure 3: Comparison of visual quality and computational efficiency against competing approaches, illustrated by the first and last frames of generated video sequences.
  • Figure 4: Visualization effects of each strategy in ERTACache.
  • Figure 5: Illustration of different metrics using different numbers of prompts with timestep adjustment.
  • ...and 1 more figure