Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching

Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, Xiang Bai

TL;DR

The paper tackles the high computational cost of diffusion-transformer–based video generation by introducing EasyCache, a training-free, runtime-adaptive caching framework. It demonstrates that the transformation rate within diffusion steps exhibits stability, allowing previously computed transformation vectors to approximate future outputs with controllable error via a lightweight, online criterion. Through extensive experiments on OpenSora, Wan2.1, and HunyuanVideo, EasyCache achieves up to 3.3× speedups with minimal fidelity loss and outperforms prior training-free methods like TeaCache, while remaining compatible with other acceleration techniques. The approach holds practical impact for enabling faster, more accessible high-quality video synthesis across research and real-world deployments.

Abstract

Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference speeds and substantial computational costs, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by 2.1-3.3$\times$ compared to the original baselines while maintaining high visual fidelity, with a PSNR improvement of up to 36% over the previous SOTA method. This improvement makes EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.

Paper Structure

This paper contains 17 sections, 7 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Comparison of (a) the default iterative denoising, (b) static caching with fixed intervals, (c) dynamic caching with an external "dataset prior", and (d) our dynamic caching, which reuses computation according to a runtime-adaptive criterion.
  • Figure 2: Analysis of feature dynamics. (a) The L1 norm of the input and output of each step. (b) The changes in the relative transformation rate between consecutive steps.
  • Figure 3: Overview of our method. For simplicity, we start from a fully computed step $i$. A runtime-adaptive criterion evaluates each subsequent step, reusing the cached transformation vector $\Delta_i$ while the accumulated deviation $E_t$ remains below a threshold $\tau$. A full computation is performed once the threshold is exceeded, as exemplified by step $j$.
  • Figure 4: Qualitative comparison of EasyCache with the baseline and prior acceleration methods [zhao2024real, liu2025timestep] on (a) Wan2.1-14B [wan2025] and (b) HunyuanVideo [kong2024hunyuanvideo]. Our approach consistently produces results that are closer to the original video.
  • Figure 5: Compatibility with another acceleration technique, SVG [xi2025sparse], on HunyuanVideo (129 frames, 1280$\times$720).
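
The caching loop described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cached_sampling`, the `warmup` parameter, the L1-norm deviation estimate, and the toy update rule (the next input is simply the current output) are all assumptions made for illustration. Only the overall pattern comes from the caption: reuse the cached transformation vector while the accumulated deviation stays below the threshold $\tau$, otherwise run a full computation and refresh the cache.

```python
from typing import Callable, List, Optional, Tuple

Vector = List[float]


def l1(v: Vector) -> float:
    """L1 norm of a vector."""
    return sum(abs(a) for a in v)


def sub(a: Vector, b: Vector) -> Vector:
    return [p - q for p, q in zip(a, b)]


def add(a: Vector, b: Vector) -> Vector:
    return [p + q for p, q in zip(a, b)]


def cached_sampling(
    model: Callable[[Vector, int], Vector],
    x: Vector,
    num_steps: int,
    tau: float,
    warmup: int = 2,  # hypothetical: number of initial steps that are always computed
) -> Tuple[Vector, List[int]]:
    """Run num_steps of a toy sampler, skipping model calls when the
    estimated accumulated deviation stays below tau.

    Returns the final state and the list of fully computed steps.
    """
    delta: Optional[Vector] = None     # cached transformation vector (output - input)
    rate: Optional[float] = None       # assumed estimate: output change per unit input change
    prev_x: Optional[Vector] = None
    prev_out: Optional[Vector] = None
    accumulated = 0.0                  # accumulated deviation since the last full step
    computed_steps: List[int] = []

    for t in range(num_steps):
        reuse = False
        if t >= warmup and delta is not None and rate is not None:
            # Assumed criterion: predict the relative output change from the
            # observed input change, without calling the model.
            est = rate * l1(sub(x, prev_x)) / max(l1(prev_out), 1e-12)
            if accumulated + est < tau:
                accumulated += est
                reuse = True

        if reuse:
            out = add(x, delta)        # approximate the output with the cached vector
        else:
            out = model(x, t)          # full computation
            if prev_x is not None:
                dx = l1(sub(x, prev_x))
                if dx > 0:
                    rate = l1(sub(out, prev_out)) / dx
            delta = sub(out, x)        # refresh the cache
            accumulated = 0.0          # reset the deviation budget
            computed_steps.append(t)

        prev_x, prev_out = x, out
        x = out                        # toy update rule, not a real diffusion scheduler

    return x, computed_steps


if __name__ == "__main__":
    def toy_model(x: Vector, t: int) -> Vector:
        # Stand-in for a diffusion transformer: a simple contraction.
        return [0.9 * a for a in x]

    final, steps = cached_sampling(toy_model, [1.0, -2.0], num_steps=10, tau=0.2)
    print("fully computed steps:", steps)
```

In this sketch, `tau=0.0` disables reuse entirely, while a larger `tau` trades accuracy for fewer model calls. A real sampler would replace the toy update rule with its scheduler step and operate on tensors rather than Python lists.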