BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching

Hanshuai Cui, Zhiqing Tang, Zhifei Xu, Zhi Yao, Wenyi Zeng, Weijia Jia

TL;DR

This paper proposes Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. A similarity indicator triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold.

Abstract

Recent advancements in Diffusion Transformers (DiTs) have established them as the state-of-the-art method for video generation. However, their inherently sequential denoising process results in inevitable latency, limiting real-world applicability. Existing acceleration methods either compromise visual quality due to architectural modifications or fail to reuse intermediate features at proper granularity. Our analysis reveals that DiT blocks are the primary contributors to inference latency. Across diffusion timesteps, the feature variations of DiT blocks exhibit a U-shaped pattern with high similarity during intermediate timesteps, which suggests substantial computational redundancy. In this paper, we propose Block-Wise Caching (BWCache), a training-free method to accelerate DiT-based video generation. BWCache dynamically caches and reuses features from DiT blocks across diffusion timesteps. Furthermore, we introduce a similarity indicator that triggers feature reuse only when the differences between block features at adjacent timesteps fall below a threshold, thereby minimizing redundant computations while maintaining visual fidelity. Extensive experiments on several video diffusion models demonstrate that BWCache achieves up to 6$\times$ speedup with comparable visual quality.
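
The caching rule described in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the indicator is taken to be the mean relative L1 distance between block outputs at adjacent computed timesteps, and the names `relative_l1`, `run_with_block_cache`, `max_reuse`, and `step_size`, as well as the toy update rule, are hypothetical.

```python
import math


def relative_l1(curr, prev, eps=1e-8):
    """Relative L1 distance between two feature vectors (lists of floats)."""
    num = sum(abs(c - p) for c, p in zip(curr, prev))
    den = sum(abs(p) for p in prev) + eps
    return num / den


def run_with_block_cache(blocks, x0, num_steps, threshold,
                         max_reuse=1, step_size=0.1):
    """Toy denoising loop with block-wise feature caching.

    blocks: list of callables f(h, t) -> h, standing in for DiT blocks.
    Returns (final state, number of block evaluations performed).
    All names and the update rule are illustrative assumptions.
    """
    cache = None      # block outputs from the most recently computed timestep
    reuse_left = 0    # how many upcoming timesteps may reuse the cache
    evals = 0
    x = list(x0)
    for t in range(num_steps):
        if reuse_left > 0 and cache is not None:
            # Reuse: skip every block and take the cached last-block output.
            h = cache[-1]
            reuse_left -= 1
        else:
            # Recompute every block and record its output.
            h = x
            feats = []
            for block in blocks:
                h = block(h, t)
                feats.append(h)
                evals += 1
            if cache is not None:
                # Similarity indicator: mean relative L1 over all blocks,
                # comparing this timestep with the previously computed one.
                dists = [relative_l1(c, p) for c, p in zip(feats, cache)]
                indicator = sum(dists) / len(dists)
                if indicator < threshold:
                    reuse_left = max_reuse
            cache = feats  # update the cache with fresh features
        # Toy scheduler step (assumption): move x against the block output.
        x = [xi - step_size * hi for xi, hi in zip(x, h)]
    return x, evals


if __name__ == "__main__":
    toy_blocks = [
        lambda h, t: [math.tanh(v) for v in h],
        lambda h, t: [0.5 * v for v in h],
    ]
    _, full = run_with_block_cache(toy_blocks, [1.0] * 8, 5, threshold=0.0)
    _, cached = run_with_block_cache(toy_blocks, [1.0] * 8, 5, threshold=0.5)
    print("block evaluations without reuse:", full)
    print("block evaluations with reuse:", cached)
```

A threshold of zero never triggers reuse, so every block runs at every timestep. Larger thresholds trade fidelity for fewer block evaluations, which is the quality-latency trade-off the paper evaluates.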

Paper Structure

This paper contains 31 sections, 11 equations, 19 figures, 15 tables.

Figures (19)

  • Figure 1: Quality-latency comparisons for video diffusion models. Visual quality versus latency curves are presented for the proposed BWCache method, PAB, and TeaCache using Open-Sora. BWCache demonstrates significantly superior visual quality and efficiency compared to both PAB and TeaCache. Latency is evaluated on a single NVIDIA A800 GPU when generating 51-frame, 480P videos.
  • Figure 2: Overview of BWCache. An indicator based on adjacent-timestep differences in block features determines whether to reuse the cache. If the condition is met, subsequent timesteps reuse the cached blocks; otherwise, the blocks are recomputed and the cache is updated.
  • Figure 3: Overview of DiT-based video generation models.
  • Figure 4: Analysis of the block in the DiT-based model.
  • Figure 5: Aggregation relative L1 of different models.
  • ...and 14 more figures