UniCP: A Unified Caching and Pruning Framework for Efficient Video Generation

Wenzhang Sun, Qirui Hou, Donglin Di, Jiahui Yang, Yongjia Ma, Jianxun Cui

TL;DR

This work addresses the high computational cost of attention in diffusion-transformer-based video generation by introducing UniCP, a unified framework that combines caching and pruning across temporal and spatial dimensions. The approach comprises three components: Error-Aware Dynamic Cache Window (EDCW), which adapts cache windows to error signals; PCA-based Slicing (PCAS), which prunes redundant attention components in steps that cannot be cached; and Dynamic Weight Shift (DWS), which integrates caching and pruning. Empirically, UniCP delivers speedups of up to 1.6× on a single GPU across multiple DiT-based video models while largely preserving perceptual and structural video quality. The methods make deployment of high-fidelity video generation more practical and efficient, with potential for real-time or resource-constrained applications.

Abstract

Diffusion Transformers (DiT) excel in video generation but encounter significant computational challenges due to the quadratic complexity of attention. Notably, attention differences between adjacent diffusion steps follow a U-shaped pattern. Current methods leverage this property by caching attention blocks; however, they still struggle with sudden error spikes and large discrepancies. To address these issues, we propose UniCP, a unified caching and pruning framework for efficient video generation. UniCP optimizes both temporal and spatial dimensions through the following components. Error-Aware Dynamic Cache Window (EDCW) dynamically adjusts cache window sizes for different blocks at various timesteps, adapting to abrupt error changes. PCA-based Slicing (PCAS) and Dynamic Weight Shift (DWS) work together: PCAS prunes redundant attention components, and DWS integrates caching and pruning by enabling dynamic switching between pruned and cached outputs. By adjusting cache windows and pruning redundant components, UniCP enhances computational efficiency and maintains video detail fidelity. Experimental results show that UniCP outperforms existing methods in both performance and efficiency.
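
To make the idea of an error-aware cache window concrete, here is a minimal sketch and not the paper's algorithm. It assumes a per-step error estimate is already available for one attention block (the abstract does not say how it is measured). The function name `plan_cache_steps` and the parameters `threshold` and `max_window` are hypothetical. The cache is reused until the accumulated error exceeds the threshold or the window reaches its cap, so windows grow where the error curve is flat and shrink around spikes.

```python
def plan_cache_steps(step_errors, threshold, max_window):
    """Decide, per diffusion step, whether one attention block is recomputed.

    step_errors[i] is an assumed estimate of how much the block's attention
    output changed between step i-1 and step i. Returns a list of booleans:
    True means "recompute at this step", False means "reuse the cached output".
    """
    recompute = []
    accumulated = 0.0  # error accumulated since the last recomputation
    reused = 0         # consecutive steps served from the cache
    for i, err in enumerate(step_errors):
        accumulated += err
        if i == 0 or accumulated > threshold or reused >= max_window:
            # First step, error budget exceeded, or window cap reached:
            # refresh the cache.
            recompute.append(True)
            accumulated = 0.0
            reused = 0
        else:
            recompute.append(False)
            reused += 1
    return recompute


# Hypothetical U-shaped error curve with one sudden spike at step 6.
errors = [0.9, 0.5, 0.1, 0.1, 0.1, 0.1, 0.8, 0.1, 0.5, 0.9]
plan = plan_cache_steps(errors, threshold=0.35, max_window=3)
```

With these made-up numbers the cache is reused for three consecutive steps in the flat middle of the curve, but for only one step after the spike, which is the kind of adaptation a fixed, manually chosen window cannot provide.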

Paper Structure

This paper contains 15 sections, 3 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: Accelerating video generation methods such as Open-Sora, Latte, and CogVideoX.
  • Figure 2: Visualization of attention differences in Open-Sora. (a) Conventional U-shaped error distribution and sudden error spikes; (b) Error accumulation in regions with sudden spikes: the left side employs the EDCW strategy, while the right side uses manually set cache window sizes; (c) Similarity of attention maps in different blocks; (d) Original attention map and sliced attention map following PCAS.
  • Figure 3: Visualization of the cache routine in EDCW. EDCW dynamically adjusts the cache window size and caching strategy based on the error threshold.
  • Figure 4: Visualization of PCAS. PCAS reduces the computational cost of the attention mechanism by pruning redundant dimensions in the query and key matrices.
  • Figure 5: After acquiring the spatial-temporal cache map, the DWS strategy enables dynamic switching between caching and pruning strategies, allowing both processes to operate within a unified framework.
  • ...and 3 more figures
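
The Figure 4 caption describes pruning redundant dimensions in the query and key matrices. The sketch below illustrates that general idea under stated assumptions and is not the authors' implementation. It uses NumPy and an uncentered PCA, computed as a truncated SVD of the stacked query and key rows, to find a shared low-dimensional basis. The function name `pca_slice` and the parameter `keep` are hypothetical. When the features are close to low rank, the attention logits computed from the sliced matrices stay close to the full ones while the inner dimension shrinks.

```python
import numpy as np


def pca_slice(q, k, keep):
    """Project q and k (tokens x dim) onto their top `keep` shared directions.

    Assumption: an uncentered PCA, i.e. the leading right singular vectors
    of the stacked [q; k] matrix, serves as the pruning basis.
    """
    stacked = np.concatenate([q, k], axis=0)
    _, _, vt = np.linalg.svd(stacked, full_matrices=False)
    basis = vt[:keep].T  # shape (dim, keep)
    return q @ basis, k @ basis


# Synthetic example: 16-dimensional features that really live in 4 dimensions.
rng = np.random.default_rng(0)
mix = rng.standard_normal((4, 16))
q = rng.standard_normal((6, 4)) @ mix  # 6 query tokens
k = rng.standard_normal((5, 4)) @ mix  # 5 key tokens

q_small, k_small = pca_slice(q, k, keep=4)

full_logits = q @ k.T                # inner dimension 16
sliced_logits = q_small @ k_small.T  # inner dimension 4
max_abs_diff = np.abs(full_logits - sliced_logits).max()
```

Because the synthetic features have rank 4, keeping four directions reproduces the logits up to floating-point error. On real attention features the agreement would only be approximate and would depend on how quickly the singular values decay.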