UniCP: A Unified Caching and Pruning Framework for Efficient Video Generation
Wenzhang Sun, Qirui Hou, Donglin Di, Jiahui Yang, Yongjia Ma, Jianxun Cui
TL;DR
This work addresses the high computational cost of attention in diffusion-transformer-based video generation by introducing UniCP, a unified framework that combines caching and pruning across temporal and spatial dimensions. The approach comprises three components: Error-Aware Dynamic Cache Window (EDCW) to adapt cache windows to error signals, PCA-based Slicing (PCAS) to prune redundant attention components in uncachable steps, and Dynamic Weight Shift (DWS) to integrate caching and pruning seamlessly. Empirically, UniCP delivers up to 1.6× speedups on a single GPU across multiple DiT-based video models while largely preserving perceptual and structural video quality. The methods enable more practical, efficient deployment of high-fidelity video generation, with strong potential for real-time or resource-constrained applications.
Abstract
Diffusion Transformers (DiT) excel in video generation but encounter significant computational challenges due to the quadratic complexity of attention. Notably, attention differences between adjacent diffusion steps follow a U-shaped pattern. Current methods leverage this property by caching attention blocks; however, they still struggle with sudden error spikes and large discrepancies. To address these issues, we propose UniCP, a unified caching and pruning framework for efficient video generation. UniCP optimizes both temporal and spatial dimensions through two mechanisms. (1) Error-Aware Dynamic Cache Window (EDCW): dynamically adjusts cache window sizes for different blocks at various timesteps, adapting to abrupt error changes. (2) PCA-based Slicing (PCAS) and Dynamic Weight Shift (DWS): PCAS prunes redundant attention components, and DWS integrates caching and pruning by enabling dynamic switching between pruned and cached outputs. By adjusting cache windows and pruning redundant components, UniCP enhances computational efficiency and maintains video detail fidelity. Experimental results show that UniCP outperforms existing methods in both performance and efficiency.
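To make the idea of an error-aware cache window concrete, the following is a minimal sketch, not the EDCW algorithm itself. It assumes a hypothetical per-step error signal (an estimate of how much a block's attention output changes between adjacent diffusion steps) and a hypothetical threshold. The function name `plan_cache_windows` and the accumulate-then-reset rule are illustrative assumptions that do not come from the abstract.

```python
def plan_cache_windows(step_errors, threshold):
    """Decide, per diffusion step, whether to recompute attention or reuse the cache.

    ASSUMPTION: step_errors[i] is a hypothetical estimate of the change in a
    block's attention output between step i-1 and step i. The rule below
    (accumulate error, recompute once it exceeds `threshold`) is an
    illustrative stand-in, not the rule used by UniCP.

    Returns a list of booleans: True = recompute, False = reuse cached output.
    """
    plan = []
    accumulated = 0.0
    for i, err in enumerate(step_errors):
        if i == 0:
            # Nothing is cached yet, so the first step is always computed.
            plan.append(True)
            continue
        accumulated += err
        if accumulated > threshold:
            # Drift since the last refresh is too large: recompute and
            # start a new cache window.
            plan.append(True)
            accumulated = 0.0
        else:
            # Drift is still small: extend the current cache window.
            plan.append(False)
    return plan


# A U-shaped error profile: large changes at the start and end of sampling,
# small changes in the middle (values are made up for illustration).
u_shaped_errors = [0.0, 0.5, 0.2, 0.05, 0.05, 0.05, 0.2, 0.5]
plan = plan_cache_windows(u_shaped_errors, threshold=0.4)
print(plan)
```

Under this toy profile, the cache window stretches across the low-error middle steps and closes as soon as the error rises again, which is the behavior a fixed-size window cannot provide when errors spike abruptly.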
