Training-free Latent Inter-Frame Pruning with Attention Recovery

Dennis Menn, Yuedong Yang, Bokun Wang, Xiwen Wei, Mustafa Munir, Feng Liang, Radu Marculescu, Chenfeng Xu, Diana Marculescu

TL;DR

We propose Latent Inter-frame Pruning with Attention Recovery (LIPAR), a training-free framework that detects duplicated video latent patches and skips recomputing them. A novel Attention Recovery mechanism approximates the attention values of pruned tokens, removing the visual artifacts that arise when pruning is applied naively.

Abstract

Current video generation models suffer from high computational latency, making real-time applications prohibitively costly. In this paper, we address this limitation by exploiting the temporal redundancy inherent in video latent patches. To this end, we propose the Latent Inter-frame Pruning with Attention Recovery (LIPAR) framework, which detects and skips recomputing duplicated latent patches. Additionally, we introduce a novel Attention Recovery mechanism that approximates the attention values of pruned tokens, thereby removing visual artifacts arising from naively applying the pruning method. Empirically, our method increases video editing throughput by $1.45\times$, on average achieving 12.2 FPS on an NVIDIA A6000 compared to the baseline 8.4 FPS. The proposed method does not compromise generation quality and can be seamlessly integrated with the model without additional training. Our approach effectively bridges the gap between traditional compression algorithms and modern generative pipelines.
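
To make the idea of "detecting and skipping duplicated latent patches" concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names `prune_mask` and `restore`, the mean-absolute-difference criterion, and the threshold `tau` are all assumptions chosen for illustration, and the abstract does not specify how duplicates are detected.

```python
import numpy as np

def prune_mask(latents, tau=1e-2):
    """Mark which latent patches need recomputation.

    latents: array of shape (T, N, D) = frames, patches per frame, channels.
    Returns a boolean array of shape (T, N); True means "keep (recompute)",
    False means "prune (treated as a duplicate of the previous frame)".
    The difference metric and tau are hypothetical, not from the paper.
    """
    T, N, _ = latents.shape
    keep = np.ones((T, N), dtype=bool)  # the first frame is always kept
    # Mean absolute change of each patch relative to the same position
    # in the previous frame.
    diff = np.abs(latents[1:] - latents[:-1]).mean(axis=-1)
    keep[1:] = diff > tau
    return keep

def restore(computed, keep):
    """Fill pruned positions with the output of the previous frame.

    computed: array of shape (T, N, D); values at pruned positions are ignored.
    Frames are processed in order, so a run of pruned patches inherits the
    most recent kept output at that patch position.
    """
    out = computed.copy()
    for t in range(1, out.shape[0]):
        pruned = ~keep[t]
        out[t, pruned] = out[t - 1, pruned]
    return out
```

In this sketch, the model would only be run on positions where `keep` is `True`, and `restore` would rebuild the full latent video afterwards. Reusing outputs this way is exactly the naive pruning the abstract says causes artifacts, which is what the Attention Recovery mechanism is meant to address.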

Paper Structure

This paper contains 36 sections, 20 equations, 13 figures, 2 tables, 2 algorithms.

Figures (13)

  • Figure 1: Quantitative comparison with other training-free pruning methods grouped by prune rate. Best results are highlighted in bold.
  • Figure 2: Decoding Compressed Latents. Original: the video latents are decoded directly; Compressed: the (nearly) unchanged latent patches are compressed before decoding.
  • Figure 3: Illustration of the approximation of pruned tokens to the unpruned token sequence. Dashed circles indicate pruned tokens, where $x_1 \approx x_2 \approx x_3$ and $x_4 \approx x_5$.
  • Figure 4: LIPAR overview. The proposed method consists of three stages: (1) Pruning, (2) Attention Recovery, and (3) Restoration.
  • Figure 5: Illustration of the Attention Recovery Method. This method preserves visual quality in pruned tokens via two mechanisms: M-Degree Approximation and Noise-Aware Duplication. Pruned keys ($k$) and values ($v$) are approximated by copying temporal counterparts from the clean KV-cache (e.g., $t-1$) to maintain the i.i.d. noise assumption, ensuring the $m$ closest tokens to the query remain populated. For simplicity, we only explicitly draw the Noise-Aware Duplication for $k$.
  • ...and 8 more figures
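
The captions of Figures 3 and 5 describe approximating pruned keys and values by copying their counterparts. The toy NumPy sketch below illustrates only that general idea, under the simplifying assumption that pruned tokens are exact duplicates of kept ones. It does not implement the paper's M-Degree Approximation or Noise-Aware Duplication, and the names `attention`, `recover_kv`, and `source_index` are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Plain scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def recover_kv(k_kept, v_kept, source_index):
    """Rebuild a full-length key/value sequence from the kept tokens.

    source_index[i] is the index (within the kept set) of the token that
    position i duplicates; kept tokens point to themselves.
    """
    return k_kept[source_index], v_kept[source_index]

# Figure 3's pattern: x1 ~ x2 ~ x3 and x4 ~ x5, so only two tokens are kept.
q = np.array([[1.0, 0.0]])
k_kept = np.array([[1.0, 0.0], [0.0, 1.0]])
v_kept = np.array([[1.0, 0.0], [0.0, 1.0]])
source_index = np.array([0, 0, 0, 1, 1])

k_rec, v_rec = recover_kv(k_kept, v_kept, source_index)
naive = attention(q, k_kept, v_kept)    # pruned tokens simply dropped
recovered = attention(q, k_rec, v_rec)  # pruned tokens filled by copying
```

Dropping the pruned tokens changes the softmax normalization, because each group of duplicates then counts once instead of three or two times, so `naive` differs from attention over the full sequence. Filling the pruned positions by copying restores the original weighting when the duplicates are exact, and approximates it when they are only nearly equal.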