F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis

Sitong Su, Jianzhi Liu, Lianli Gao, Jingkuan Song

TL;DR

This work proposes F3-Pruning, a training-free and generalized pruning strategy that prunes redundant temporal attention weights in two mainstream kinds of T2V models: transformer-based and diffusion-based.

Abstract

Recently, Text-to-Video (T2V) synthesis has undergone a breakthrough through training transformers or diffusion models on large-scale datasets. Nevertheless, running inference with such large models incurs huge costs. Previous inference acceleration works either require costly retraining or are model-specific. To address this issue, instead of retraining, we explore the inference process of two mainstream T2V models, one based on transformers and one on diffusion models. The exploration reveals redundancy in the temporal attention modules of both models, which are commonly used to establish temporal relations among frames. Consequently, we propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights. Specifically, when aggregate temporal attention values are ranked below a certain ratio, the corresponding weights are pruned. Extensive experiments on three datasets using a classic transformer-based model, CogVideo, and a typical diffusion-based model, Tune-A-Video, verify the effectiveness of F3-Pruning in inference acceleration, quality assurance, and broad applicability.
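The ranking rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy attention values, and the choice to treat each layer or denoising timestep as one "unit" that is either kept or pruned as a whole are assumptions made for illustration. The sketch sums the temporal attention values of each unit into an aggregate score, ranks the units, and marks the lowest-ranked fraction (a hypothetical `alpha`) for pruning.

```python
import numpy as np


def aggregate_attention_scores(ta_values):
    """Sum the temporal-attention values of each unit (layer or timestep).

    `ta_values` is a list of arrays, one per unit. The array shape is an
    assumption; only the sum over all entries is used.
    """
    return np.array([float(np.sum(v)) for v in ta_values])


def select_units_to_prune(scores, alpha):
    """Return a boolean mask that is True for units ranked in the lowest
    `alpha` fraction by aggregate score (hypothetical selection rule)."""
    num_prune = int(np.floor(alpha * len(scores)))
    order = np.argsort(scores, kind="stable")  # ascending: lowest scores first
    mask = np.zeros(len(scores), dtype=bool)
    mask[order[:num_prune]] = True
    return mask


# Toy data: four units whose temporal attention shrinks over inference.
ta_values = [np.full((4, 4), s) for s in (0.9, 0.5, 0.2, 0.05)]
scores = aggregate_attention_scores(ta_values)
prune_mask = select_units_to_prune(scores, alpha=0.5)

for unit, (score, pruned) in enumerate(zip(scores, prune_mask)):
    action = "prune temporal attention" if pruned else "keep"
    print(f"unit {unit}: score={score:.1f} -> {action}")
```

With these toy values, the two units with the smallest aggregate scores are marked for pruning. How a marked unit is actually skipped inside CogVideo or Tune-A-Video is not specified in this excerpt and is left out of the sketch.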

Paper Structure (16 sections, 4 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Visual comparison of Text-to-Video (T2V) synthesis results. Our F$^3$-Pruning is applied to the classic transformer-based method CogVideo and the typical diffusion-based method Tune-A-Video. Without any extra training, F$^3$-Pruning not only boosts the inference efficiency of T2V but also enhances video quality. On the public video dataset UCF-101, applying F$^3$-Pruning to CogVideo makes it 1.35x faster and improves the video quality metric FVD by 22%.
  • Figure 2: Overview of our proposed F$^3$-Pruning. In a), we show three attention modules, Cross-model Attention (CA), Self Attention (SA) and Temporal Attention (TA), which are commonly used in T2V to model text-visual alignment, visual quality within each frame, and temporal coherence among frames, respectively. In b), we demonstrate the schedule of F$^3$-Pruning applied to transformer-based and diffusion-based methods. TA weights are pruned in those network layers or denoising timesteps whose summed TA values, called the Aggregate Attention Score (AAS), rank below a pruning ratio $\alpha$.
  • Figure 3: Top: Relation between the Aggregate Attention Score (AAS) and network layers or denoising timesteps. AAS declines as the inference steps progress. Bottom: Attention visualization. The diagonal line represents SA, and the upper and lower triangles represent TA. In particular, the leftmost bright line in CogVideo represents CA. As seen, attention values are sparsely distributed.
  • Figure 4: Examples generated by five pruning methods applied to CogVideo. As demonstrated, F$^3$-Pruning performs best in coherence and text comprehension.
  • Figure 5: Examples generated by two pruning methods applied to Tune-A-Video (TAV) on the LONGTEXT dataset. As demonstrated, F$^3$-Pruning performs best, especially in object details.
  • ...and 5 more figures