FlashVideo: A Framework for Swift Inference in Text-to-Video Generation

Bin Lei, Le Chen, Caiwen Ding

TL;DR

FlashVideo introduces a RetNet-based framework for fast text-to-video generation. It adapts RetNet's parallel and recurrent representations to video, which enables $\mathcal{O}(L)$ inference for a sequence of length $L$, and it adds a Serial Number token to resolve inter-frame attention under relative positional encoding. A redundant-free frame interpolation method further accelerates frame generation by selectively interpolating only essential regions. In experiments on UCF-101, Kinetics-600, and BAIR, FlashVideo achieves a $\sim$9.17-fold efficiency gain over autoregressive transformers and competitive video quality (FVD, LPIPS), with inference speed of the same order of magnitude as BERT-based transformers. The work demonstrates that RetNet can be effectively repurposed for video generation, offering substantial speedups over diffusion and autoregressive approaches while maintaining high-quality outputs, which makes text-to-video applications more practical.

Abstract

In the evolving field of machine learning, video generation has witnessed significant advancements with autoregressive-based transformer models and diffusion models, known for synthesizing dynamic and realistic scenes. However, these models often face challenges with prolonged inference times, even for generating short video clips such as GIFs. This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation. FlashVideo represents the first successful adaptation of the RetNet architecture for video generation, bringing a unique approach to the field. Leveraging the RetNet-based architecture, FlashVideo reduces the time complexity of inference from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$, significantly accelerating inference speed. Additionally, we adopt a redundant-free frame interpolation method, enhancing the efficiency of frame interpolation. Our comprehensive experiments demonstrate that FlashVideo achieves a $\times9.17$ efficiency improvement over a traditional autoregressive-based transformer model, and its inference speed is of the same order of magnitude as that of BERT-based transformer models.
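
The complexity claim in the abstract rests on a property of retention described in the RetNet paper (Sun et al., 2023): the same output can be computed in a parallel form, used for training, and in a recurrent form, used for inference. The following is a minimal single-head NumPy sketch of that equivalence, not FlashVideo's implementation. It assumes a scalar decay `gamma` and omits RetNet's rotary position terms, multi-scale heads, and normalization; the function names are hypothetical.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V, with D[n, m] = gamma**(n - m) for n >= m, else 0."""
    L = Q.shape[0]
    idx = np.arange(L)
    diff = idx[:, None] - idx[None, :]
    # Causal decay mask; np.maximum avoids raising gamma to a negative power.
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return ((Q @ K.T) * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, output_n = Q_n S_n."""
    L, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))      # fixed-size state, independent of position n
    out = np.empty((L, d_v))
    for n in range(L):
        S = gamma * S + np.outer(K[n], V[n])
        out[n] = Q[n] @ S         # per-token cost does not grow with n
    return out

# Illustrative sizes only; these are not the paper's settings.
rng = np.random.default_rng(0)
L, d = 16, 8
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
out_par = retention_parallel(Q, K, V, gamma=0.9)
out_rec = retention_recurrent(Q, K, V, gamma=0.9)
```

In the recurrent form each new token updates a state of fixed size, so generating $L$ tokens costs $\mathcal{O}(L)$ in the sequence length. An autoregressive transformer instead attends over all previous tokens at each step, which gives $\mathcal{O}(L^2)$ in total.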

Paper Structure (18 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Overview of FlashVideo's Video Generation (a) Efficiency, (b) Comparison of the vision token generation methods between Autoregressive Models and FlashVideo, and (c) Quality. (a) compares the relative time taken to generate a single frame by various methods. In (b), we illustrate the reasons behind the increased efficiency of our method compared to the painful slowness of autoregressive-based transformers. (c) displays some of the frames generated by our model, showcasing the quality of the video output.
  • Figure 2: Model Overview. Pal. RetNet: RetNet Decoder Parallel Representation; Rec. RetNet: RetNet Decoder Recurrent Representation; RMS: Root Mean Square Normalization; GLU: Gated Linear Unit activation function; FC: Fully connected layer; $\oplus$: Residual connection; $N$: Number of decoders; : Input and output for the key frames generation tasks; : Input and output for the frames interpolation tasks. The illustration of the RetNet decoder is inspired by the original RetNet paper (Sun et al., 2023).
  • Figure 3: The specific handling of the input text and serial number tokens during key steps in the video generation process. Frames with the same color border represent the same frame.
  • Figure 4: The Different Regions We Divide During the Interpolation Process. Red patches indicate the Different Tokens, orange patches denote regions of Unstable Tokens, and green sections represent Inheritable Tokens.
  • Figure 5: Qualitative evaluation. We juxtapose the key frames generated by FlashVideo (top row of each set) with the corresponding ground truth (bottom row of each set). For each category, the initial input comprised the class label and the first 5 frames of the original video. (a) class label: Typing, (b) class label: Tai Chi, (c) class label: Lunges, (d) class label: Bending metal.
  • ...and 1 more figures