FlashVideo: A Framework for Swift Inference in Text-to-Video Generation
Bin Lei, Le Chen, Caiwen Ding
TL;DR
FlashVideo is a RetNet-based framework for fast text-to-video generation. It adapts RetNet's parallel and recurrent processing modes to video, which lowers inference time complexity to $\mathcal{O}(L)$ for a sequence of length $L$, and it introduces a Serial Number token to handle inter-frame attention under relative positional encoding. A redundant-free frame interpolation method further accelerates frame generation by interpolating only the essential regions. In experiments on UCF-101, Kinetics-600, and BAIR, FlashVideo achieves a $9.17\times$ efficiency gain over a traditional autoregressive transformer and competitive video quality (FVD, LPIPS), with inference speed of the same order of magnitude as BERT-based transformers. The work shows that RetNet can be repurposed for video generation, offering substantial speedups over diffusion and autoregressive approaches while maintaining output quality, which makes text-to-video applications more practical.
Abstract
In the evolving field of machine learning, video generation has advanced significantly with autoregressive-based transformer models and diffusion models, both known for synthesizing dynamic and realistic scenes. However, these models often suffer from prolonged inference times, even when generating short video clips such as GIFs. This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation. FlashVideo represents the first successful adaptation of the RetNet architecture for video generation. Leveraging the RetNet-based architecture, FlashVideo reduces the time complexity of inference from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$, significantly accelerating inference. Additionally, we adopt a redundant-free frame interpolation method that improves the efficiency of frame interpolation. Our comprehensive experiments demonstrate that FlashVideo achieves a $9.17\times$ efficiency improvement over a traditional autoregressive-based transformer model, and that its inference speed is of the same order of magnitude as that of BERT-based transformer models.
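
The $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ reduction comes from the retention mechanism of RetNet, which can be evaluated either in a parallel form or in an equivalent recurrent form that carries a fixed-size state from token to token. The following is a minimal NumPy sketch of generic single-head retention, not FlashVideo's implementation: it assumes a single scalar decay `gamma` and omits the positional rotation, normalization, multi-scale heads, and the Serial Number token. The function names and the toy shapes are illustrative assumptions.

```python
import numpy as np


def retention_parallel(Q, K, V, gamma):
    """Parallel form: (Q K^T * D) V, where D[n, m] = gamma**(n - m) for n >= m, else 0.

    Builds an L x L decay matrix, so cost grows quadratically with sequence length L.
    """
    L = Q.shape[0]
    idx = np.arange(L)
    diff = idx[:, None] - idx[None, :]  # n - m for every (query, key) pair
    # Clamp the exponent at 0 before masking so future positions never overflow.
    D = np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)
    return (Q @ K.T * D) @ V


def retention_recurrent(Q, K, V, gamma):
    """Recurrent form: S_n = gamma * S_{n-1} + K_n^T V_n, output_n = Q_n S_n.

    The state S has a fixed size (d_k x d_v), so each new token costs the same
    amount of work regardless of how many tokens came before it.
    """
    L, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((L, d_v))
    for n in range(L):
        S = gamma * S + np.outer(K[n], V[n])  # decay old state, add the new token
        out[n] = Q[n] @ S
    return out
```

Both functions return the same outputs for the same inputs. The recurrent form does constant work per generated token, so generating $L$ tokens costs $\mathcal{O}(L)$, whereas an autoregressive transformer attends over all previous tokens at every step, giving $\mathcal{O}(L^2)$ in total.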
