DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation
Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
TL;DR
DLFR-VAE introduces a training-free approach to dynamic latent frame-rate control for video generation that exploits content-dependent temporal complexity. It combines a Dynamic Latent Frame Rate Scheduler with a training-free adaptation that turns pretrained VAEs into dynamic VAEs through encoder downsampling and decoder upsampling, enabling variable frame rates across video segments. Empirical results show a reduction of about 50% in latent-token count and 2x–6x speedups in diffusion-step latency, at a modest cost in reconstruction quality, and the method generalizes across different pretrained VAEs and settings. This work offers a practical, plug-and-play way to accelerate video generation, with potential extensions to region-based frame rates and end-to-end training in dynamic latent spaces.
Abstract
In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that applies adaptive temporal compression in latent space. While existing video generative models apply a fixed compression rate via a pretrained VAE, we observe that real-world video content exhibits substantial temporal non-uniformity, with high-motion segments containing more information than static scenes. Based on this insight, DLFR-VAE dynamically adjusts the latent frame rate according to content complexity. Specifically, DLFR-VAE comprises two core innovations: (1) a Dynamic Latent Frame Rate Scheduler that partitions videos into temporal chunks and adaptively determines the optimal frame rate for each chunk based on information-theoretic content complexity, and (2) a training-free adaptation mechanism that transforms pretrained VAE architectures into a dynamic VAE that can process features with variable frame rates. Our simple but effective DLFR-VAE functions as a plug-and-play module that integrates with existing video generation models and accelerates the video generation process.
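To make the scheduler idea concrete, the following is a minimal sketch of chunk-wise frame-rate selection, not the scheduler used in the paper. It assumes a video stored as a NumPy array of shape (T, H, W, C) with values in [0, 1], and it substitutes mean absolute frame difference for the information-theoretic complexity measure mentioned above. The function name, chunk length, thresholds, and temporal strides are all hypothetical choices made for illustration.

```python
import numpy as np


def schedule_latent_frame_rates(video, chunk_len=16,
                                thresholds=(0.02, 0.08),
                                strides=(4, 2, 1)):
    """Assign a temporal downsampling stride to each chunk of a video.

    video: array of shape (T, H, W, C), values assumed to lie in [0, 1].
    Returns a list of (start, end, stride) tuples. A larger stride means
    a lower latent frame rate for that chunk.

    NOTE: mean absolute frame difference is only a stand-in for a real
    content-complexity measure; thresholds and strides are illustrative.
    """
    schedule = []
    for start in range(0, video.shape[0], chunk_len):
        chunk = video[start:start + chunk_len]
        if chunk.shape[0] < 2:
            # A single frame has no temporal change to measure.
            motion = 0.0
        else:
            # Average per-pixel change between consecutive frames.
            motion = float(np.abs(np.diff(chunk, axis=0)).mean())
        # Count how many thresholds the motion score exceeds:
        # 0 -> strides[0] (most compression), 2 -> strides[2] (none).
        level = sum(motion > t for t in thresholds)
        schedule.append((start, start + chunk.shape[0], strides[level]))
    return schedule
```

Under these assumptions, a static chunk is assigned the largest stride (fewest latent frames) and a high-motion chunk keeps the full latent frame rate. A dynamic VAE would then apply the selected stride when downsampling in the encoder and the matching factor when upsampling in the decoder.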
