DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation

Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang

TL;DR

DLFR-VAE is a training-free approach to dynamic latent frame-rate control for video generation that exploits content-dependent temporal complexity. It combines a Dynamic Latent Frame Rate Scheduler with a training-free adaptation that turns a pretrained VAE into a dynamic VAE through encoder downsampling and decoder upsampling, so that different video segments can be encoded at different latent frame rates. Empirical results show a reduction of about 50% in latent-token count and diffusion-step speedups of 2x–6x, at a modest cost in reconstruction quality, and the method generalizes across different pretrained VAEs and settings. The work offers a practical, plug-and-play way to accelerate video generation, with potential extensions to region-based frame rates and end-to-end training in dynamic latent spaces.

Abstract

In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that exploits adaptive temporal compression in latent space. While existing video generative models apply fixed compression rates via pretrained VAEs, we observe that real-world video content exhibits substantial temporal non-uniformity, with high-motion segments containing more information than static scenes. Based on this insight, DLFR-VAE dynamically adjusts the latent frame rate according to content complexity. Specifically, DLFR-VAE comprises two core innovations: (1) a Dynamic Latent Frame Rate Scheduler that partitions videos into temporal chunks and adaptively determines optimal frame rates based on information-theoretic content complexity, and (2) a training-free adaptation mechanism that transforms pretrained VAE architectures into a dynamic VAE that can process features with variable frame rates. Our simple but effective DLFR-VAE can function as a plug-and-play module, seamlessly integrating with existing video generation models and accelerating the video generation process.
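To make the scheduler idea in the abstract concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It partitions a video into fixed-length chunks and assigns each chunk an extra temporal downsampling factor. The complexity measure used here (mean absolute difference between consecutive frames), the thresholds, the chunk length, and the names `motion_score` and `schedule_downsample` are all illustrative assumptions; the paper defines its own information-theoretic complexity measure, which is not reproduced here.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]  # one frame flattened to pixel intensities (assumption)


def motion_score(chunk: Sequence[Frame]) -> float:
    """Mean absolute difference between consecutive frames.

    This is only a stand-in for the paper's content-complexity measure.
    """
    if len(chunk) < 2:
        return 0.0
    total, count = 0.0, 0
    for prev, cur in zip(chunk, chunk[1:]):
        for a, b in zip(prev, cur):
            total += abs(a - b)
            count += 1
    return total / count if count else 0.0


def schedule_downsample(
    video: Sequence[Frame],
    chunk_len: int = 4,
    thresholds: Tuple[float, float] = (0.05, 0.2),  # hypothetical values
) -> List[int]:
    """Return one temporal downsampling factor per chunk.

    Near-static chunks get factor 4, moderate-motion chunks factor 2,
    and high-motion chunks factor 1 (no extra compression).
    """
    low, high = thresholds
    factors = []
    for start in range(0, len(video), chunk_len):
        score = motion_score(video[start:start + chunk_len])
        if score < low:
            factors.append(4)
        elif score < high:
            factors.append(2)
        else:
            factors.append(1)
    return factors


# Example: a static chunk, a slowly changing chunk, and a flickering chunk.
static = [[0.5, 0.5]] * 4
slow = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2], [0.3, 0.3]]
fast = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
print(schedule_downsample(static + slow + fast))
```

In this toy setup, chunks are scored independently, so a factor chosen for one chunk does not affect its neighbours; that mirrors the per-segment scheduling described above, under the stated assumptions.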

Paper Structure

This paper contains 31 sections, 19 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: DLFR-VAE: A training-free approach that accelerates video generation through content-adaptive spatial-temporal compression. This module can seamlessly integrate with existing pretrained video generative models.
  • Figure 2: Analysis of temporal frequency characteristics in both pixel and latent spaces. Key observations: (1) Fast-motion segments exhibit higher temporal frequency content in both domains, while static scenes show concentrated low frequency. (2) The latent space preserves the relative frequency patterns of the original signals, enabling content-adaptive frame rate compression in the latent domain.
  • Figure 3: Architecture overview of the Dynamic Latent Frame Rate (DLFR) VAE. The input video is first divided into segments. The dynamic encoder processes these segments through a series of 3D convolution layers interspersed with dynamic downsample operations (Eq. \ref{eq:encoder} in Sec. \ref{subsec:dflr_vae}), where the execution of each downsample is determined by the schedule (Sec. \ref{subsec:dflr_Scheduler}). The resulting latent representations maintain varying temporal resolutions according to segment complexity (Sec. \ref{subsec:dflr_space}). The dynamic decoder then reconstructs the video through corresponding upsampling operations (Eq. \ref{eq:decoder} in Sec. \ref{subsec:dflr_vae}), restoring the original frame rate while preserving temporal consistency. Each segment can be processed at a different frame rate, enabling content-adaptive temporal compression in latent space.
  • Figure 4: Content complexity experiment on HunyuanVideo VAE. The upper figure illustrates the relationship between content complexity and effective frequency, with $\epsilon=1.8$ used in this experiment. The lower figure demonstrates the alignment between content complexity and reconstruction LPIPS, indicating a strong correlation.
  • Figure 5: Comparison of (a) the original video, (b) the reconstruction using the original HunyuanVideo VAE, and (c) the reconstruction using our proposed DLFR-VAE. The figure illustrates the effectiveness of our dynamic frame rate adaptation in preserving video quality while reducing computational overhead. (d, e) Videos generated in the dynamic latent space using the prompts "Realistic style. A man stands at a quiet bus stop on a sunny afternoon. Then, a bright yellow bus approaches." and "A woman strolls into a café and approaches a wooden table. She picks up a newspaper and starts reading it."
  • ...and 3 more figures
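
The encode/decode flow described in the Figure 3 caption (per-segment dynamic downsampling in the encoder, matching upsampling in the decoder) can be illustrated with a small, self-contained sketch. It is a toy under stated assumptions, not the paper's architecture: frames are plain lists of floats standing in for latent features, downsampling is temporal average pooling, upsampling is nearest-neighbour repetition, and the names `temporal_downsample`, `temporal_upsample`, and `roundtrip` are hypothetical. The actual operators are the ones given by the paper's encoder and decoder equations, applied inside the pretrained VAE's 3D convolution stacks.

```python
from typing import List, Sequence, Tuple

Frame = List[float]


def temporal_downsample(frames: Sequence[Frame], factor: int) -> List[Frame]:
    """Average every `factor` consecutive frames (assumed pooling operator)."""
    if factor < 1 or len(frames) % factor != 0:
        raise ValueError("chunk length must be a multiple of the factor")
    out = []
    for i in range(0, len(frames), factor):
        group = frames[i:i + factor]
        out.append([sum(vals) / factor for vals in zip(*group)])
    return out


def temporal_upsample(frames: Sequence[Frame], factor: int) -> List[Frame]:
    """Repeat each frame `factor` times (assumed nearest-neighbour operator)."""
    return [list(frame) for frame in frames for _ in range(factor)]


def roundtrip(
    chunks: Sequence[Sequence[Frame]], factors: Sequence[int]
) -> Tuple[List[Frame], int]:
    """Compress each chunk at its own rate, then restore the original length.

    Returns the reconstructed frame sequence and the total number of
    latent frames that a downstream generator would have to process.
    """
    recon: List[Frame] = []
    latent_frames = 0
    for chunk, factor in zip(chunks, factors):
        latent = temporal_downsample(chunk, factor)
        latent_frames += len(latent)
        recon.extend(temporal_upsample(latent, factor))
    return recon, latent_frames


# Three 4-frame chunks compressed by factors 4, 2, and 1 respectively.
chunks = [
    [[1.0]] * 4,                   # static: one latent frame suffices
    [[0.0], [2.0], [4.0], [6.0]],  # slow change: two latent frames
    [[0.0], [1.0], [0.0], [1.0]],  # fast change: kept at full rate
]
recon, latent_frames = roundtrip(chunks, [4, 2, 1])
print(len(recon), latent_frames)
```

Here the 12 input frames are represented by 7 latent frames while the output keeps the original length; the static chunk is reconstructed exactly, whereas the slowly changing chunk loses temporal detail, which is the trade-off a scheduler is meant to control.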