VidTwin: Video VAE with Decoupled Structure and Dynamics

Yuchi Wang, Junliang Guo, Xinyi Xie, Tianyu He, Xu Sun, Jiang Bian

TL;DR

VidTwin is a video autoencoder that splits the latent representation into a Structure Latent, which captures overall content and global movement, and a Dynamics Latent, which captures fine-grained details and rapid movements. This yields a compression rate of $0.20\%$ while keeping reconstruction quality high (PSNR of $28.14$ on MCL-JCV). The method uses a Spatial-Temporal Transformer backbone with a Q-Former-based Structure Latent extractor and a spatial-averaging Dynamics Latent path; the decoding module aligns both latents to a common dimension and combines them before the Decoder. The decoupled design supports downstream generative tasks, and ablations and cross-latent analyses indicate that it is scalable and explainable. The compact latent space also works with diffusion models and reduces the resources needed for downstream video generation.

Abstract

Recent advancements in video autoencoders (Video AEs) have significantly improved the quality and efficiency of video generation. In this paper, we propose a novel and compact video autoencoder, VidTwin, that decouples video into two distinct latent spaces: Structure latent vectors, which capture overall content and global movement, and Dynamics latent vectors, which represent fine-grained details and rapid movements. Specifically, our approach leverages an Encoder-Decoder backbone, augmented with two submodules for extracting these latent spaces, respectively. The first submodule employs a Q-Former to extract low-frequency motion trends, followed by downsampling blocks to remove redundant content details. The second averages the latent vectors along the spatial dimension to capture rapid motion. Extensive experiments show that VidTwin achieves a high compression rate of 0.20% with high reconstruction quality (PSNR of 28.14 on the MCL-JCV dataset), and performs efficiently and effectively in downstream generative tasks. Moreover, our model demonstrates explainability and scalability, paving the way for future research in video latent representation and generation. Check our project page for more details: https://vidtwin.github.io/.
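
The two extraction paths described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor layout `(T, H, W, C)`, the single-head attention without learned projections, and the use of average pooling in place of the convolutional downsampling blocks are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def structure_branch(z, queries, pool=2):
    """Q-Former-style sketch (assumed form): n_q queries cross-attend over the
    T frames at every spatial location, then the result is pooled spatially.

    z: (T, H, W, C), queries: (n_q, C) -> (n_q, H // pool, W // pool, C)
    """
    T, H, W, C = z.shape
    tokens = z.transpose(1, 2, 0, 3)                       # (H, W, T, C)
    scores = np.einsum("qc,hwtc->hwqt", queries, tokens) / np.sqrt(C)
    attn = softmax(scores, axis=-1)                        # weights over frames
    out = np.einsum("hwqt,hwtc->qhwc", attn, tokens)       # (n_q, H, W, C)
    n_q = out.shape[0]
    # Average pooling stands in for the downsampling blocks (assumption).
    out = out.reshape(n_q, H // pool, pool, W // pool, pool, C)
    return out.mean(axis=(2, 4))

def dynamics_branch(z):
    """Average over the spatial dimensions: (T, H, W, C) -> (T, C)."""
    return z.mean(axis=(1, 2))

rng = np.random.default_rng(0)
T, H, W, C, n_q = 8, 4, 4, 16, 2           # hypothetical sizes
z = rng.standard_normal((T, H, W, C))      # stand-in for the Encoder output
queries = rng.standard_normal((n_q, C))    # stand-in for learned queries

z_s = structure_branch(z, queries)
z_d = dynamics_branch(z)
print(z_s.shape, z_d.shape)  # (2, 2, 2, 16) (8, 16)
```

In this sketch the Structure path shortens the temporal axis (from `T` frames to `n_q` query outputs) and coarsens the spatial grid, while the Dynamics path keeps every frame but discards spatial layout, which mirrors the division of labor the abstract describes.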

Paper Structure

This paper contains 49 sections, 18 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: An example illustrating the Structure and Dynamics latents. We select two frames, $t_1$ and $t_2$, and show the original and reconstructed video frames, labeled Orig. and Recon., respectively. S. Recon. and D. Recon. refer to the reconstructed frames decoded using only the corresponding Structure or Dynamics latents. The Structure latent captures the main semantic content and overall motion trends, while the Dynamics latent encodes local details and rapid movements.
  • Figure 2: Details of our model. After obtaining the latent $z$ from the Encoder, the process branches into two flows. The Structure Latent extraction module, $\mathcal{F}_{\boldsymbol{S}}$, which consists of a Q-Former and convolutional networks, extracts the Structure Latent component $z_{\boldsymbol{S}}$. The Dynamics Latent extraction module, $\mathcal{F}_{\boldsymbol{D}}$, comprising convolutional networks and an averaging operator, extracts the Dynamics Latent component $z_{\boldsymbol{D}}$. Finally, using the decoding module, we align all latents to the same dimension and combine them before passing them into the Decoder.
  • Figure 3: Qualitative comparison with baseline methods. Two examples are presented: a gradually rotating photo and a fast-motion boxing scene. VidTwin demonstrates the ability to reconstruct fine details and accurately capture rapid motion.
  • Figure 4: An illustration of a cross-replacement example, where Video C is generated using the Structure Latent from Video A and the Dynamics Latent from Video B.
  • Figure 5: We present the FLOPs and training memory costs of the unified generative model, as applied to our model and the baselines.
  • ...and 4 more figures
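
The captions of Figures 2 and 4 describe aligning both latents to the same dimension, combining them, and swapping latents between videos. The sketch below shows one way this could look. It is an assumption-laden illustration: nearest-neighbor upsampling, broadcasting, and element-wise addition are stand-ins chosen here, and the names `align_structure`, `align_dynamics`, and `combine` are hypothetical. The actual decoding module may align and combine the latents differently.

```python
import numpy as np

def align_structure(z_s, T, H, W):
    """Nearest-neighbor upsample (n_q, h, w, C) -> (T, H, W, C) (assumed scheme)."""
    n_q, h, w, C = z_s.shape
    out = np.repeat(z_s, T // n_q, axis=0)
    out = np.repeat(out, H // h, axis=1)
    return np.repeat(out, W // w, axis=2)

def align_dynamics(z_d, H, W):
    """Broadcast per-frame vectors (T, C) over the spatial grid -> (T, H, W, C)."""
    T, C = z_d.shape
    return np.broadcast_to(z_d[:, None, None, :], (T, H, W, C))

def combine(z_s, z_d, H, W):
    """Align both latents to one shape and add them (addition is an assumption)."""
    T = z_d.shape[0]
    return align_structure(z_s, T, H, W) + align_dynamics(z_d, H, W)

rng = np.random.default_rng(0)
T, H, W, C = 8, 4, 4, 16                       # hypothetical sizes
# Random stand-ins for the latents of two videos, A and B.
z_s_a, z_d_a = rng.standard_normal((2, 2, 2, C)), rng.standard_normal((T, C))
z_s_b, z_d_b = rng.standard_normal((2, 2, 2, C)), rng.standard_normal((T, C))

# Cross-replacement: Structure Latent from A, Dynamics Latent from B.
decoder_input_c = combine(z_s_a, z_d_b, H, W)
print(decoder_input_c.shape)  # (8, 4, 4, 16)
```

Passing `decoder_input_c` to the Decoder would correspond to Video C in Figure 4; the Decoder itself is omitted here.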