DeCo-VAE: Learning Compact Latents for Video Reconstruction via Decoupled Representation

Xiangchen Yin, Jiahui Yuan, Zhangchi Hu, Wenzhang Sun, Jie Chen, Xiaozhen Qiao, Hao Li, Xiaoyan Sun

TL;DR

DeCo-VAE addresses redundancy in video VAEs by decoupling content into keyframe, motion, and residual components with dedicated encoders and a shared 3D decoder. It introduces a decoupled adaptation training strategy to stabilize learning and promote accurate static and dynamic feature learning. Across WebVid and Kinetics-400, it achieves superior reconstruction with a lightweight latent representation and yields strong downstream generation performance when integrated with Latte diffusion. The design provides interpretable latent factors and practical efficiency, though long-video sequences remain a challenge, motivating future work on multi-keyframe decoupling and local temporal refinement.

Abstract

Existing video Variational Autoencoders (VAEs) generally overlook the similarity between frame contents, leading to redundant latent modeling. In this paper, we propose a decoupled VAE (DeCo-VAE) to achieve a compact latent representation. Instead of encoding RGB pixels directly, we decompose video content into three distinct components via explicit decoupling (keyframe, motion, and residual) and learn a dedicated latent representation for each. To avoid cross-component interference, we design a dedicated encoder for each decoupled component and adopt a shared 3D decoder to maintain spatiotemporal consistency during reconstruction. We further use a decoupled adaptation strategy that freezes some encoders while training the others sequentially, which stabilizes training and supports accurate learning of both static and dynamic features. Extensive quantitative and qualitative experiments demonstrate that DeCo-VAE achieves superior video reconstruction performance.

Paper Structure

This paper contains 21 sections, 9 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Visualization of decoupled components in DeCo-VAE, including keyframe, motion and residual components for video frames. (b) t-SNE visualization of latent distributions in video decoupling; DeCo-VAE achieves a more compact latent space. (c) Performance comparison of video VAEs; DeCo-VAE achieves superior reconstruction quality with lightweight parameters.
  • Figure 2: Overview of the proposed DeCo-VAE. (a) The DeCo-VAE pipeline decomposes video sequences into keyframe, motion, and residual components, which are processed by dedicated encoders and a shared 3D decoder. (b) With the keyframe as reference, the subsequent frames and the keyframe are fed to a motion module to obtain motion components; motion compensation then generates predicted frames, and residuals are obtained by subtracting predicted frames from keyframe. (c) The decoupled adaptation strategy stabilizes training and enhances temporal consistency.
  • Figure 3: Visualization of decoupled components and their VAE reconstructions. We show the original video frames, the raw decoupled components (residual, motion), and their reconstructions by DeCo-VAE. The close alignment confirms the model’s ability to precisely reconstruct distinct decoupled features.
  • Figure 4: Video reconstruction results of different methods. We compare the original video with the outputs of VidTwin, CV-VAE, LeanVAE, and our DeCo-VAE across three video sequences. Our method achieves superior reconstruction aligned with the original.
  • Figure 5: Visualization of keyframes and their VAE reconstructions. Our keyframe encoder achieves a good latent representation and reconstructs clear keyframes for recoupling the components.
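
The keyframe/motion/residual decoupling described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names (`estimate_shift`, `decouple`, `recouple`) are hypothetical, and motion is modeled as a single global integer shift per frame, a stand-in for whatever motion module the paper uses. The sketch also assumes each residual is the actual frame minus its motion-compensated prediction, a common predictive-coding convention that may not match the paper's.

```python
import numpy as np


def estimate_shift(keyframe, frame, max_shift=2):
    """Brute-force search for the integer (dy, dx) circular shift of the
    keyframe that best matches `frame` (assumed toy motion model)."""
    best, best_err = (0, 0), np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(keyframe, (dy, dx), axis=(0, 1))
            err = np.abs(shifted - frame).sum()
            if err < best_err:
                best, best_err = (dy, dx), err
    return best


def decouple(video, max_shift=2):
    """Split a (T, H, W) clip with T >= 2 into keyframe, motions, residuals."""
    keyframe = video[0]  # first frame serves as the reference
    motions, residuals = [], []
    for frame in video[1:]:
        dy, dx = estimate_shift(keyframe, frame, max_shift)
        # Motion compensation: warp the keyframe by the estimated motion.
        predicted = np.roll(keyframe, (dy, dx), axis=(0, 1))
        motions.append((dy, dx))
        # Assumed convention: residual = actual frame - predicted frame.
        residuals.append(frame - predicted)
    return keyframe, motions, np.stack(residuals)


def recouple(keyframe, motions, residuals):
    """Invert `decouple`: predicted frame plus residual for each step."""
    frames = [keyframe]
    for (dy, dx), residual in zip(motions, residuals):
        frames.append(np.roll(keyframe, (dy, dx), axis=(0, 1)) + residual)
    return np.stack(frames)
```

Calling `decouple` on a clip and passing its three outputs to `recouple` reproduces the clip, because each residual stores exactly what the motion-compensated prediction misses. When consecutive frames are similar, the residuals are close to zero, which is the redundancy that a decoupled representation is meant to exploit.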