Adaptive 1D Video Diffusion Autoencoder

Yao Teng, Minxuan Lin, Xian Liu, Shuai Wang, Xiao Yang, Xihui Liu

TL;DR

One-Dimensional Diffusion Video Autoencoder (One-DVA) addresses adaptive video compression by pairing a transformer-based encoder that produces variable-length 1D latents with a pixel-space diffusion decoder. A two-stage training strategy, followed by latent-distribution regularization and decoder fine-tuning, gives reconstruction quality comparable to 3D-CNN VAEs at identical compression ratios and supports downstream latent diffusion models for class-to-video and text-to-video generation. Because the latent length is adaptive, the model can also operate at higher compression ratios than fixed-rate autoencoders.

Abstract

Recent video generation models largely rely on video autoencoders that compress pixel-space videos into latent representations. However, existing video autoencoders suffer from three major limitations: (1) fixed-rate compression that wastes tokens on simple videos, (2) inflexible CNN architectures that prevent variable-length latent modeling, and (3) deterministic decoders that struggle to recover appropriate details from compressed latents. To address these issues, we propose One-Dimensional Diffusion Video Autoencoder (One-DVA), a transformer-based framework for adaptive 1D encoding and diffusion-based decoding. The encoder employs query-based vision transformers to extract spatiotemporal features and produce latent representations, while a variable-length dropout mechanism dynamically adjusts the latent length. The decoder is a pixel-space diffusion transformer that reconstructs videos with the latents as input conditions. With a two-stage training strategy, One-DVA achieves performance comparable to 3D-CNN VAEs on reconstruction metrics at identical compression ratios. More importantly, it supports adaptive compression and thus can achieve higher compression ratios. To better support downstream latent generation, we further regularize the One-DVA latent distribution for generative modeling and fine-tune its decoder to mitigate artifacts caused by the generation process.
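
The variable-length dropout mechanism mentioned above can be illustrated with a small sketch. The abstract does not specify how tokens are dropped, so the code below assumes a nested-dropout-style scheme: a keep length is sampled uniformly, and only the first tokens of the 1D latent sequence are retained. The function names (`sample_keep_len`, `drop_latent_tail`), the uniform sampling, and the toy sizes are all hypothetical and are not taken from the paper.

```python
import random


def sample_keep_len(max_len, rng, min_len=0):
    """Sample how many 1D latent tokens to keep (uniform is an assumption)."""
    return rng.randint(min_len, max_len)  # both bounds are inclusive


def drop_latent_tail(latents, keep_len):
    """Keep the first `keep_len` tokens and return them with a validity mask.

    `latents` is a list of token vectors. The mask has one entry per original
    token and is True for tokens that survive the dropout.
    """
    if not 0 <= keep_len <= len(latents):
        raise ValueError("keep_len must be in [0, len(latents)]")
    kept = latents[:keep_len]
    mask = [i < keep_len for i in range(len(latents))]
    return kept, mask


# Toy usage: 8 latent tokens of dimension 4 (sizes chosen for illustration).
rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(8)]
keep_len = sample_keep_len(len(latents), rng)
kept, mask = drop_latent_tail(latents, keep_len)
```

Under this assumption, earlier tokens are kept more often than later ones during training, so a decoder conditioned on the kept prefix would learn to reconstruct from any prefix length. Fewer tokens then correspond to a higher compression ratio.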

Paper Structure

This paper contains 41 sections, 5 equations, 11 figures, 6 tables.

Figures (11)

  • Figure 1: Overview: our One-DVA consists of an encoder, a diffusion decoder and a latent dropout module. The encoder utilizes a vision transformer with 1D queries to extract input video features and outputs low-dimensional latents. The latent dropout module dynamically adjusts the length of 1D latents during training. The diffusion decoder is a diffusion transformer generating videos in pixel space with the latents as the input condition.
  • Figure 2: Reconstruction quality across different diffusion sampling steps ($1$, $4$, $8$, and $25$) and varying 1D latent lengths.
  • Figure 3: Reconstructed videos with various 1D latent lengths. The first row shows the ground-truth (GT) videos, while the subsequent rows depict reconstructions with 1D latent lengths of $0$, $200$, $600$, and $1000$, respectively. The red dashed boxes highlight regions where reconstruction quality varies noticeably across different 1D latent lengths. We sample frames at a 5-frame interval.
  • Figure 4: Quantitative reconstruction metrics using variable-length 1D latents. Videos with greater motion exhibit a steeper PSNR decline as the 1D latent length decreases.
  • Figure 5: Text-to-video results of our latent diffusion model trained on the latent space of our autoencoder.
  • ...and 6 more figures
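
Figure 4 reports PSNR as the 1D latent length varies. For reference, the standard PSNR definition, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, can be computed as in the sketch below. It assumes 8-bit pixel values flattened into plain lists. The paper's exact evaluation protocol (for example, per-frame versus per-video averaging) is not stated here, so this is only the textbook formula.

```python
import math


def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy usage: a reconstruction that is off by one intensity level everywhere.
score = psnr([10, 20, 30], [11, 21, 31])
```

A higher value means the reconstruction is closer to the ground truth, so a steeper PSNR decline at shorter latent lengths indicates that high-motion videos lose more detail under stronger compression.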