Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation

Kihong Kim, Haneol Lee, Jihye Park, Seyeon Kim, Kwanghee Lee, Seungryong Kim, Jaejun Yoo

TL;DR

This work introduces Hybrid Video Diffusion Models (HVDM), a framework that jointly leverages a 2D triplane latent for global context and a 3D wavelet representation for local volumetric detail to improve video generation. By fusing these hybrid latents through cross-attention and training with a combination of reconstruction, perceptual, and frequency-matching losses, HVDM achieves state-of-the-art results on benchmarks like UCF-101, SkyTimelapse, and TaiChi. The diffusion model operates in the learned latent space using a 3D U-Net denoiser, enabling long video generation, image-to-video, and video dynamics control via conditional cues. Key contributions include the hybrid autoencoder design, the use of 3D wavelet subbands to expand receptive fields, and the frequency-domain loss that enhances temporal coherence and detail, demonstrated across multiple datasets and ablation studies.

Abstract

Generating high-quality videos that synthesize desired realistic content is a challenging task due to the intricate high dimensionality and complexity of videos. Several recent diffusion-based methods have shown comparable performance by compressing videos to a lower-dimensional latent space using a traditional video autoencoder architecture. However, such methods, which employ standard frame-wise 2D and 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which can capture spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video comprising (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving the video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on video generation benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).

Paper Structure

This paper contains 45 sections, 17 equations, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Overview of our hybrid video autoencoder in HVDM, which combines a 2D triplane and a 3D volume representation for video encoding. The 2D triplane representation provides global context, and the 3D volume representation provides local volume information of the video. The spatio-temporal cross-attention module combines these distinct features to form a fine-grained video representation.
  • Figure 2: Visualization of the 3D wavelet transform. The video volume is decomposed into eight subbands ($\mathbf{x}_\mathrm{lll}, \ldots, \mathbf{x}_\mathrm{hhh}$) containing low- and high-frequency components.
  • Figure 3: Applications of diverse video generation with our proposed HVDM. HVDM is adaptable to diverse video generation tasks depending on the type of condition latent $z_c$. During training, the condition latent $z_c$ is extracted from the video and jointly trained with the noised hybrid video latent $z_{t}$. During sampling, these condition latents are used to enable various video generation tasks, such as long video generation, image-to-video, and video dynamics control.
  • Figure 4: Qualitative reconstruction results on the SkyTimelapse dataset [xiong2018learning, siarohin2019first]. Our HVDM produces distinct, sharp edge details and faithfully expresses fine structures in a video sequence. Additional samples are included in the appendix.
  • Figure 4: Ablation study on feature fusing module.
  • ...and 14 more figures
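
The eight-subband decomposition described in the Figure 2 caption can be illustrated with a one-level 3D Haar transform. The sketch below is only an illustration under stated assumptions: the source does not say which wavelet family HVDM uses, so the Haar filter, the function name `haar_dwt3d`, and the (time, height, width) ordering of the letters in each subband name are choices made here, not the authors' implementation.

```python
from itertools import product


def haar_dwt3d(video):
    """One-level 3D Haar transform of video[t][y][x] (all dimensions even).

    Returns a dict mapping names such as 'lll' or 'hhl' to subbands that are
    half the size of the input along time, height, and width.
    Assumption: letters are ordered (time, height, width); the paper's
    convention may differ.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    assert T % 2 == 0 and H % 2 == 0 and W % 2 == 0, "dimensions must be even"
    scale = 2 ** -1.5  # (1/sqrt(2))^3 keeps the transform orthonormal
    offsets = list(product((0, 1), repeat=3))
    subbands = {}
    for band in product((0, 1), repeat=3):  # 0 = low-pass, 1 = high-pass per axis
        name = "".join("lh"[b] for b in band)
        out = [[[0.0] * (W // 2) for _ in range(H // 2)] for _ in range(T // 2)]
        for t, y, x in product(range(T // 2), range(H // 2), range(W // 2)):
            acc = 0.0
            for dt, dy, dx in offsets:
                # A high-pass axis subtracts the second sample of each pair.
                sign = (-1) ** (band[0] * dt + band[1] * dy + band[2] * dx)
                acc += sign * video[2 * t + dt][2 * y + dy][2 * x + dx]
            out[t][y][x] = acc * scale
        subbands[name] = out
    return subbands


# Toy clip: 4 frames of 4x4 pixels whose brightness changes only over time.
clip = [[[float(t)] * 4 for _ in range(4)] for t in range(4)]
bands = haar_dwt3d(clip)
print(sorted(bands))  # the eight subband names
```

Because the toy clip varies only along time, all of its energy lands in the `lll` and `hll` subbands, while every subband with a high-pass letter in a spatial position is zero. This is the sense in which the subbands separate coarse content from directional detail.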