Hybrid Video Diffusion Models with 2D Triplane and 3D Wavelet Representation
Kihong Kim, Haneol Lee, Jihye Park, Seyeon Kim, Kwanghee Lee, Seungryong Kim, Jaejun Yoo
TL;DR
This work introduces Hybrid Video Diffusion Models (HVDM), a framework that jointly leverages a 2D triplane latent for global context and a 3D wavelet representation for local volumetric detail to improve video generation. By fusing these hybrid latents through cross-attention and training with a combination of reconstruction, perceptual, and frequency-matching losses, HVDM achieves state-of-the-art results on benchmarks such as UCF101, SkyTimelapse, and TaiChi. The diffusion model operates in the learned latent space using a 3D U-Net denoiser, enabling long video generation, image-to-video, and video dynamics control via conditional cues. Key contributions include the hybrid autoencoder design, the use of 3D wavelet subbands to expand receptive fields, and the frequency-domain loss that enhances temporal coherence and detail, demonstrated across multiple datasets and ablation studies.
Abstract
Generating high-quality videos that synthesize desired realistic content is a challenging task due to the intricate, high-dimensional, and complex nature of videos. Several recent diffusion-based methods have shown comparable performance by compressing videos into a lower-dimensional latent space using a traditional video autoencoder architecture. However, such methods, which employ standard frame-wise 2D and 3D convolutions, fail to fully exploit the spatio-temporal nature of videos. To address this issue, we propose a novel hybrid video diffusion model, called HVDM, which captures spatio-temporal dependencies more effectively. HVDM is trained with a hybrid video autoencoder that extracts a disentangled representation of the video, including: (i) global context information captured by a 2D projected latent, (ii) local volume information captured by 3D convolutions with wavelet decomposition, and (iii) frequency information for improving the video reconstruction. Based on this disentangled representation, our hybrid autoencoder provides a more comprehensive video latent, enriching the generated videos with fine structures and details. Experiments on video generation benchmarks (UCF101, SkyTimelapse, and TaiChi) demonstrate that the proposed approach achieves state-of-the-art video generation quality and supports a wide range of video applications (e.g., long video generation, image-to-video, and video dynamics control).
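To make the two representations named in the abstract more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a video stored as a (T, H, W, C) array, uses plain averaging along each axis as a stand-in for the learned 2D projection, and uses a one-level orthonormal Haar filter as the wavelet. The function names `triplane_project`, `haar_split`, and `haar_dwt3d` are hypothetical and chosen only for illustration.

```python
import numpy as np


def triplane_project(video):
    """Collapse a (T, H, W, C) video onto three 2D planes.

    Assumption: mean pooling stands in for the learned projection
    that a real autoencoder would use.
    """
    plane_hw = video.mean(axis=0)  # (H, W, C): time collapsed
    plane_tw = video.mean(axis=1)  # (T, W, C): height collapsed
    plane_th = video.mean(axis=2)  # (T, H, C): width collapsed
    return plane_hw, plane_tw, plane_th


def haar_split(x, axis):
    """One-level orthonormal Haar split along `axis` (length must be even)."""
    n = x.shape[axis]
    if n % 2 != 0:
        raise ValueError("axis length must be even for a one-level Haar split")
    even = np.take(x, np.arange(0, n, 2), axis=axis)
    odd = np.take(x, np.arange(1, n, 2), axis=axis)
    low = (even + odd) / np.sqrt(2.0)   # local average (low-pass)
    high = (even - odd) / np.sqrt(2.0)  # local difference (high-pass)
    return low, high


def haar_dwt3d(video):
    """Decompose a (T, H, W, C) video into 8 subbands of shape (T/2, H/2, W/2, C).

    Subband names list the filter used along T, H, W in that order,
    e.g. "LLH" is low-pass in time and height, high-pass in width.
    """
    subbands = {"": video}
    for axis in (0, 1, 2):  # time, height, width
        next_level = {}
        for name, band in subbands.items():
            low, high = haar_split(band, axis)
            next_level[name + "L"] = low
            next_level[name + "H"] = high
        subbands = next_level
    return subbands


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((8, 16, 16, 3))
    planes = triplane_project(clip)
    bands = haar_dwt3d(clip)
    print([p.shape for p in planes])
    print(sorted(bands), bands["LLL"].shape)
```

In this sketch each subband has half the resolution along every axis, so a fixed-size 3D convolution applied to the subbands covers a larger region of the original video. Because the Haar filter used here is orthonormal, the total energy of the eight subbands equals that of the input clip.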
