LaMD: Latent Motion Diffusion for Image-Conditional Video Generation

Yaosi Hu, Zhenzhong Chen, Chong Luo

TL;DR

LaMD reframes video generation as latent motion generation by decoupling motion from content via a Motion-Content Decomposed Video Autoencoder (MCD-VAE) and a diffusion-based Motion Generator (DMG). This two-stage approach compresses motion into a compact latent space and runs a DDPM over the motion latent $z_m$, conditioned on content features and optional multimodal cues, to synthesize coherent motion, which a decoder then fuses with appearance. The method delivers high-quality image-conditional videos across BAIR, Landscape, NATOPS, MUG, and CATER-GEN with significantly faster sampling than prior video diffusion models. It also offers insights into motion-content decomposition and the trade-off between latent capacity and reconstruction quality, suggesting strong potential for efficient, controllable video generation.

Abstract

The video generation field has witnessed rapid improvements with the introduction of recent diffusion models. While these models have successfully enhanced appearance quality, they still face challenges in generating coherent and natural movements while efficiently sampling videos. In this paper, we propose to condense video generation into a problem of motion generation, to improve the expressiveness of motion and make video generation more manageable. This can be achieved by breaking down the video generation process into latent motion generation and video reconstruction. Specifically, we present a latent motion diffusion (LaMD) framework, which consists of a motion-decomposed video autoencoder and a diffusion-based motion generator, to implement this idea. Through careful design, the motion-decomposed video autoencoder can compress patterns in movement into a concise latent motion representation. Consequently, the diffusion-based motion generator is able to efficiently generate realistic motion on a continuous latent space under multi-modal conditions, at a cost that is similar to that of image diffusion models. Results show that LaMD generates high-quality videos on various benchmark datasets, including BAIR, Landscape, NATOPS, MUG and CATER-GEN, that encompass a variety of stochastic dynamics and highly controllable movements on multiple image-conditional video generation tasks, while significantly decreasing sampling time.
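
As a rough illustration of the sampling path the abstract describes (generate a motion latent by diffusion, then reconstruct the video from it and the given image), the following Python sketch runs standard DDPM ancestral sampling over a small latent and passes the result to a toy decoder. It is not the authors' implementation: `toy_denoiser`, `toy_decoder`, the linear beta schedule, the step count, and every array shape are hypothetical placeholders standing in for the trained motion generator and the autoencoder's decoder.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption; the paper's schedule may differ)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(z0, t, alpha_bars, rng):
    """Forward process: noise a clean motion latent z0 up to step t."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return z_t, eps

def toy_denoiser(z_t, t, content_feat, cond=None):
    """Hypothetical stand-in for a trained noise predictor.

    A real model would be a network taking (z_t, t, content features, cond);
    this untrained placeholder only shows where the conditioning enters.
    """
    return z_t - content_feat

def sample_motion_latent(shape, content_feat, schedule, rng, cond=None):
    """Standard DDPM ancestral sampling with variance sigma_t^2 = beta_t."""
    betas, alphas, alpha_bars = schedule
    z = rng.standard_normal(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps_hat = toy_denoiser(z, t, content_feat, cond)
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else np.zeros(shape)
        z = mean + np.sqrt(betas[t]) * noise
    return z

def toy_decoder(z_m, first_frame, num_frames):
    """Hypothetical decoder: upsample the latent and add it to the first frame."""
    scale = first_frame.shape[0] // z_m.shape[0]
    up = np.kron(z_m, np.ones((scale, scale)))  # nearest-neighbour upsampling
    return np.stack([first_frame + 0.01 * k * up for k in range(num_frames)])

rng = np.random.default_rng(0)
schedule = make_schedule()
first_frame = rng.random((64, 64))                        # the given image (toy data)
content_feat = first_frame.reshape(8, 8, 8, 8).mean(axis=(1, 3))  # toy 8x8 "content feature"
z_m = sample_motion_latent((8, 8), content_feat, schedule, rng)
video = toy_decoder(z_m, first_frame, num_frames=4)
print(video.shape)
```

The point of the sketch is structural: the iterative loop touches only the small latent, and the decoder runs once at the end, which is why the per-step cost can resemble that of an image diffusion model.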

Paper Structure

This paper contains 22 sections, 6 equations, 10 figures, 13 tables.

Figures (10)

  • Figure 1: Comparison of video generation in different latent spaces. Dashed lines denote operations used only during training, while solid lines denote operations used in both training and sampling.
  • Figure 2: The framework of the proposed LaMD. During training, the stage-I MCD-VAE is first trained on a video reconstruction task to decompose latent motion; in the second stage, the DMG is trained to generate natural motion conditioned on $\left\{f_{x_0}, c\right\}$. During sampling, the motion latents are first generated by the DMG and then fed into the decoder $\mathcal{D}_V$, together with multi-scale content features from the given first image, to synthesize videos. Black dashed lines denote operations used only during training.
  • Figure 3: The architecture of the fusion decoder.
  • Figure 4: Comparison of the sampling processes of different video diffusion models. Thanks to its low-dimensional diffusion target and 2D-UNet-based diffusion model, latent motion diffusion samples much faster than video-space diffusion and latent video diffusion. The channel dimension is omitted in all settings for simplicity.
  • Figure 5: Reconstruction results of MCD-VAE on the Landscape and BAIR datasets. Original videos are shown in green boxes and reconstructed videos in orange boxes.
  • ...and 5 more figures
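
The speed argument in the Figure 4 caption can be made concrete with a back-of-the-envelope count of how many values each kind of diffusion target contains. The sketch below is only illustrative: the frame count, resolution, and downsampling factor are hypothetical numbers not taken from the paper, and treating the motion latent as a single 2D map with no temporal axis is an assumption suggested by the caption's mention of a 2D-UNet. As in the caption, the channel dimension is omitted.

```python
from math import prod

def num_elements(shape):
    """Number of scalar values in a tensor of the given shape."""
    return prod(shape)

# Hypothetical sizes, chosen only for illustration (not from the paper).
frames, height, width = 16, 64, 64
spatial_down = 8  # assumed spatial downsampling factor of an autoencoder

video_space = (frames, height, width)                                    # diffuse on pixels
latent_video = (frames, height // spatial_down, width // spatial_down)   # per-frame latents
latent_motion = (height // spatial_down, width // spatial_down)          # one 2D motion latent

for name, shape in [("video space", video_space),
                    ("latent video", latent_video),
                    ("latent motion", latent_motion)]:
    print(f"{name:14s} shape={shape} elements={num_elements(shape)}")
```

Element count is only one factor in sampling time; the network architecture and the number of denoising steps matter as well, so these ratios should not be read as measured speedups.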