Table of Contents
Fetching ...

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

Yuanhao Zhai, Kevin Lin, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Chung-Ching Lin, David Doermann, Junsong Yuan, Lijuan Wang

TL;DR

The proposed motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning, achieves the state-of-the-art video diffusion distillation performance and can enhance frame quality in video diffusion models.

Abstract

Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to the limited visual quality in public video datasets. This affects the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while improving frame appearance using abundant high-quality image data. We propose motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM includes a video consistency model that distills motion from the video teacher model, and an image discriminator that enhances frame appearance to match high-quality image data. This combination presents two challenges: (1) conflicting frame learning objectives, as video distillation learns from low-quality video frames while the image discriminator targets high-quality images; and (2) training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains. Extensive experiments show that our MCM achieves the state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic scores or specific styles without corresponding video data.

Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation

TL;DR

The proposed motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning, achieves the state-of-the-art video diffusion distillation performance and can enhance frame quality in video diffusion models.

Abstract

Image diffusion distillation achieves high-fidelity generation with very few sampling steps. However, applying these techniques directly to video diffusion often results in unsatisfactory frame quality due to the limited visual quality in public video datasets. This affects the performance of both teacher and student video diffusion models. Our study aims to improve video diffusion distillation while improving frame appearance using abundant high-quality image data. We propose motion consistency model (MCM), a single-stage video diffusion distillation method that disentangles motion and appearance learning. Specifically, MCM includes a video consistency model that distills motion from the video teacher model, and an image discriminator that enhances frame appearance to match high-quality image data. This combination presents two challenges: (1) conflicting frame learning objectives, as video distillation learns from low-quality video frames while the image discriminator targets high-quality images; and (2) training-inference discrepancies due to the differing quality of video samples used during training and inference. To address these challenges, we introduce disentangled motion distillation and mixed trajectory distillation. The former applies the distillation objective solely to the motion representation, while the latter mitigates training-inference discrepancies by mixing distillation trajectories from both the low- and high-quality video domains. Extensive experiments show that our MCM achieves the state-of-the-art video diffusion distillation performance. Additionally, our method can enhance frame quality in video diffusion models, producing frames with high aesthetic scores or specific styles without corresponding video data.
Paper Structure (18 sections, 9 equations, 9 figures, 6 tables)

This paper contains 18 sections, 9 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Qualitative result comparisons with latent consistency model (LCM) luo2023latent using ModelScopeT2V wang2023modelscope as teacher. Our MCM outputs clearer video frames using few-step sampling, and can also adapt to different image styles using additional image datasets. Corresponding video results are in supplementary materials.
  • Figure 2: Illustration of our motion consistency model distillation process, which not only distills the motion prior from teacher to accelerate sampling, but also can benefit from an additional high-quality image dataset to improve the frame quality of generated videos.
  • Figure 3: Left: framework overview. Our MCM features disentangled motion-appearance distillation, where motion is learned via the motion consistency distillation loss $\mathcal{L}_{\text{MCD}}$, and the appearance is learned with the frame adversarial objective $\mathcal{L}_{\text{adv}}^{\text{G}}$. Right: mixed trajectory distillation. We simulate the inference-time ODE trajectory using student-generated video (bottom green line), which is mixed with the real video ODE trajectory (top green line) for consistency distillation training.
  • Figure 4: Learnable motion representation.
  • Figure 5: Qualitative comparison of video diffusion distillation with AnimateDiff guo2023animatediff as the teacher model. The first and last frames are sampled for visualization. Our MCM produces cleaner frames using only 2 and 4 sampling steps, with better prompt alignment and improved frame details. Corresponding video results are in supplementary materials.
  • ...and 4 more figures