Taming Consistency Distillation for Accelerated Human Image Animation

Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yujie Wei, Yingya Zhang, Changxin Gao, Yuehuan Wang, Nong Sang

TL;DR

The paper addresses the high inference cost of video diffusion models for human image animation and the quality drop that naive consistency distillation causes. It introduces DanceLCM, a segmented trajectory distillation framework that divides the PF-ODE trajectory into $K$ segments. The framework adds a lightweight auxiliary head that aligns predictions with real video latents, a motion-focused loss on moving regions, and explicit facial fidelity via a VAE-based face feature injected through cross-attention. The training objective combines $\mathcal{L}_{CD}$ and $\mathcal{L}_{aux}$ using the weight $\lambda_{2}$, and inference runs without classifier-free guidance. Experiments on the TikTok and UBC Fashion datasets show that DanceLCM achieves comparable or superior video quality with only 2–4 inference steps, substantially reducing the computational burden while preserving temporal coherence and facial realism. The results approach those of teacher diffusion models that use many more steps, and the code and models are to be released.
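As a rough illustration of the segmented idea summarized above, the following sketch splits a discrete timestep range into $K$ segments and maps each timestep to the start of its own segment, rather than to timestep 0 as a single full-trajectory consistency model would. The equal-width split, the 1000-step range, the function names, and the additive form $\mathcal{L}_{CD} + \lambda_{2}\mathcal{L}_{aux}$ are all assumptions made for illustration; the summary does not specify them.

```python
import bisect


def segment_boundaries(num_train_steps: int, k: int) -> list[int]:
    """Return the k+1 boundary timesteps of k equal-width segments.

    Assumption: equal-width segments over [0, num_train_steps].
    """
    if k <= 0 or num_train_steps % k != 0:
        raise ValueError("k must be positive and divide num_train_steps")
    width = num_train_steps // k
    return [i * width for i in range(k + 1)]


def segment_target(t: int, boundaries: list[int]) -> int:
    """Return the start of the segment that contains timestep t.

    A segment-wise consistency function is trained to map any point in
    (b_{i-1}, b_i] to b_{i-1}, so errors do not accumulate over the
    whole trajectory in one jump.
    """
    if not boundaries[0] < t <= boundaries[-1]:
        raise ValueError("t must lie in (first boundary, last boundary]")
    idx = bisect.bisect_left(boundaries, t)  # first boundary >= t
    return boundaries[idx - 1]


def total_loss(l_cd: float, l_aux: float, lambda_2: float) -> float:
    """Hypothetical combination of the two losses (assumed additive)."""
    return l_cd + lambda_2 * l_aux


# Hypothetical values: 1000 training timesteps, K = 4 segments.
bounds = segment_boundaries(1000, 4)
print(bounds)
print(segment_target(620, bounds))
```

With $K$ segments, sampling needs at least one network evaluation per segment, which is consistent with the 2–4 inference steps reported in the summary, although the exact correspondence is not stated there.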

Abstract

Recent advancements in human image animation have been propelled by video diffusion models, yet their reliance on numerous iterative denoising steps results in high inference costs and slow speeds. An intuitive solution involves adopting consistency models, which serve as an effective acceleration paradigm through consistency distillation. However, simply employing this strategy in human image animation often leads to quality decline, including visual blurring, motion degradation, and facial distortion, particularly in dynamic regions. In this paper, we propose the DanceLCM approach, complemented by several enhancements to improve visual quality and motion continuity in the low-step regime: (1) segmented consistency distillation with an auxiliary lightweight head to incorporate supervision from real video latents, mitigating cumulative errors resulting from single full-trajectory generation; (2) a motion-focused loss that concentrates on motion regions, and explicit injection of facial fidelity features to improve face authenticity. Extensive qualitative and quantitative experiments demonstrate that DanceLCM achieves results comparable to state-of-the-art video diffusion models with a mere 2-4 inference steps, significantly reducing the inference burden without compromising video quality. The code and models will be made publicly available.
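To make the notion of a motion-focused loss more concrete, here is a minimal sketch that up-weights the reconstruction error at positions that change between two consecutive frames. The abstract does not say how motion regions are detected or weighted, so the frame-difference mask, the `threshold` and `motion_weight` parameters, and both function names are hypothetical choices, not the authors' formulation.

```python
def motion_mask(prev_frame: list[float], next_frame: list[float],
                threshold: float = 0.1) -> list[float]:
    """Mark positions whose value changes by more than `threshold`.

    Assumption: motion is approximated by a simple frame difference.
    """
    return [1.0 if abs(b - a) > threshold else 0.0
            for a, b in zip(prev_frame, next_frame)]


def motion_focused_mse(pred: list[float], target: list[float],
                       mask: list[float], motion_weight: float = 2.0) -> float:
    """Weighted mean squared error that emphasizes moving positions.

    Static positions get weight 1, moving positions get `motion_weight`;
    the result is normalized by the sum of the weights.
    """
    if not (len(pred) == len(target) == len(mask)):
        raise ValueError("pred, target and mask must have equal length")
    weights = [1.0 + (motion_weight - 1.0) * m for m in mask]
    weighted = sum(w * (p - t) ** 2 for w, p, t in zip(weights, pred, target))
    return weighted / sum(weights)


# Toy example on flattened "frames" of four positions; only position 1 moves.
mask = motion_mask([0.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0])
loss = motion_focused_mse([1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0], mask)
print(mask, loss)
```

In this toy case an unweighted mean squared error would be 0.5, while the weighted version is higher because one of the two errors falls on a moving position.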

Paper Structure

This paper contains 12 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Comparative results generated by the baseline VideoLCM (Wang et al., 2023) and our proposed DanceLCM on the human image animation task.
  • Figure 2: Overall pipeline of the proposed DanceLCM. The segmented trajectory distillation is performed to transfer the knowledge of the pretrained teacher diffusion model (i.e., the ODE solver) to the student consistency model by forcing the outputs of two consistency models to be consistent. Furthermore, an auxiliary loss that aligns predicted video latents with real video latents is adopted to provide more reliable distillation supervision. Additionally, a motion-focused loss is applied to emphasize motion regions, and the facial condition is explicitly injected into the model to improve facial realism.
  • Figure 3: Qualitative evaluation on the TikTok dataset. Compared with the existing methods, the proposed DanceLCM achieves better results in terms of visual fidelity, temporal smoothness and face realism.
  • Figure 4: Qualitative ablation evaluation. Removing any single component degrades the quality of the generated video to some extent.
  • Figure 5: Ablation study on the effect of facial fidelity enhancement.
  • ...and 1 more figures