MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Few-Step Synthesis

Shitong Shao, Hongwei Yi, Hanzhong Guo, Tian Ye, Daquan Zhou, Michael Lingelbach, Zhiqiang Xu, Zeke Xie

TL;DR

MagicDistillation addresses two weaknesses of large-scale video diffusion models, high inference cost and poor portrait fidelity, by combining weak-to-strong (W2S) distribution matching with LoRA-based fine-tuning inside the distribution matching distillation (DMD) framework. It adds ground-truth supervision to stabilize training and improve visual fidelity, enabling high-quality portrait video synthesis in only 4 steps (NFE=4) while surpassing several strong baselines. Results on portrait benchmarks and on TI2V/I2V tasks show improved FID/FVD and motion dynamics, which suggests practical potential for low-latency portrait video generation. The approach combines a DiT-based discriminator, a 4-step distilled generator, and training data drawn from high-quality talking videos and general videos.

Abstract

Recently, open-source video diffusion models (VDMs), such as WanX, Magic141 and HunyuanVideo, have been scaled to over 10 billion parameters. These large-scale VDMs have demonstrated significant improvements over smaller-scale VDMs across multiple dimensions, including enhanced visual quality and more natural motion dynamics. However, these models face two major limitations: (1) High inference overhead: large-scale VDMs require approximately 10 minutes to synthesize a video with 28 sampling steps on a single H100 GPU. (2) Limited portrait video synthesis: models like WanX-I2V and HunyuanVideo-I2V often produce unnatural facial expressions and movements in portrait videos. To address these challenges, we propose MagicDistillation, a novel framework designed to reduce inference overhead while ensuring the generalization of VDMs for portrait video synthesis. Specifically, we primarily use sufficiently high-quality talking videos to fine-tune Magic141, which is dedicated to portrait video synthesis. We then employ LoRA to effectively and efficiently fine-tune the fake DiT within the step distillation framework known as distribution matching distillation (DMD). Following this, we apply weak-to-strong (W2S) distribution matching and minimize the discrepancy between the fake data distribution and the ground truth distribution, thereby improving the visual fidelity and motion dynamics of the synthesized videos. Experimental results on portrait video synthesis demonstrate the effectiveness of MagicDistillation, as our method surpasses the Euler, LCM, and DMD baselines on both FID/FVD metrics and VBench. Moreover, MagicDistillation, requiring only 4 steps, also outperforms WanX-I2V (14B) and HunyuanVideo-I2V (13B) in visual comparisons and on VBench. Our project page is https://magicdistillation.github.io/MagicDistillation/.

Paper Structure

This paper contains 28 sections, 1 theorem, 10 equations, 13 figures, 7 tables, 1 algorithm.

Key Result

Proposition 3.1

(Proof in the appendix.) The optimization objective of W2S distribution matching remains identical to that of the standard distribution matching, with the low-rank branch $\zeta(\mathbf{x}_t,t)$ serving as the intermediate teacher to facilitate better optimization.
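
For orientation, the standard distribution matching gradient that the proposition compares against is usually written as below in the DMD literature. This is a sketch in assumed notation, not a formula taken from this page: $G_\phi$ denotes the few-step generator, $z$ its input noise, $\mathbf{x}_t$ the noised generator output at timestep $t$, and $w_t$ a timestep-dependent weight that absorbs any noise-schedule factors.

$$
\nabla_\phi \mathcal{D}_\textrm{KL}\left(p_\textrm{fake} \,\|\, p_\textrm{real}\right) \approx \mathbb{E}_{z,t}\left[ w_t \left( \nabla_{\mathbf{x}_t} \log p_\textrm{fake}(\mathbf{x}_t, t) - \nabla_{\mathbf{x}_t} \log p_\textrm{real}(\mathbf{x}_t, t) \right) \frac{\partial G_\phi(z)}{\partial \phi} \right]
$$

The bracketed score difference is the quantity that the real and fake DiTs estimate. How exactly the low-rank branch $\zeta(\mathbf{x}_t,t)$ enters this expression is established in the paper's proof and is not reproduced here.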

Figures (13)

  • Figure 1: Comparative visualization of synthesized videos for both real humans and characters using WanX-I2V (14B), HunyuanVideo-I2V (13B), Magic141, and MagicDistillation. Notably, WanX-I2V, HunyuanVideo-I2V, and Magic141 results are generated using 50 sampling steps, while MagicDistillation achieves comparable results with only 4 sampling steps.
  • Figure 2: Left: Vanilla DMD suffers from training collapse (4-step models with 900 iterations). Right: Our framework integrates LoRA for weak-to-strong distribution matching and $\mathcal{D}_\textrm{KL}$ constraints.
  • Figure 3: Illustration of MagicDistillation. "Reg.", "Dis.", and "Gen." stand for "Regularization", "Discriminator", and "Generator", respectively. MagicDistillation primarily leverages LoRA to facilitate the training of a large-scale VDM. The weight factors $\alpha_\textrm{strong}$ and $\alpha_\textrm{weak}$ are employed to achieve the W2S distribution matching. Furthermore, the regularization loss, which incorporates the ground truth video, helps to alleviate the overfitting problem encountered with the DMD Loss.
  • Figure 4: Ablation studies of $\alpha_\textrm{weak}$ and $\mathcal{L}_\textrm{reg}$ using 4-step models on our customized VBench. MagicDistillation reduces to vanilla DMD2 when $\alpha_\textrm{weak}=0$. The average metrics in the lower right corner show that MagicDistillation achieves its best performance at $\alpha_\textrm{weak}=0.25$. Furthermore, when the ground truth video has high visual quality but insufficient motion dynamics, the regularization loss $\mathcal{L}_\textrm{reg}$ trades motion dynamics against visual quality.
  • Figure 5: Vanilla DMD vs. MagicDistillation. Vanilla DMD faces a significant challenge: $p_\textrm{real}$ has no overlap with the feasible region of the samples synthesized by the few-step generator $G_\phi$, which leads to an inaccurate estimate of the gradient difference $\nabla \log p_\textrm{fake} - \nabla \log p_\textrm{real}$. In contrast, MagicDistillation mitigates this issue by subtly shifting $p_\textrm{real}$ toward $p_\textrm{fake}$. This adjustment ensures a more substantial overlap between the two distributions, thereby improving the accuracy of the gradient estimate.
  • ...and 8 more figures
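
The captions for Figures 3 to 5 describe two LoRA weight factors, $\alpha_\textrm{strong}$ and $\alpha_\textrm{weak}$, that place the real and fake branches at different distances from the frozen base model. The toy sketch below illustrates one plausible reading of that mechanism on a single linear layer. It is an assumption-laden illustration, not the authors' implementation: the class `LoRALinear`, the layer sizes, and the choice `alpha_strong = 1.0` are hypothetical, and only the value `alpha_weak = 0.25` comes from the Figure 4 caption.

```python
import numpy as np

rng = np.random.default_rng(0)


class LoRALinear:
    """Toy linear layer: frozen base weight W plus a low-rank update B @ A.

    Hypothetical stand-in for one DiT projection; the scale `alpha` is chosen
    per call, so a single set of LoRA weights can serve several branches.
    """

    def __init__(self, d_in, d_out, rank):
        self.W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # frozen base
        self.A = rng.normal(size=(rank, d_in)) / np.sqrt(d_in)   # LoRA down-projection
        self.B = rng.normal(size=(d_out, rank)) * 0.1            # LoRA up-projection

    def __call__(self, x, alpha):
        # Base output plus the low-rank correction scaled by alpha.
        return x @ self.W.T + alpha * (x @ self.A.T @ self.B.T)


layer = LoRALinear(d_in=8, d_out=8, rank=2)
x = rng.normal(size=(4, 8))

alpha_strong = 1.0   # assumed: the fake branch applies the full LoRA update
alpha_weak = 0.25    # best-performing value reported in the Figure 4 caption

fake_out = layer(x, alpha_strong)  # "fake" branch
real_out = layer(x, alpha_weak)    # "real" branch, nudged toward the fake branch
base_out = layer(x, 0.0)           # alpha = 0 leaves only the frozen base model

gap_shifted = np.linalg.norm(fake_out - real_out)
gap_vanilla = np.linalg.norm(fake_out - base_out)
print(f"fake-real gap with shift: {gap_shifted:.4f}, without shift: {gap_vanilla:.4f}")
```

With `alpha = 0` the low-rank term vanishes and the layer is exactly the frozen base model, which mirrors the caption's statement that the method reduces to vanilla DMD2 when $\alpha_\textrm{weak}=0$. In this toy, any `alpha_weak` between 0 and `alpha_strong` shrinks the gap between the two branch outputs, which is the intuition behind shifting $p_\textrm{real}$ toward $p_\textrm{fake}$ in Figure 5.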

Theorems & Definitions (1)

  • Proposition 3.1