MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Few-Step Synthesis

Shitong Shao; Hongwei Yi; Hanzhong Guo; Tian Ye; Daquan Zhou; Michael Lingelbach; Zhiqiang Xu; Zeke Xie

TL;DR

MagicDistillation tackles the high inference cost and the portrait fidelity gaps of large-scale video diffusion models by integrating weak-to-strong distribution matching with LoRA-based fine-tuning within a distribution matching distillation (DMD) framework. It introduces ground-truth supervision to stabilize training and improve visual fidelity, enabling high-quality portrait video synthesis with only 4 steps (NFE=4) while surpassing several strong baselines. Empirical results on portrait benchmarks and TI2V/I2V tasks show improved FID/FVD and motion dynamics, indicating practical potential for real-time or near-real-time portrait video generation. The approach combines a DiT-based discriminator, a 4-step distilled generator, and training data drawn from high-quality talking videos and general videos.

Abstract

Recently, open-source video diffusion models (VDMs) such as WanX, Magic141, and HunyuanVideo have been scaled to over 10 billion parameters. These large-scale VDMs show significant improvements over smaller-scale VDMs across multiple dimensions, including enhanced visual quality and more natural motion dynamics. However, they face two major limitations: (1) High inference overhead: large-scale VDMs require approximately 10 minutes to synthesize a video with 28 sampling steps on a single H100 GPU. (2) Limited portrait video synthesis: models like WanX-I2V and HunyuanVideo-I2V often produce unnatural facial expressions and movements in portrait videos. To address these challenges, we propose MagicDistillation, a novel framework designed to reduce inference overhead while preserving the generalization of VDMs for portrait video synthesis. Specifically, we primarily use sufficiently high-quality talking videos to fine-tune Magic141, which is dedicated to portrait video synthesis. We then employ LoRA to fine-tune the fake DiT effectively and efficiently within the step distillation framework known as distribution matching distillation (DMD). Following this, we apply weak-to-strong (W2S) distribution matching and minimize the discrepancy between the fake data distribution and the ground-truth distribution, thereby improving the visual fidelity and motion dynamics of the synthesized videos. Experimental results on portrait video synthesis demonstrate the effectiveness of MagicDistillation: our method surpasses the Euler, LCM, and DMD baselines on FID/FVD and on VBench. Moreover, MagicDistillation, requiring only 4 steps, also outperforms WanX-I2V (14B) and HunyuanVideo-I2V (13B) in visual comparisons and on VBench. Our project page is https://magicdistillation.github.io/MagicDistillation/.
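To make the distribution matching idea behind DMD more concrete, the following is a minimal 1-D toy sketch, not the paper's training code. It assumes a one-parameter generator that outputs samples from N(gen_mean, 1), a real distribution N(real_mean, 1) with a known score, and a fake score model that tracks the generator perfectly. The function names `gaussian_score` and `dmd_toy` and all hyperparameters are hypothetical. The sketch omits the DiT backbones, LoRA, the W2S weighting, the noise schedule, and the ground-truth supervision described above.

```python
import random


def gaussian_score(x, mean, var=1.0):
    """Score (gradient of the log-density) of a 1-D Gaussian N(mean, var)."""
    return -(x - mean) / var


def dmd_toy(real_mean=2.0, init_mean=-1.0, lr=0.1, steps=200, batch=64, seed=0):
    """Move a one-parameter generator toward the real distribution.

    The update direction is the batch average of
    (fake score - real score), which is the distribution matching
    gradient in this simplified, noise-level-free setting.
    """
    rng = random.Random(seed)
    gen_mean = init_mean  # the only generator parameter in this toy
    for _ in range(steps):
        # One-step generator: x = gen_mean + noise, so dx/d(gen_mean) = 1.
        xs = [gen_mean + rng.gauss(0.0, 1.0) for _ in range(batch)]
        # Assumption: the fake score model matches the generator exactly.
        grad = sum(
            gaussian_score(x, gen_mean) - gaussian_score(x, real_mean)
            for x in xs
        ) / batch
        # For equal-variance Gaussians the noise cancels, so grad is
        # (up to rounding) gen_mean - real_mean.
        gen_mean -= lr * grad
    return gen_mean


if __name__ == "__main__":
    print(dmd_toy())
```

In this toy the generator mean contracts toward `real_mean` at every step. In the actual framework both scores would come from large DiT models evaluated on noised video latents, and the fake model would itself be trained alongside the generator.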
