Phased One-Step Adversarial Equilibrium for Video Diffusion Models
Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Henry Jin, Kai Yu, Peng Zhang, Wenyue Li, Yuan Zhou, Tianxiang Zheng, Qinglin Lu
TL;DR
The paper tackles the sampling-efficiency bottleneck in video diffusion models, especially for large-scale architectures and long temporal contexts. It presents Video Phased Adversarial Equilibrium (V-PAE), a two-phase distillation framework comprising stability priming, which aligns real and generated video distributions, and a unified adversarial equilibrium with a self-discriminator backbone, augmented by a semantic discriminator head and a conditional SDS loss to preserve video-image subject consistency. Empirical results on VBench-I2V show V-PAE achieving an average 5.8% improvement in overall quality and a 100x reduction in diffusion latency, with comparable or better performance in few-step or zero-shot scenarios. The approach makes real-time, high-fidelity video synthesis practical for interactive applications, and the paper reports extensive ablations that validate its components and design choices.
Abstract
Video diffusion generation suffers from critical sampling efficiency bottlenecks, particularly for large-scale models and long contexts. Existing video acceleration methods, adapted from image-based techniques, support neither single-step distillation of large-scale video models nor generalization to conditional downstream tasks. To bridge this gap, we propose the Video Phased Adversarial Equilibrium (V-PAE), a distillation framework that enables high-quality, single-step video generation from large-scale video models. Our approach employs a two-phase process. (i) Stability priming is a warm-up process that aligns the distributions of real and generated videos, which improves the stability of the subsequent single-step adversarial distillation. (ii) Unified adversarial equilibrium is a flexible self-adversarial process that reuses generator parameters for the discriminator backbone and achieves a co-evolutionary adversarial equilibrium in the Gaussian noise space. For conditional tasks, we focus on preserving video-image subject consistency, which is otherwise lost to semantic degradation and conditional frame collapse during distillation training in image-to-video (I2V) generation. Comprehensive experiments on VBench-I2V demonstrate that V-PAE outperforms existing acceleration methods by an average of 5.8% in the overall quality score, which covers semantic alignment, temporal coherence, and frame quality. In addition, our approach reduces the diffusion latency of a large-scale video model (e.g., Wan2.1-I2V-14B) by a factor of 100 while preserving competitive performance.
