Phased One-Step Adversarial Equilibrium for Video Diffusion Models

Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Henry Jin, Kai Yu, Peng Zhang, Wenyue Li, Yuan Zhou, Tianxiang Zheng, Qinglin Lu

TL;DR

The paper tackles the sampling-efficiency bottleneck in video diffusion models, especially for large-scale architectures and long temporal contexts. It presents Video Phased Adversarial Equilibrium (V-PAE), a two-phase distillation framework. The first phase, stability priming, aligns the real and generated video distributions. The second, unified adversarial equilibrium, reuses the generator as a self-discriminator backbone. For conditional generation, a semantic discriminator head and a conditional SDS loss preserve video-image subject consistency. On VBench-I2V, V-PAE improves the overall quality score by an average of 5.8% over existing acceleration methods and reduces diffusion latency by 100x, with comparable or better performance in few-step or zero-shot scenarios. The approach targets real-time, high-fidelity video synthesis for interactive applications, and the paper reports ablations that validate its components and design choices.

Abstract

Video diffusion generation suffers from critical sampling efficiency bottlenecks, particularly for large-scale models and long contexts. Existing video acceleration methods, adapted from image-based techniques, lack both single-step distillation ability for large-scale video models and task generalization for conditional downstream tasks. To bridge this gap, we propose Video Phased Adversarial Equilibrium (V-PAE), a distillation framework that enables high-quality, single-step video generation from large-scale video models. Our approach employs a two-phase process. (i) Stability priming is a warm-up process that aligns the distributions of real and generated videos. It improves the stability of single-step adversarial distillation in the subsequent phase. (ii) Unified adversarial equilibrium is a flexible self-adversarial process that reuses generator parameters for the discriminator backbone. It achieves a co-evolutionary adversarial equilibrium in the Gaussian noise space. For conditional tasks, we focus on preserving video-image subject consistency, which is otherwise lost to semantic degradation and conditional frame collapse during distillation training in image-to-video (I2V) generation. Comprehensive experiments on VBench-I2V demonstrate that V-PAE outperforms existing acceleration methods by an average of 5.8% in the overall quality score, which covers semantic alignment, temporal coherence, and frame quality. In addition, our approach reduces the diffusion latency of a large-scale video model (e.g., Wan2.1-I2V-14B) by 100 times while preserving competitive performance.

Paper Structure

This paper contains 44 sections, 11 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Comparison between V-PAE and existing acceleration methods on VBench-I2V. It includes three distillation paradigms: (i) Consistency Distillation (CD), (ii) Variational Score Distillation (VSD) and (iii) Adversarial Distillation (AD). For fairness, all models are distilled from Wan2.1-I2V-14B [wan2025wan] using the same dataset and training cost. Diffusion latency is measured for 5-second $720 \times 1280$ videos on $8 \times$ H20 GPUs.
  • Figure 2: Overview. V-PAE first aligns the distributions of generated and real videos in the (a) stability priming process. Building on this process, it reuses the generator parameters for the discriminator backbone, which enables co-evolutionary adversarial training in the (b) unified adversarial equilibrium process. For conditional generation, we also provide the conditional SDS loss and semantic discriminator to (c) preserve video-image subject consistency.
  • Figure 3: The semantic discriminator head architecture.
  • Figure 4: Qualitative comparison with Wan2.1-I2V-14B. We compare our method against the baseline using both 100-NFE and 1-NFE sampling. For 100-NFE, videos are generated with 50 denoising steps and a guidance scale of 5.0.
  • Figure 5: Qualitative results of 1-NFE between V-PAE and existing acceleration distillation methods. We evaluate against representative methods from three paradigms, including DMD2 [yin2024improved] from VSD, PCM [wang2024phased] from CD, and APT [lin2025diffusion] from AD.
  • ...and 12 more figures
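
The Figure 4 caption equates 50 denoising steps at a guidance scale of 5.0 with 100 NFE (network function evaluations). The small sketch below shows one way that count can arise. It assumes classifier-free guidance is run as two forward passes per step (conditional and unconditional) and that the distilled model samples in a single pass without guidance. Neither detail is stated in this section, and the helper `nfe` is a hypothetical name used only for illustration. The resulting ratio counts evaluations only; it is not a measured latency.

```python
def nfe(steps: int, use_cfg: bool) -> int:
    """Count network function evaluations for one sampling run.

    Assumption: classifier-free guidance costs two forward passes per
    denoising step (one conditional, one unconditional).
    """
    passes_per_step = 2 if use_cfg else 1
    return steps * passes_per_step


# Baseline setting from the Figure 4 caption: 50 steps with guidance.
baseline_nfe = nfe(steps=50, use_cfg=True)

# Single-step sampling, assumed here to run without guidance.
distilled_nfe = nfe(steps=1, use_cfg=False)

# Ratio of evaluation counts (not wall-clock latency).
nfe_ratio = baseline_nfe / distilled_nfe

print(baseline_nfe, distilled_nfe, nfe_ratio)
```

Under these assumptions the evaluation count drops from 100 to 1, which matches the order of the 100-times diffusion latency reduction reported in the abstract. Measured latency also depends on factors this count ignores, such as hardware and parallelism.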