Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution

Bin Chen, Weiqi Li, Shijie Zhao, Xuanyu Zhang, Junlin Li, Li Zhang, Jian Zhang

TL;DR

This work proposes an improved ADC method for Real-VSR, and introduces a dual-head adversarial distillation scheme, in which discriminators in both pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other.

Abstract

While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this by condensing generation into a single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path by pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved ADC method for Real-VSR. Our approach distills a large diffusion Transformer (DiT) teacher, DOVE, equipped with 3D spatio-temporal attention, into a pruned 2D Stable Diffusion (SD)-based AdcSR backbone augmented with lightweight 1D temporal convolutions, achieving significantly higher efficiency. In addition, we introduce a dual-head adversarial distillation scheme, in which discriminators in both pixel and feature domains explicitly disentangle the discrimination of details and consistency into two heads, enabling both objectives to be effectively optimized without sacrificing one for the other. Experiments demonstrate that the resulting compressed AdcVSR model reduces complexity by 95% in parameters and achieves an 8$\times$ acceleration over its DiT teacher DOVE, while maintaining competitive video quality and efficiency.
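The two ingredients named in the abstract, temporal convolutions added to a 2D backbone and a discriminator with separate heads for details and consistency, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the shared per-channel temporal kernel, the zero padding at clip boundaries, and the binary cross-entropy loss are all assumptions chosen for brevity.

```python
import numpy as np


def temporal_residual_block(x, kernel):
    """Add a 1D convolution over the time axis back onto the input.

    x: (T, C, H, W) per-frame features from a 2D backbone.
    kernel: (K,) odd-length weights shared across channels and pixels
            (an assumption; a learned layer would also mix channels).
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    num_frames = x.shape[0]
    half = len(kernel) // 2
    # Zero-pad only the time axis so the output keeps T frames.
    padded = np.pad(x, ((half, half), (0, 0), (0, 0), (0, 0)))
    mixed = sum(w * padded[k:k + num_frames] for k, w in enumerate(kernel))
    return x + mixed


def dual_head_bce(features, w_detail, w_consist, y_detail, y_consist):
    """Sum of binary cross-entropy losses from two linear heads.

    features: (N, D) shared discriminator features.
    w_detail, w_consist: (D,) weights of the two hypothetical heads.
    y_detail, y_consist: (N,) labels in {0, 1}, one label set per head.
    """
    def bce(logits, labels):
        # Numerically stable -[y*log(s(z)) + (1 - y)*log(1 - s(z))].
        return np.mean(np.maximum(logits, 0) - logits * labels
                       + np.log1p(np.exp(-np.abs(logits))))

    return (bce(features @ w_detail, y_detail)
            + bce(features @ w_consist, y_consist))
```

With an all-zero temporal kernel the block reduces to the identity, so in this sketch the per-frame behaviour of the 2D backbone is unchanged until the temporal weights move away from zero. Because each head receives its own labels, a sample can be marked as real for one criterion and fake for the other, which is the kind of disentanglement the abstract describes.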

Paper Structure (16 sections, 2 equations, 9 figures, 8 tables, 1 algorithm)

Figures (9)

  • Figure 1: Comparison of methods in compressing diffusion networks for Real-VSR. (a) Traditional ADC [chen2025adversarial] distills an SD network with 2D spatial attention into a pruned student using a single adversarial signal without temporal modeling, suffering from frame flickering. (b) Our improved ADC distills a larger DiT-based teacher with heavier 3D spatio-temporal attention into the same 2D student, augmented by 1D temporal convolutions, using dual-head discriminators ${\mathcal{D}}$ in pixel and feature domains. By disentangling the discrimination of detail richness and temporal consistency into different heads, it balances the optimization of both.
  • Figure 2: Illustration of the proposed improved ADC method, with application to compressing DOVE (teacher) into the AdcVSR model (student). (a) We augment the 2D AdcSR Real-ISR network [chen2025adversarial], consisting of a pruned SD2.1 UNet and VAE decoder, by inserting 1D temporal residual blocks (RBs) after each 2D spatial RB and Transformer block (TB), enabling temporal modeling while maintaining efficiency for Real-VSR. The resulting AdcVSR network is then fully trained end-to-end via adversarial distillation from DOVE. (b) For adversarial learning, we design dual-head discriminators with pretrained backbones for feature extraction. Each discriminator uses 2D and 1D convolutions, followed by two linear projection heads at the tail, to separately evaluate detail richness and temporal consistency. Training is guided by five curated types of video and image data (1-5) with head-specific labels, achieving a balanced optimization of details and consistency.
  • Figure 2: Comparison of network designs on UDM10.
  • Figure 3: Qualitative comparison of Real-VSR performance on 13th frames of two videos: "028" from RealVSR (top) and "016" from VideoLQ (bottom). Temporal profiles are provided below each frame, obtained by taking slices along the width-temporal plane at the vertical centers of the frames.
  • Figure 4: Performance comparison among diffusion-based Real-VSR methods in temporal consistency and complexity (parameter number and inference time) (see Tab. \ref{tab:comp_main}). AdcVSR attains the lowest warping error $E_\text{warp}^{*}$, the second-lowest parameter number, and the second-highest inference speed. Bubble colors represent method types: green for multi-step, blue for one-step, and red for AdcVSR.
  • ...and 4 more figures
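
The temporal profiles mentioned in the Figure 3 caption are slices along the width-temporal plane at the vertical center of each frame. A minimal sketch of that slicing is given below, assuming the video is stored as a NumPy array with time as the first axis and height as the second; the function name `temporal_profile` is hypothetical, and for an even frame height the row at index `H // 2` is used as the center.

```python
import numpy as np


def temporal_profile(video):
    """Stack the center row of every frame over time.

    video: (T, H, W) grayscale or (T, H, W, C) color array.
    Returns an array of shape (T, W) or (T, W, C).
    """
    video = np.asarray(video)
    center_row = video.shape[1] // 2  # vertical center of the frame
    return video[:, center_row]
```

In such an image each row corresponds to one frame, so temporally stable content tends to appear as smooth vertical streaks, while flicker tends to show up as jagged or noisy streaks.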