SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training

Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang

TL;DR

SeedVR2 tackles high-resolution, one-step video restoration (VR) by fine-tuning a diffusion transformer through adversarial post-training, initialized from SeedVR. It introduces an adaptive window attention mechanism and combines progressive distillation, RpGAN, approximate R2 regularization, and a discriminator-based feature matching loss to stabilize training and enhance perceptual fidelity. Across synthetic, real-world, and AIGC datasets, SeedVR2 achieves competitive or superior results with a single sampling step and markedly lower latency than multi-step diffusion-based VR. The approach demonstrates that fast, high-quality VR is practical in real-world scenarios, while acknowledging limitations in encoding/decoding latency and in robustness under heavy degradations.

Abstract

Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet incur a prohibitive computational cost during inference. While several distillation-based approaches have shown the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle challenging high-resolution VR within a single step, we introduce several enhancements to both the model architecture and the training procedure. Specifically, we propose an adaptive window attention mechanism in which the window size is dynamically adjusted to fit the output resolution, avoiding the window inconsistency observed in high-resolution VR when window attention uses a predefined window size. To stabilize and improve adversarial post-training for VR, we further verify the effectiveness of a series of losses, including a proposed feature matching loss, without significantly sacrificing training efficiency. Extensive experiments show that SeedVR2 achieves performance comparable to or even better than that of existing VR approaches in a single step.
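To make the window-inconsistency argument concrete, here is a minimal sketch of how window sizes along one axis might be computed. It assumes the adaptive window size is simply the ceiling of the axis length divided by a target number of windows; the paper's exact rule may differ. The function names and the lengths used are hypothetical, chosen only for illustration.

```python
import math


def fixed_window_sizes(length, window):
    """Sizes of consecutive windows along one axis for a predefined window size.

    A trailing remainder becomes a smaller, ragged window.
    """
    full, rest = divmod(length, window)
    return [window] * full + ([rest] if rest else [])


def adaptive_window_sizes(length, num_windows):
    """Sizes along one axis when the window size is derived from the axis length.

    Assumption (not from the paper): window = ceil(length / num_windows),
    which yields at most `num_windows` windows of nearly equal size.
    """
    window = math.ceil(length / num_windows)
    return fixed_window_sizes(length, window)


# Illustrative axis lengths in feature-map tokens (hypothetical values).
print(fixed_window_sizes(100, 64))    # one full window plus a much smaller one
print(adaptive_window_sizes(100, 3))  # three windows of nearly equal size
```

With a predefined size, the last window can be much smaller than the rest whenever the axis length is not a multiple of the window size. Deriving the size from the resolution keeps the windows close to uniform.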

Paper Structure

This paper contains 14 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Speed and performance comparisons. Our SeedVR2 demonstrates impressive restoration capabilities, offering fine details and enhanced visual realism. While achieving performance comparable to SeedVR [wang2025seedvr], our SeedVR2 is over $4\times$ faster than existing diffusion-based video restoration approaches [zhou2024upscaleavideo, yang2023mgldvsr, he2024venhancer, xie2025star] (we use 50 sampling steps for these baselines to maintain stable performance), despite having at least four times the parameter count (zoom in for best view).
  • Figure 2: Model architecture and the partition of the adaptive attention window. We improve the Swin-MMDIT [wang2025seedvr] with an adaptive window partition, i.e., the window size is determined by a $3 \times 3$ partition of the resized LQ input ($\mathrm{Height} \times \mathrm{Width} = 960 \times 960$). The features used to compute the feature matching loss are extracted before the cross-attention layers used in APT [lin2025diffusion].
  • Figure 3: Qualitative comparisons on both real-world [chan2022investigating] and AIGC videos. With a single sampling step, our SeedVR2 achieves performance comparable to SeedVR [wang2025seedvr] and outperforms the other baselines in restoration quality, i.e., it successfully removes the degradations while preserving the textures of the bird, text, building, and the dog's face (zoom in for best view).
  • Figure 4: Comparison of window attention with a predefined size (i.e., ours w/ predefined win. atten.) and our adaptive window attention (i.e., ours w/ adaptive win. atten.). Boundary artifacts can be observed in high-resolution restoration with the predefined-size window attention (zoom in for best view).
  • Figure 5: Qualitative comparisons on both real-world [chan2022investigating] and AIGC videos. The GAN-based approach [zhang2024realviformer] generates blurry results due to its limited generation ability. Previous multi-step diffusion-based VR methods [he2024venhancer, zhou2024upscaleavideo, yang2023mgldvsr, xie2025star] either fail to restore the low-quality video with faithful details or tend to generate oversharpened details. Even with a single sampling step, our approach outperforms these methods by a large margin (zoom in for best view).
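The feature matching loss mentioned in the Figure 2 caption compares discriminator features of restored and real samples. The sketch below is a generic version of that idea under stated assumptions, not the paper's formulation: it uses a mean absolute (L1) distance averaged over layers, and represents each layer's features as a flat list of floats. The function name, the choice of distance, and the toy feature values are all hypothetical.

```python
def feature_matching_loss(real_feats, fake_feats):
    """Average per-layer mean absolute difference between two feature sets.

    `real_feats` and `fake_feats` are lists with one flat list of floats per
    discriminator layer. The L1 distance is an assumption for illustration.
    """
    if len(real_feats) != len(fake_feats):
        raise ValueError("both inputs must cover the same number of layers")
    total = 0.0
    for real, fake in zip(real_feats, fake_feats):
        if len(real) != len(fake):
            raise ValueError("feature lengths must match within each layer")
        # Mean absolute difference for this layer.
        total += sum(abs(r - f) for r, f in zip(real, fake)) / len(real)
    # Average over layers so that each layer contributes equally.
    return total / len(real_feats)


# Toy features from two hypothetical discriminator layers.
real = [[1.0, 2.0], [0.0]]
fake = [[1.0, 4.0], [3.0]]
print(feature_matching_loss(real, fake))  # 2.0
```

Because the features come from the discriminator that is already evaluated during adversarial training, a loss of this kind needs no separate feature network, which is consistent with the abstract's remark about training efficiency.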