DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations

Xiaohui Li; Yihao Liu; Shuo Cao; Ziyan Chen; Shaobin Zhuang; Xiangyu Chen; Yinan He; Yi Wang; Yu Qiao

TL;DR

DiffVSR tackles the weakness of diffusion-based video restoration under complex real-world degradations by shifting focus from architectural complexity to learning strategy. The core contributions are a Progressive Learning Strategy (PLS) that decomposes degradation, data quality, and optimization into stages, and an Interweaved Latent Transition (ILT) that preserves temporal coherence without extra training. Architectural components such as Multi-Scale Temporal Attention (MSTA) and Temporal-Enhanced 3D VAE (TE-3DVAE) complement the learning strategy, culminating in robust 4× VSR on severely degraded videos and strong performance on real-world data, with favorable perceptual and temporal metrics and supportive user studies. This work reframes diffusion-based video restoration, showing that appropriately staged learning can unlock latent capabilities far beyond architectural tinkering, with practical implications for real-world video enhancement and related diffusion tasks.

Abstract

Diffusion models have demonstrated exceptional capabilities in image restoration, yet their application to video super-resolution (VSR) faces significant challenges in balancing fidelity with temporal consistency. Our evaluation reveals a critical gap: existing approaches consistently fail on severely degraded videos, precisely where diffusion models' generative capabilities are most needed. We identify that existing diffusion-based VSR methods struggle primarily because they face an overwhelming learning burden: simultaneously modeling complex degradation distributions, content representations, and temporal relationships with limited high-quality training data. To address this fundamental challenge, we present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes this learning burden through staged training, enabling superior performance on complex degradations. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead. Experiments demonstrate that our approach excels in scenarios where competing methods struggle, particularly on severely degraded videos. Our work reveals that addressing the learning strategy, rather than focusing solely on architectural complexity, is the critical path toward robust real-world video super-resolution with diffusion models.

Paper Structure (24 sections, 4 equations, 12 figures, 7 tables)

Introduction
Related Work
Method
Preliminary: Generative Diffusion Prior
Progressive Learning Strategy
Interweaved Latent Transition
Architectural Components
Experiments
Datasets and Implementation Details
Ablation Study
Comparisons
Conclusion
More Ablation Studies
Runtime Analysis for ILT and its alternatives
Effectiveness of Temporal-Enhanced 3DVAE
...and 9 more sections

Figures (12)

Figure 1: Motivation: Limitations of Existing VSR Methods on Complex Degradations. As degradation complexity increases, state-of-the-art methods show a significant performance drop, either producing over-smoothed results (oil painting effect) or failing to remove complex artifacts. This limitation persists across different architectural designs, revealing that architectural innovation alone is insufficient for handling complex real-world degradations. Our work addresses this fundamental challenge. (Zoom in for best view)
Figure 2: Overview of our proposed DiffVSR framework. (a) Model architecture with enhanced UNet and VAE. (b) Architectural improvements for feature extraction and reconstruction. (c) Progressive Learning Strategy (PLS), our core innovation for handling complex degradations. (d) Multi-Scale Temporal Attention (MSTA) for capturing temporal dependencies at different scales.
Figure 3: Interweaved Latent Transition approach illustrated. By combining strategic noise rescheduling across overlapping regions with position-based latent interpolation between adjacent subsequences, this lightweight solution ensures temporal consistency without requiring additional training or computational resources.
Figure 4: Illustration of three training strategy variants: direct training, partial progressive learning, and our full PLS approach.
Figure 5: Visual comparison on synthetic (top) and real-world (bottom) degraded videos. Our method achieves clear and natural restoration of text, building texture details, human hair, and animal fur under severe degradation, while competing methods either fail to remove artifacts or generate unnatural oil-painting-like details. (Zoom in for best view)
...and 7 more figures
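
The Figure 3 caption describes ILT as combining noise rescheduling with position-based latent interpolation between adjacent subsequences. The sketch below illustrates only the interpolation half, and only under assumptions that are not taken from the paper: each latent frame is a flat list of floats, neighboring subsequences share a fixed number of overlapping frames, and the blend weight rises linearly with position inside the overlap. The function name `blend_overlapping_latents` and its parameters are hypothetical and are not the authors' implementation.

```python
def blend_overlapping_latents(chunks, overlap):
    """Merge latent subsequences whose neighbors share `overlap` frames.

    Hypothetical helper: `chunks` is a list of subsequences, each a list of
    frames, and each frame is a flat list of floats (an assumed layout).
    """
    if overlap < 1:
        raise ValueError("overlap must be at least 1 frame")

    # Start from a copy of the first subsequence.
    merged = [list(frame) for frame in chunks[0]]

    for chunk in chunks[1:]:
        if overlap > len(merged) or overlap > len(chunk):
            raise ValueError("overlap exceeds subsequence length")

        start = len(merged) - overlap
        for i in range(overlap):
            # Assumed linear, position-based weight: frames near the start of
            # the overlap follow the earlier subsequence, frames near the end
            # follow the later one.
            w = (i + 1) / (overlap + 1)
            prev_frame = merged[start + i]
            next_frame = chunk[i]
            merged[start + i] = [
                (1.0 - w) * p + w * n for p, n in zip(prev_frame, next_frame)
            ]

        # Frames after the overlap come from the later subsequence unchanged.
        merged.extend(list(frame) for frame in chunk[overlap:])

    return merged


# Usage: two 4-frame subsequences sharing 2 frames yield 6 merged frames.
first = [[0.0, 0.0] for _ in range(4)]
second = [[1.0, 1.0] for _ in range(4)]
merged = blend_overlapping_latents([first, second], overlap=2)
```

Because no step involves learned parameters, a blend of this kind adds no training cost, which is consistent with the caption's description of ILT as a lightweight, training-free step. The noise rescheduling component is not modeled here.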
