
D$^2$-VR: Degradation-Robust and Distilled Video Restoration with Synergistic Optimization Strategy

Jianfeng Liang, Shaocheng Shen, Botao Xu, Qiang Hu, Xiaoyun Zhang

TL;DR

D$^2$-VR addresses the latency and temporal instability of diffusion-based video restoration under real-world degradations. It introduces three components: a degradation-robust flow alignment module, an adversarial distillation pipeline that compresses diffusion sampling into a few-step regime, and a synergistic optimization that couples a feature-based spatial adversarial loss with Temporal-LPIPS to enforce temporal coherence. The method combines motion-compensated conditioning from previous frames with a single-image diffusion prior to achieve high perceptual fidelity and strong temporal consistency. It reports state-of-the-art performance on perceptual and temporal metrics with 4-step sampling, roughly 12x faster than conventional diffusion sampling, and remains adaptable to arbitrary priors. The authors position this as a step toward practical deployment of diffusion-based video restoration in real-world workflows such as streaming and mobile imaging.

Abstract

The integration of diffusion priors with temporal alignment has emerged as a transformative paradigm for video restoration, delivering fantastic perceptual quality, yet the practical deployment of such frameworks is severely constrained by prohibitive inference latency and temporal instability when confronted with complex real-world degradations. To address these limitations, we propose D$^2$-VR, a single-image diffusion-based video-restoration framework with low-step inference. To obtain precise temporal guidance under severe degradation, we first design a Degradation-Robust Flow Alignment (DRFA) module that leverages confidence-aware attention to filter unreliable motion cues. We then incorporate an adversarial distillation paradigm to compress the diffusion sampling trajectory into a rapid few-step regime. Finally, a synergistic optimization strategy is devised to harmonize perceptual quality with rigorous temporal consistency. Extensive experiments demonstrate that D$^2$-VR achieves state-of-the-art performance while accelerating the sampling process by 12$\times$.
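
The abstract does not spell out how the temporal-consistency term is computed. As a rough illustration only, one common way to define a temporal perceptual term is to compare the frame-to-frame perceptual change of the restored video with that of the reference video. The sketch below assumes that formulation; it may differ from the Temporal-LPIPS used in the paper. The names `mse_distance` and `temporal_perceptual_gap` are hypothetical, and a plain mean-squared distance stands in for a learned LPIPS network so that the example stays self-contained.

```python
from typing import Callable, Sequence

# A frame is a flattened list of pixel values here; real code would use image tensors.
Frame = Sequence[float]


def mse_distance(a: Frame, b: Frame) -> float:
    """Stand-in for LPIPS (an assumption): mean squared difference of two frames."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def temporal_perceptual_gap(
    restored: Sequence[Frame],
    reference: Sequence[Frame],
    distance: Callable[[Frame, Frame], float] = mse_distance,
) -> float:
    """Mean over t of |d(restored_t, restored_{t-1}) - d(reference_t, reference_{t-1})|."""
    if len(restored) != len(reference) or len(restored) < 2:
        raise ValueError("need two equally long sequences of at least 2 frames")
    gaps = [
        abs(
            distance(restored[t], restored[t - 1])
            - distance(reference[t], reference[t - 1])
        )
        for t in range(1, len(restored))
    ]
    return sum(gaps) / len(gaps)


# Toy sequences of three 2-pixel frames.
reference = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
steady = [[0.5, 0.5], [1.5, 1.5], [2.5, 2.5]]   # same motion as the reference, constant offset
flicker = [[0.0, 0.0], [2.0, 2.0], [2.0, 2.0]]  # uneven frame-to-frame change

print(temporal_perceptual_gap(steady, reference))   # motion matches the reference
print(temporal_perceptual_gap(flicker, reference))  # motion deviates from the reference
```

Under this definition a restored sequence is penalized for changing between frames differently than the reference does, not for a constant per-frame error, which is why such a term is usually paired with a separate spatial loss.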

Paper Structure (11 sections, 6 equations, 2 figures, 3 tables)

Figures (2)

  • Figure 1: D$^2$-VR overview. (a) Training and inference pipeline, including the computation of the three loss terms. (b) DRFA module architecture. (c) Perceptual quality versus speed trade-off between our method and existing approaches; the x-axis shows inference speed (FPS) and the y-axis shows perceptual quality (CLIP-IQA).
  • Figure 2: Qualitative comparison with existing methods on benchmarks.