One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution

Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, Lei Zhang

TL;DR

Real-world video super-resolution must restore rich spatial detail while preserving temporal coherence under unknown degradations. The authors present DLoRAL, a dual-branch diffusion framework that decouples temporal consistency (C-LoRA) from detail synthesis (D-LoRA) and uses a Cross-Frame Retrieval (CFR) module to derive degradation-robust temporal priors, all within a one-step diffusion process. Training alternates between a consistency-focused phase and a detail-enhancement phase with smooth loss transitions, and inference merges both LoRA branches into the diffusion UNet for fast one-step restoration. Experiments report the best perceptual quality among the compared SD-based Real-VSR methods on the VideoLQ benchmark, together with good temporal stability, at roughly 10$\times$ the speed of Upscale-A-Video, MGLD, and STAR, which makes the approach practical for real-world video restoration.

Abstract

Reproducing rich spatial details while maintaining temporal consistency is a challenging problem in real-world video super-resolution (Real-VSR), especially when pre-trained generative models such as Stable Diffusion (SD) are leveraged for realistic detail synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model that achieves realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively during optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
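
The abstract states that the two LoRA branches are merged into the SD model at inference. The snippet below is a minimal sketch of what such a merge could look like for a single linear layer, assuming the standard LoRA formulation in which each branch contributes a low-rank update $BA$ to a frozen weight $W$. The function names, the `scale_c` and `scale_d` factors, and the toy matrices are illustrative assumptions and are not taken from the DLoRAL code.

```python
# Hypothetical sketch: fold two low-rank adapters into one frozen weight matrix.
# All names and scaling factors are assumptions, not the DLoRAL implementation.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def merge_dual_lora(w, c_lora, d_lora, scale_c=1.0, scale_d=1.0):
    """Return W + scale_c * (B_c @ A_c) + scale_d * (B_d @ A_d).

    w      : base weight, shape (out, in)
    c_lora : pair (B_c, A_c) with shapes (out, r) and (r, in)
    d_lora : pair (B_d, A_d) with the same layout
    """
    delta_c = matmul(*c_lora)  # low-rank update from the consistency branch
    delta_d = matmul(*d_lora)  # low-rank update from the detail branch
    return [
        [w_ij + scale_c * c_ij + scale_d * d_ij
         for w_ij, c_ij, d_ij in zip(w_row, c_row, d_row)]
        for w_row, c_row, d_row in zip(w, delta_c, delta_d)
    ]


def apply(w, x):
    """Apply a weight matrix to a vector given as a flat list."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


# Toy example with rank-1 adapters on a 2x2 identity weight.
w = [[1, 0], [0, 1]]
c_lora = ([[1], [0]], [[0, 2]])
d_lora = ([[0], [1]], [[3, 0]])
merged = merge_dual_lora(w, c_lora, d_lora)
```

Once the updates are folded into the weight, a forward pass costs the same as in the base model, which is consistent with the single-step inference described above. Whether DLoRAL applies per-branch scaling factors is not stated here, so both default to 1.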

Paper Structure

This paper contains 16 sections, 6 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Quality and efficiency comparison among SD-based Real-VSR methods. (a) Quality comparison on the VideoLQ benchmark [chan2022realbasicvsr]. (b) Efficiency comparison tested on an A100 GPU ($512 \times 512$ input with 50 frames for $\times 4$ VSR). DLoRAL achieves the best perceptual quality with only one diffusion step, about 10$\times$ faster than Upscale-A-Video [zhou2024upscale], MGLD [yang2024motion], and STAR [xie2025star].
  • Figure 2: The training pipeline of our proposed DLoRAL. The Cross-Frame Retrieval (CFR) and Consistency-LoRA (C-LoRA) modules are optimized in the consistency stage, while the Detail-LoRA (D-LoRA) is optimized in the enhancement stage. The two stages alternate during training to ensure both temporal coherence and visual quality.
  • Figure 3: Qualitative comparison of VSR models on the real-world VideoLQ dataset.
  • Figure 4: Temporal profiles of competing Real-ISR and Real-VSR methods.
  • Figure 5: LQ videos used in our user study and the voting results.
  • ...and 3 more figures
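
The alternating schedule in the Figure 2 caption can be sketched as a small helper that decides, for a given training step, which stage is active and which modules receive gradients. This is one possible reading only: the stage length `stage_len`, the ramp length `ramp`, the linear ramp shape, and the idea of a single "detail loss weight" are assumptions made for illustration, since this page does not specify how the smooth loss transition is implemented.

```python
# Illustrative schedule only: stage_len, ramp, and the linear blend are
# hypothetical choices, not values or logic taken from the paper.

def stage_at(step, stage_len=1000):
    """Return the active stage name and a map of trainable modules."""
    in_consistency = (step // stage_len) % 2 == 0
    if in_consistency:
        # Consistency stage: CFR and C-LoRA are optimized, D-LoRA is frozen.
        return "consistency", {"CFR": True, "C-LoRA": True, "D-LoRA": False}
    # Enhancement stage: CFR and C-LoRA are fixed, D-LoRA is optimized.
    return "enhancement", {"CFR": False, "C-LoRA": False, "D-LoRA": True}


def detail_loss_weight(step, stage_len=1000, ramp=100):
    """Weight on detail-oriented loss terms, changed gradually at stage switches."""
    stage_index, offset = divmod(step, stage_len)
    progress = min(offset / ramp, 1.0)
    if stage_index % 2 == 1:
        return progress        # enhancement stage: ramp from 0 up to 1
    if stage_index == 0:
        return 0.0             # first consistency stage: no detail loss yet
    return 1.0 - progress      # later consistency stages: ramp from 1 down to 0


name, trainable = stage_at(1500)
```

In a real training loop the `trainable` map would typically be used to toggle `requires_grad` on the corresponding parameter groups, and the weight would scale the detail-related loss terms before they are summed with the consistency terms.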