Motion-Guided Latent Diffusion for Temporally Consistent Real-world Video Super-resolution

Xi Yang, Chenhang He, Jianqi Ma, Lei Zhang

TL;DR

This work tackles real-world video super-resolution under unknown degradations by introducing Motion-Guided Latent Diffusion (MGLD-VSR). It couples a pre-trained latent diffusion model with a motion-guided sampling mechanism that uses optical flow to align latent features across frames, and a temporal-aware decoder, fine-tuned with sequence-oriented losses, to stabilize details over time. The key innovations are motion-guided diffusion sampling (MDS) and the temporal-aware sequence decoder (TSD), together with losses designed to promote temporal continuity and perceptual quality. Results on synthetic and real-world benchmarks show state-of-the-art perceptual quality and competitive temporal consistency, supporting the combination of diffusion priors with motion dynamics for real-world VSR.

Abstract

Real-world low-resolution (LR) videos have diverse and complex degradations, imposing great challenges on video super-resolution (VSR) algorithms to reproduce their high-resolution (HR) counterparts with high quality. Recently, diffusion models have shown compelling performance in generating realistic details for image restoration tasks. However, the diffusion process is stochastic, making it hard to control the contents of restored images. This issue becomes more serious when applying diffusion models to VSR tasks because temporal consistency is crucial to the perceptual quality of videos. In this paper, we propose an effective real-world VSR algorithm by leveraging the strength of pre-trained latent diffusion models. To ensure content consistency among adjacent frames, we exploit the temporal dynamics in LR videos to guide the diffusion process by optimizing the latent sampling path with a motion-guided loss, ensuring that the generated HR video maintains a coherent and continuous visual flow. To further mitigate the discontinuity of generated details, we insert a temporal module into the decoder and fine-tune it with an innovative sequence-oriented loss. The proposed motion-guided latent diffusion (MGLD) based VSR algorithm achieves significantly better perceptual quality than state-of-the-art methods on real-world VSR benchmark datasets, validating the effectiveness of the proposed model design and training strategies.

Paper Structure (14 sections, 6 equations, 15 figures, 6 tables)


Figures (15)

  • Figure 1: The VSR results of 4 consecutive frames in Sequence 026 from VideoLQ. The top row shows the VSR results by RealBasicVSR [chan2022investigating], the middle row shows the naive VSR results obtained by running StableSR [wang2023exploiting] on each frame of the sequence, and the bottom row shows the VSR results produced by our proposed MGLD-VSR method. Our method generates realistic details while achieving good temporal consistency.
  • Figure 2: Overview of the proposed MGLD-VSR framework for real-world VSR. We first estimate the forward and backward optical flow with a pre-trained optical flow estimation network and then employ a motion-guided loss to update the latent diffusion sampling process. The motion-guided loss is computed by summing the forward and backward masked warping errors of the latent sequence at each time-step. After $T$ sampling steps, we obtain the generated latent sequence and feed it into the temporal-aware sequence decoder to reconstruct the VSR sequence.
  • Figure 3: Qualitative comparison on synthetic datasets (REDS4, UDM10, SPMCS) for $\times 4$ VSR. (Zoom-in for best view.)
  • Figure 4: Qualitative comparison on the real-world video dataset (VideoLQ) for $\times 4$ video SR. (Zoom-in for best view.)
  • Figure 5: Temporal profiles of competing real-world VSR methods.
  • ...and 10 more figures
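
The motion-guided loss described in the Figure 2 caption (the sum of forward and backward masked warping errors over the latent sequence) can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names `warp_nearest` and `motion_guided_loss`, the tensor layouts, the L1 error, the nearest-neighbour warping, and the convention that a mask value of 1 marks a valid (non-occluded) position are all assumptions made for illustration. The flows are assumed to be already resized to the latent resolution.

```python
import numpy as np


def warp_nearest(latent, flow):
    """Backward-warp a (C, H, W) latent with a (2, H, W) flow field (dx, dy).

    Assumption: nearest-neighbour sampling with border clamping, chosen only
    to keep the sketch short; a real system would likely interpolate.
    """
    _, h, w = latent.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, h - 1)
    return latent[:, src_y, src_x]


def motion_guided_loss(latents, fwd_flows, bwd_flows, fwd_masks, bwd_masks):
    """Sum of forward and backward masked warping errors (hypothetical form).

    latents:   (N, C, H, W) latent sequence at one sampling step
    fwd_flows: (N-1, 2, H, W) flow from frame i to frame i+1
    bwd_flows: (N-1, 2, H, W) flow from frame i+1 to frame i
    *_masks:   (N-1, H, W), 1 where the warp is trusted, 0 where occluded
    """
    total = 0.0
    for i in range(len(latents) - 1):
        # Forward term: bring frame i+1 back onto frame i's pixel grid.
        err_f = np.abs(latents[i] - warp_nearest(latents[i + 1], fwd_flows[i]))
        # Backward term: bring frame i onto frame i+1's pixel grid.
        err_b = np.abs(latents[i + 1] - warp_nearest(latents[i], bwd_flows[i]))
        total += (fwd_masks[i] * err_f).sum() + (bwd_masks[i] * err_b).sum()
    return float(total)


# Toy check: frame 1 is frame 0 shifted right by one latent pixel.
rng = np.random.default_rng(0)
z0 = rng.normal(size=(4, 6, 8))
z1 = np.roll(z0, 1, axis=2)
latents = np.stack([z0, z1])
fwd = np.zeros((1, 2, 6, 8)); fwd[0, 0] = 1.0    # dx = +1
bwd = np.zeros((1, 2, 6, 8)); bwd[0, 0] = -1.0   # dx = -1
fwd_mask = np.ones((1, 6, 8)); fwd_mask[0, :, -1] = 0  # column leaving the frame
bwd_mask = np.ones((1, 6, 8)); bwd_mask[0, :, 0] = 0   # column entering the frame

aligned = motion_guided_loss(latents, fwd, bwd, fwd_mask, bwd_mask)
unaligned = motion_guided_loss(latents, np.zeros_like(fwd), np.zeros_like(bwd),
                               fwd_mask, bwd_mask)
```

In the toy check, the loss vanishes when the flows match the true shift and is positive when the motion is ignored. The sketch only evaluates the loss; the caption states that the loss is used to update the latent sampling process, and how that update is applied at each step is not shown here.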