Enhancing Perceptual Quality in Video Super-Resolution through Temporally-Consistent Detail Synthesis using Diffusion Models

Claudio Rota; Marco Buzzelli; Joost van de Weijer

TL;DR

This work tackles perceptual quality in video super-resolution by leveraging diffusion models to synthesize realistic, temporally-consistent details. It introduces StableVSR, which adapts a pre-trained SISR latent diffusion model through a Temporal Conditioning Module, guided by Temporal Texture Guidance from adjacent frames. A Frame-wise Bidirectional Sampling strategy further stabilizes temporal coherence by balancing information flow across time. Across Vimeo-90K and REDS4, StableVSR delivers improved perceptual metrics and temporal consistency, albeit with higher computational cost, highlighting a meaningful shift toward perceptual-focused VSR with diffusion-based generation.

Abstract

In this paper, we address the problem of enhancing perceptual quality in video super-resolution (VSR) using Diffusion Models (DMs) while ensuring temporal consistency among frames. We present StableVSR, a VSR method based on DMs that can significantly enhance the perceptual quality of upscaled videos by synthesizing realistic and temporally-consistent details. We introduce the Temporal Conditioning Module (TCM) into a pre-trained DM for single image super-resolution to turn it into a VSR method. TCM uses the novel Temporal Texture Guidance, which provides it with spatially-aligned and detail-rich texture information synthesized in adjacent frames. This guides the generative process of the current frame toward high-quality and temporally-consistent results. In addition, we introduce the novel Frame-wise Bidirectional Sampling strategy to encourage the use of information from past to future and vice-versa. This strategy improves the perceptual quality of the results and the temporal consistency across frames. We demonstrate the effectiveness of StableVSR in enhancing the perceptual quality of upscaled videos while achieving better temporal consistency compared to existing state-of-the-art methods for VSR. The project page is available at https://github.com/claudiom4sir/StableVSR.
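The Frame-wise Bidirectional Sampling strategy described above can be illustrated with a small scheduling sketch. It is not the authors' implementation: the function name `bidirectional_schedule`, the choice to start with a forward sweep, and the convention that the first frame of each sweep has no guidance frame (`None`) are assumptions made for illustration. The sketch only shows the visiting order: one sampling step is taken on every frame before moving to the next step, and the direction in video time alternates between steps.

```python
from typing import Iterator, Optional, Tuple


def bidirectional_schedule(
    num_frames: int, num_steps: int
) -> Iterator[Tuple[int, int, Optional[int]]]:
    """Yield (t, frame, guide) triples in visiting order.

    t     : sampling step, counted down from num_steps to 1
    frame : index of the frame that takes the sampling step
    guide : index of the previously visited frame in this sweep, whose
            texture would guide `frame` (None for the first frame of a
            sweep; this is an illustrative assumption)
    """
    forward = True  # assumption: the first sweep runs past -> future
    for t in range(num_steps, 0, -1):
        if forward:
            order = range(num_frames)
        else:
            order = range(num_frames - 1, -1, -1)
        guide: Optional[int] = None
        for frame in order:
            yield t, frame, guide
            guide = frame
        forward = not forward  # flip direction in video time for the next step


if __name__ == "__main__":
    for t, frame, guide in bidirectional_schedule(num_frames=3, num_steps=2):
        print(f"t={t} frame={frame} guided_by={guide}")
```

In a full pipeline, each yielded triple would correspond to one denoising step on `frame`, conditioned on texture information from `guide`. Alternating the sweep direction lets later frames inform earlier ones as well as the reverse.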

Paper Structure (24 sections, 8 equations, 12 figures, 5 tables, 2 algorithms)

Introduction
Related work
Background on Diffusion Models
Methodology
Temporal Conditioning Module
Temporal Texture Guidance
Frame-wise Bidirectional Sampling strategy
Training procedure
Experiments
Implementation details
Datasets and evaluation metrics
Comparison with state-of-the-art methods
Ablation study
Discussion and limitations
Conclusion
...and 9 more sections

Figures (12)

Figure 1: Graphical representation of the proposed Frame-wise Bidirectional Sampling strategy. The green flow propagates information forward in sampling time while the blue flow alternately propagates it forward and backward in video time. Forward propagation is shown with dashed lines, while backward propagation with dotted lines.
Figure 2: Overview of the proposed StableVSR. We use the Temporal Conditioning Module (Section \ref{subsec:tcm}) to turn a single image super-resolution LDM (denoising UNet) into a video super-resolution method. TCM exploits the novel Temporal Texture Guidance (Section \ref{subsec:ttg}), which provides TCM with spatially-aligned and detail-rich texture information synthesized in adjacent frames. The sampling step is taken using the novel Frame-wise Bidirectional Sampling strategy (Section \ref{subsec:bss}). $\mathcal{D}$ represents the VAE decoder. Green lines refer to progression in sampling time, while blue lines refer to progression in video time.
Figure 2: Additional qualitative comparison with state-of-the-art methods for VSR. Only the proposed StableVSR correctly upscales complex textures.
Figure 3: Comparison between guidance on $x_t$ and $\tilde{x}_0$. Compared to $x_t$ (first column), $\tilde{x}_0$ computed via Eq. \ref{eq:x0} contains very little noise regardless of the sampling step $t$ (second column). We can observe $\tilde{x}_0$ is closer to $x_0$ as $t$ decreases (third column). Here, $x_0$ corresponds to the last sampling step, i.e. when $t=1$. In addition, $\tilde{x}_0$ increases its level of detail as $t$ decreases (fourth column).
Figure 3: Additional ablation experiments for the Temporal Texture Guidance. We show the results obtained on three consecutive frames. Only the proposed solution ensures temporal consistency at the fine-detail level over time. Results on sequence 015 of REDS4.
...and 7 more figures
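
The estimate $\tilde{x}_0$ referenced in the first Figure 3 caption is not reproduced on this page. As a hedged reminder, under the standard DDPM noise-prediction parameterization (an assumption here; the paper's own equation may use different notation or operate on latents), the clean-sample estimate at sampling step $t$ is usually written as:

$$
\tilde{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}
$$

Here $\bar{\alpha}_t$ is the cumulative noise-schedule product and $\epsilon_\theta$ is the denoising network's noise prediction; both symbols are assumed from the standard formulation rather than taken from this page. Since $\bar{\alpha}_t \to 1$ as $t$ decreases, the correction term shrinks and $\tilde{x}_0$ approaches $x_t$, which is consistent with the caption's observation that $\tilde{x}_0$ gets closer to $x_0$ at later sampling steps.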
