DC-VSR: Spatially and Temporally Consistent Video Super-Resolution with Video Diffusion Prior

Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-seung Lee, Kyuha Choi, Youngseok Han, Sunghyun Cho

TL;DR

DC-VSR addresses the ill-posed problem of video super-resolution by leveraging a video diffusion prior to synthesize high-frequency textures while enforcing spatio-temporal consistency. It introduces Spatial Attention Propagation (SAP) and Temporal Attention Propagation (TAP) to propagate information across spatio-temporal tiles, and Detail-Suppression Self-Attention Guidance (DSSAG) to sharpen details without adding computational overhead, all within an alternating diffusion-step framework. Built on Stable Video Diffusion, the method processes tiles of size $64\times64\times14$ with a $50\%$ overlap and uses cross-tile guidance from frames with rich details ($L=4$) via TAP, along with a gamma-controlled attention mechanism in DSSAG. Empirical results on REDS4, UDM10, and VideoLQ show improved temporal consistency (e.g., tOF, tLP) and competitive spatial quality (PSNR/SSIM) compared with state-of-the-art generative priors, along with strong perceptual quality metrics (MUSIQ, DOVER). These findings demonstrate that integrating a video diffusion prior with cross-tile attention and flexible guidance yields robust, texture-rich VSR for real-world degraded videos, with practical implications for long-form video enhancement.
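To make the tiling arithmetic concrete, here is a minimal sketch of how a latent video could be cut into $64\times64\times14$ tiles with $50\%$ overlap and then reassembled. It is not the authors' implementation: the function names (`tile_starts`, `split_tiles`, `merge_tiles`), the `(T, C, H, W)` layout, applying the overlap along all three axes, and merging by uniform averaging are all assumptions made for illustration.

```python
import numpy as np


def tile_starts(length, tile, overlap=0.5):
    """Start offsets of overlapping tiles that cover [0, length)."""
    if length <= tile:
        return [0]
    stride = max(1, int(tile * (1 - overlap)))
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # keep the last tile flush with the end
    return starts


def split_tiles(latent, tile_hw=64, tile_t=14, overlap=0.5):
    """Yield ((t0, y0, x0), tile) pairs for a latent of shape (T, C, H, W).

    Assumption: the same overlap ratio is used temporally and spatially.
    """
    T, _, H, W = latent.shape
    for t0 in tile_starts(T, tile_t, overlap):
        for y0 in tile_starts(H, tile_hw, overlap):
            for x0 in tile_starts(W, tile_hw, overlap):
                tile = latent[t0:t0 + tile_t, :, y0:y0 + tile_hw, x0:x0 + tile_hw]
                yield (t0, y0, x0), tile


def merge_tiles(tiles, shape):
    """Reassemble tiles by averaging overlapping regions.

    Assumption: uniform averaging; the paper's actual merge rule may differ.
    """
    acc = np.zeros(shape, dtype=np.float64)
    cnt = np.zeros(shape, dtype=np.float64)
    for (t0, y0, x0), tile in tiles:
        t, _, h, w = tile.shape
        acc[t0:t0 + t, :, y0:y0 + h, x0:x0 + w] += tile
        cnt[t0:t0 + t, :, y0:y0 + h, x0:x0 + w] += 1
    return acc / cnt  # every position is covered by at least one tile


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((28, 4, 128, 96))
    tiles = list(split_tiles(latent))
    merged = merge_tiles(tiles, latent.shape)
    print(len(tiles), np.allclose(merged, latent))
```

Splitting and merging an unmodified latent is lossless under this scheme. Inconsistencies appear only once each tile is denoised independently, which is the gap that SAP and TAP are described as closing.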

Abstract

Video super-resolution (VSR) aims to reconstruct a high-resolution (HR) video from a low-resolution (LR) counterpart. Achieving successful VSR requires producing realistic HR details and ensuring both spatial and temporal consistency. To restore realistic details, diffusion-based VSR approaches have recently been proposed. However, the inherent randomness of diffusion models, combined with the tile-based processing these approaches rely on, often leads to spatio-temporal inconsistencies. In this paper, we propose DC-VSR, a novel VSR approach that produces spatially and temporally consistent results with realistic textures. To achieve spatial and temporal consistency, DC-VSR adopts a novel Spatial Attention Propagation (SAP) scheme and a Temporal Attention Propagation (TAP) scheme that propagate information across spatio-temporal tiles based on the self-attention mechanism. To enhance high-frequency details, we also introduce Detail-Suppression Self-Attention Guidance (DSSAG), a novel diffusion guidance scheme. Comprehensive experiments demonstrate that DC-VSR achieves spatially and temporally consistent, high-quality VSR results, outperforming previous approaches.

Paper Structure

This paper contains 32 sections, 12 equations, 12 figures, 8 tables.

Figures (12)

  • Figure 1: Overall pipeline of proposed DC-VSR. An encoded low-resolution video latent is concatenated to the current noisy latent $x_t$, and it undergoes alternating denoising processes using both Spatial Attention Propagation (SAP) and Temporal Attention Propagation (TAP). At this stage, the noisy latent is split and merged before and after the denoising process, respectively. After each denoising step, Detail-Suppression Self-Attention Guidance (DSSAG) is applied to enhance the quality of the image further.
  • Figure 2: Denoised results of the unconditional term at an intermediate timestep ($t=16$ out of 25), (a) without and (b) with DSSAG. The input video is from the VideoLQ dataset [chan2022investigating] (sample 013).
  • Figure 3: (a) and (b) are VSR results of image-diffusion-prior-based methods. (c) and (d) show the effect of the proposed TAP. The input video is from the VideoLQ dataset [chan2022investigating] (sample 008).
  • Figure 4: Visualization of the effect of the proposed SAP. The input video is from Pexels (© Tima Miroshnichenko).
  • Figure 5: Qualitative comparison with previous guidance approaches. The input video is from the VideoLQ dataset [chan2022investigating] (sample 041).
  • ...and 7 more figures