DC-VSR: Spatially and Temporally Consistent Video Super-Resolution with Video Diffusion Prior
Janghyeok Han, Gyujin Sim, Geonung Kim, Hyun-seung Lee, Kyuha Choi, Youngseok Han, Sunghyun Cho
TL;DR
DC-VSR addresses the ill-posed problem of video super-resolution (VSR) by leveraging a video diffusion prior to synthesize high-frequency textures while enforcing spatio-temporal consistency. It introduces Spatial Attention Propagation (SAP) and Temporal Attention Propagation (TAP) to propagate information across spatio-temporal tiles, and Detail-Suppression Self-Attention Guidance (DSSAG) to sharpen details without adding computational overhead, all within an alternating diffusion-step framework. Built on Stable Video Diffusion, the method processes tiles of size $64\times64\times14$ with $50\%$ overlap. TAP provides cross-tile guidance from frames with rich details ($L=4$), and DSSAG uses a gamma-controlled attention mechanism. Empirical results on REDS4, UDM10, and VideoLQ show improved temporal consistency (measured by, e.g., tOF and tLP) and competitive spatial quality (PSNR/SSIM) compared with state-of-the-art methods based on generative priors, along with strong scores on perceptual quality metrics (MUSIQ, DOVER). These findings indicate that combining a video diffusion prior with cross-tile attention propagation and flexible guidance yields robust, texture-rich VSR for real-world degraded videos, with practical implications for long-form video enhancement.
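To make the tiling figures concrete, the following is a minimal sketch of how a video volume could be split into $64\times64\times14$ tiles with $50\%$ overlap. It is not the authors' implementation. The helper names `tile_starts` and `spatio_temporal_tiles` are hypothetical, and the sketch assumes that the same overlap applies along height, width, and time, that the stride is half the tile size, and that the last tile on each axis is shifted back to end exactly at the border. The summary does not state any of these choices, nor whether the tile size is measured in pixels or in latent units.

```python
from itertools import product


def tile_starts(length, tile, overlap=0.5):
    """Start offsets of tiles of size `tile` covering [0, length).

    Assumption: stride = tile * (1 - overlap); the final tile is shifted
    back so that it ends exactly at `length`.
    """
    if length <= tile:
        return [0]  # a single tile already covers the whole axis
    stride = max(1, int(tile * (1 - overlap)))
    starts = list(range(0, length - tile + 1, stride))
    if starts[-1] != length - tile:
        starts.append(length - tile)  # cover the remaining border region
    return starts


def spatio_temporal_tiles(height, width, frames, tile=(64, 64, 14), overlap=0.5):
    """Return (y, x, t) start offsets for every spatio-temporal tile.

    Assumption: the same fractional overlap is used on all three axes.
    """
    tile_h, tile_w, tile_t = tile
    return list(
        product(
            tile_starts(height, tile_h, overlap),
            tile_starts(width, tile_w, overlap),
            tile_starts(frames, tile_t, overlap),
        )
    )


if __name__ == "__main__":
    tiles = spatio_temporal_tiles(height=128, width=128, frames=28)
    print(len(tiles), "tiles; first:", tiles[0], "last:", tiles[-1])
```

Under these assumptions, neighbouring tiles share half of their extent along each axis. That shared region is where independently sampled diffusion outputs can disagree, which is the inconsistency that SAP and TAP are described as addressing.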
Abstract
Video super-resolution (VSR) aims to reconstruct a high-resolution (HR) video from a low-resolution (LR) counterpart. Achieving successful VSR requires producing realistic HR details and ensuring both spatial and temporal consistency. To restore realistic details, diffusion-based VSR approaches have recently been proposed. However, the inherent randomness of diffusion, combined with the tile-based processing these approaches rely on, often leads to spatio-temporal inconsistencies. In this paper, we propose DC-VSR, a novel VSR approach that produces spatially and temporally consistent results with realistic textures. To achieve spatial and temporal consistency, DC-VSR adopts a novel Spatial Attention Propagation (SAP) scheme and a Temporal Attention Propagation (TAP) scheme that propagate information across spatio-temporal tiles based on the self-attention mechanism. To enhance high-frequency details, we also introduce Detail-Suppression Self-Attention Guidance (DSSAG), a novel diffusion guidance scheme. Comprehensive experiments demonstrate that DC-VSR achieves spatially and temporally consistent, high-quality VSR results, outperforming previous approaches.
