Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution

Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, Tao Mei

TL;DR

SATeCo learns spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. It optimizes only two purpose-built modules, spatial feature adaptation and temporal feature alignment, in the decoders of the UNet and VAE.

Abstract

Diffusion models are just at a tipping point for the image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution, which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE, and only optimizes two deliberately designed modules, spatial feature adaptation (SFA) and temporal feature alignment (TFA), in the decoders of the UNet and VAE. SFA modulates frame features by adaptively estimating affine parameters for each pixel, providing pixel-wise guidance for high-resolution frame synthesis. TFA models feature interaction within a 3D local window (tubelet) through self-attention, and executes cross-attention between each tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
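
The two modules described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the channels-last layout, the `1 + scale` modulation form, the linear maps that predict the affine parameters, the residual connections, the tubelet size, and the omission of learned query/key/value projections are all assumptions made for brevity. The function names (`sfa`, `tfa`, `to_tubelets`, `from_tubelets`) are hypothetical.

```python
import numpy as np

def sfa(feat, guidance, w_scale, b_scale, w_shift, b_shift):
    """Spatial feature adaptation sketch: per-pixel affine modulation.
    feat: (T, H, W, C); guidance: (T, H, W, Cg); weights: (Cg, C)."""
    # A per-pixel linear map (equivalent to a 1x1 convolution) predicts
    # one scale and one shift for every pixel and channel. (Assumed form.)
    scale = guidance @ w_scale + b_scale
    shift = guidance @ w_shift + b_shift
    return feat * (1.0 + scale) + shift

def to_tubelets(x, size):
    """Split (T, H, W, C) into non-overlapping 3D windows.
    Returns (num_tubelets, t*h*w, C); T, H, W must be divisible by size."""
    T, H, W, C = x.shape
    t, h, w = size
    x = x.reshape(T // t, t, H // h, h, W // w, w, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, t * h * w, C)

def from_tubelets(tokens, size, shape):
    """Inverse of to_tubelets: reassemble tokens into (T, H, W, C)."""
    T, H, W, C = shape
    t, h, w = size
    x = tokens.reshape(T // t, H // h, W // w, t, h, w, C)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    return x.reshape(T, H, W, C)

def attention(q, k, v):
    """Scaled dot-product attention, batched over tubelets."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tfa(video_feat, lr_feat, size=(2, 4, 4)):
    """Temporal feature alignment sketch: self-attention inside each tubelet,
    then cross-attention to the co-located low-resolution tubelet."""
    x = to_tubelets(video_feat, size)
    g = to_tubelets(lr_feat, size)
    x = x + attention(x, x, x)  # tokens interact within their own tubelet
    x = x + attention(x, g, g)  # queries from x, keys/values from LR guidance
    return from_tubelets(x, size, video_feat.shape)

# Toy shapes, chosen only for illustration.
rng = np.random.default_rng(0)
T, H, W, C, Cg = 4, 8, 8, 6, 3
feat = rng.standard_normal((T, H, W, C))
guidance = rng.standard_normal((T, H, W, Cg))
lr_feat = rng.standard_normal((T, H, W, C))
w_scale, w_shift = 0.1 * rng.standard_normal((2, Cg, C))
b_scale, b_shift = np.zeros(C), np.zeros(C)

modulated = sfa(feat, guidance, w_scale, b_scale, w_shift, b_shift)
aligned = tfa(modulated, lr_feat, size=(2, 4, 4))
print(modulated.shape, aligned.shape)
```

Because attention is restricted to a tubelet, a change in one window cannot leak into another, which keeps the cost local while still coupling adjacent frames inside each window.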

Paper Structure

This paper contains 13 sections, 5 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: An illustration of video super-resolution, with two adjacent frames generated by StableSR (Wang et al., 2023), VRT (Liang et al., 2022) and our SATeCo. The region at the same local position is presented in the zoom-in view.
  • Figure 2: An overview of our SATeCo architecture. The input LR video $X_L$ is first up-sampled to the target resolution via a transformer-based video upscaler. Then, the up-sampled video $X_u$ is fed into the VAE encoder to extract the video features and latent code $Z$. Next, Gaussian noise is added to $Z$ according to the diffusion scheduler, and the noisy video latent code is then restored by the UNet for quality enhancement. In latent space, a latent encoder extracts the LR latent feature maps $G$ from the LR latent code $Z$, followed by spatial feature adaptation (SFA) and temporal feature alignment (TFA) modules in each decoder block of the UNet for spatial-temporal guidance learning. Given the denoised video latent code $Z_0$, the VAE decoder decodes the video $X_d$ based on the guidance learnt by SFA and TFA on LR video features. Finally, the decoded video $X_d$ is adjusted by a video refiner, with reference to $X_u$, to synthesize the final HR video $X_H$.
  • Figure 3: An illustration of (a) video upscaler, (b) video refiner, (c) spatial feature adaptation and (d) temporal feature alignment module.
  • Figure 4: Six visual examples of video super-resolution results by different approaches on the REDS4 and Vid4 datasets. The region in the red box is presented in the zoom-in view for comparison.
  • Figure 5: Video super-resolution results of two videos in the Vid4 dataset. The region in the same local position across two adjacent frames (i.e., regions highlighted by red and blue boxes) is scaled up to show more details.
  • ...and 2 more figures
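
The noising step in the Figure 2 caption ("Gaussian noise is added to $Z$ according to the diffusion scheduler") can be sketched as follows. This is a minimal illustration that assumes the standard DDPM forward process, $z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, with a linear beta schedule. The schedule values, step count, latent shape, and the names `make_alpha_bar` and `add_noise` are assumptions, not details taken from the paper.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.
    The schedule endpoints are common DDPM defaults, assumed here."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(z, t, alpha_bar, rng):
    """Sample the noisy latent z_t from the clean latent z at timestep t."""
    eps = rng.standard_normal(z.shape)
    z_t = np.sqrt(alpha_bar[t]) * z + np.sqrt(1.0 - alpha_bar[t]) * eps
    return z_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Hypothetical latent code Z: (frames, height, width, channels).
z = rng.standard_normal((4, 8, 8, 4))
z_t, eps = add_noise(z, t=500, alpha_bar=alpha_bar, rng=rng)
print(z_t.shape)
```

A larger `t` gives a smaller $\bar{\alpha}_t$, so more of the up-sampled video's latent content is replaced by noise before the UNet restores it.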