BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation

Wei Shang, Wanying Zhang, Shuhang Gu, Pengfei Zhu, Qinghua Hu, Dongwei Ren

TL;DR

Arbitrary-scale video super-resolution (AVSR) is challenging because it requires faithful texture restoration and temporal consistency across diverse scaling factors. BasicAVSR addresses this by integrating four components: multi-scale frequency priors derived from image Laplacian pyramids, a flow-guided propagation unit for temporal aggregation, a second-order motion compensation unit for more accurate alignment, and a hyper-upsampling unit that generates scale-aware, content-independent upsampling kernels, which can be pre-computed. The authors report state-of-the-art SR quality, strong generalization to unseen degradations and scales, and efficient inference in both online and offline settings. This makes AVSR more practical for real-world streaming and offline processing, and the released code supports reproducibility and further development.

Abstract

Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline, BasicAVSR, for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state of the art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at https://github.com/shangwei5/BasicAVSR.
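
The three propagation variants differ mainly in which input frames can, in principle, influence a given output frame. The following is a minimal sketch of that temporal visibility only, not of the network itself. The function names, the mode strings ("online", "lookahead", "offline"), and the `lookahead` parameter are hypothetical and are not taken from the BasicAVSR code.

```python
from typing import List


def visible_frames(t: int, num_frames: int, mode: str, lookahead: int = 0) -> List[int]:
    """Indices of input frames that may influence output frame t.

    Illustrative assumption: a recurrent unit carries information from
    every frame it has already processed.
    """
    if not 0 <= t < num_frames:
        raise ValueError("t must be a valid frame index")
    if mode == "online":
        # (i) unidirectional: past frames and the current frame only
        last = t
    elif mode == "lookahead":
        # (ii) unidirectional with a limited number of future frames
        last = min(t + lookahead, num_frames - 1)
    elif mode == "offline":
        # (iii) bidirectional: the whole clip is available
        last = num_frames - 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return list(range(last + 1))


def output_delay(t: int, num_frames: int, mode: str, lookahead: int = 0) -> int:
    """Number of future frames that must arrive before frame t can be emitted."""
    return visible_frames(t, num_frames, mode, lookahead)[-1] - t


if __name__ == "__main__":
    for mode in ("online", "lookahead", "offline"):
        frames = visible_frames(2, 6, mode, lookahead=2)
        delay = output_delay(2, 6, mode, lookahead=2)
        print(mode, frames, "delay:", delay)
```

Under these assumptions the online variant has zero output delay, the lookahead variant waits for at most `lookahead` future frames, and the offline variant needs the full clip before emitting any frame.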

Paper Structure

This paper contains 23 sections, 6 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Visualization of Laplacian pyramid decomposition under different input resolutions. (a) Original image. (b) Laplacian pyramid visualization (high-resolution input). (c) Laplacian pyramid visualization (low-resolution input).
  • Figure 2: System diagram of BasicAVSR, which reconstructs an arbitrary-scale HR video $\hat{\bm y}$ from an LR video input $\bm x$. BasicAVSR is composed of four elementary building blocks: 1) multi-scale frequency priors, which provide scale-specific pixel-level priors for AVSR; all instances of $\bm x$ are replaced with the multi-scale frequency prior $\bm p$ (see the detailed text description in Sec. \ref{subsec:st}), 2) a flow-guided propagation unit to aggregate features from adjacent frames, 3) a second-order motion compensation unit to mitigate misalignment in backward warping (see also Fig. \ref{fig:local}), and 4) a hyper-upsampling unit to prepare SR features and predict SR kernels for HR frame reconstruction.
  • Figure 3: Comparison of traditional alignment and our proposed motion compensation. The displacement is first roughly estimated from the optical flow. A window of size $r$, centered at the roughly estimated pixel coordinates, is then opened in the adjacent frames and searched for the pixel most similar to the source pixel, which completes the motion compensation.
  • Figure 4: Data pre-processing and training pipeline for BasicAVSR.
  • Figure 5: Visual comparison of different AVSR methods on the REDS dataset. Zoom in for better distortion visibility.
  • ...and 5 more figures
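
The window search described in the Figure 3 caption can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it works on scalar grayscale values stored in nested lists, measures similarity with the absolute difference, and treats $r$ as a window radius. The function name `refine_match` and its arguments are hypothetical.

```python
from typing import Sequence, Tuple

Frame = Sequence[Sequence[float]]


def refine_match(src: Frame, adj: Frame, y: int, x: int,
                 flow: Tuple[float, float], r: int) -> Tuple[int, int]:
    """Find the pixel in `adj` that best matches src[y][x].

    flow is the (dy, dx) optical-flow estimate at (y, x). r is the search
    radius around the flow-estimated position (assumption: the caption's
    "window of size r" is interpreted as a radius).
    """
    h, w = len(adj), len(adj[0])
    # Coarse estimate: follow the flow, round to a pixel, clamp to the frame.
    cy = min(max(int(round(y + flow[0])), 0), h - 1)
    cx = min(max(int(round(x + flow[1])), 0), w - 1)

    target = src[y][x]
    best = (cy, cx)
    best_cost = abs(adj[cy][cx] - target)

    # Local refinement: scan the window and keep the most similar pixel.
    # Ties keep the earlier candidate, starting with the coarse estimate.
    for ny in range(max(cy - r, 0), min(cy + r, h - 1) + 1):
        for nx in range(max(cx - r, 0), min(cx + r, w - 1) + 1):
            cost = abs(adj[ny][nx] - target)
            if cost < best_cost:
                best, best_cost = (ny, nx), cost
    return best


if __name__ == "__main__":
    src = [[0.0] * 5 for _ in range(5)]
    adj = [[0.0] * 5 for _ in range(5)]
    src[1][1] = 9.0   # source pixel
    adj[2][3] = 9.0   # its true location in the adjacent frame
    # The flow points to (2, 2), one pixel away from the true match.
    print(refine_match(src, adj, 1, 1, flow=(1.2, 1.1), r=1))
```

With `r=0` the function reduces to plain flow-based alignment, which is the case the figure contrasts against. A feature-space similarity measure would replace the absolute difference in a learned model.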