Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors

Wei Shang, Dongwei Ren, Wanying Zhang, Yuming Fang, Wangmeng Zuo, Kede Ma

TL;DR

This work tackles arbitrary-scale video super-resolution (AVSR) by first proposing a strong baseline (B-AVSR) that combines a flow-guided recurrent unit, a flow-refined cross-attention unit, and a hyper-upsampling unit. It then builds ST-AVSR on top of this baseline by incorporating a multi-scale structural and textural prior derived from a pre-trained VGG network, which helps discriminate structure from texture across locations and scales. The approach achieves state-of-the-art performance on REDS and Vid4, generalizes better to unseen scales and degradation models, and keeps inference fast thanks to pre-computed upsampling kernels. The authors provide code at the linked repository.

Abstract

Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we first describe a strong baseline for AVSR by putting together three variants of elementary building blocks: 1) a flow-guided recurrent unit that aggregates spatiotemporal information from previous frames, 2) a flow-refined cross-attention unit that selects spatiotemporal information from future frames, and 3) a hyper-upsampling unit that generates scale-aware and content-independent upsampling kernels. We then introduce ST-AVSR by equipping our baseline with a multi-scale structural and textural prior computed from the pre-trained VGG network. This prior has proven effective in discriminating structure and texture across different locations and scales, which is beneficial for AVSR. Comprehensive experiments show that ST-AVSR significantly improves super-resolution quality, generalization ability, and inference speed over the state-of-the-art. The code is available at https://github.com/shangwei5/ST-AVSR.
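
To make the idea of "scale-aware and content-independent upsampling kernels" more concrete, here is a minimal NumPy sketch. It is not the authors' implementation. It assumes that a small MLP (with random, untrained weights here) maps each HR pixel's sub-pixel offset and the inverse scale to a k×k kernel, and that kernels are softmax-normalized; both choices are assumptions made for illustration. The names `relative_coords`, `make_kernels`, and `upsample` are hypothetical. Because the kernels depend only on the frame size and the scale, never on pixel values, they can be computed once per scale and reused for every frame.

```python
import numpy as np


def relative_coords(lr_size, scale):
    """Map each HR pixel on one axis to its nearest LR pixel and sub-pixel offset."""
    hr_size = int(round(lr_size * scale))
    # HR pixel centres projected into LR pixel coordinates.
    pos = (np.arange(hr_size) + 0.5) / scale - 0.5
    idx = np.clip(np.round(pos).astype(int), 0, lr_size - 1)
    return idx, pos - idx


def make_kernels(lr_hw, scale, weights, k=3):
    """Predict one k x k kernel per HR pixel from (dy, dx, 1/scale) only.

    No image content is used, so the result can be cached per (size, scale).
    """
    iy, oy = relative_coords(lr_hw[0], scale)
    ix, ox = relative_coords(lr_hw[1], scale)
    H, W = len(iy), len(ix)
    feats = np.stack(
        [
            np.repeat(oy[:, None], W, axis=1),
            np.repeat(ox[None, :], H, axis=0),
            np.full((H, W), 1.0 / scale),
        ],
        axis=-1,
    )
    w1, b1, w2, b2 = weights  # hypothetical two-layer MLP (the "hyper" network)
    hidden = np.maximum(feats @ w1 + b1, 0.0)
    logits = hidden @ w2 + b2
    # Assumption: softmax so that every kernel sums to one.
    logits = logits - logits.max(axis=-1, keepdims=True)
    kernels = np.exp(logits)
    kernels = kernels / kernels.sum(axis=-1, keepdims=True)
    return iy, ix, kernels.reshape(H, W, k, k)


def upsample(lr, iy, ix, kernels):
    """Apply the per-pixel kernels to a single-channel LR frame."""
    k = kernels.shape[-1]
    padded = np.pad(lr, k // 2, mode="edge")
    out = np.zeros(kernels.shape[:2])
    for dy in range(k):
        for dx in range(k):
            # padded[i + dy, j + dx] is the LR neighbour at offset (dy - k//2, dx - k//2).
            out += kernels[:, :, dy, dx] * padded[iy[:, None] + dy, ix[None, :] + dx]
    return out


rng = np.random.default_rng(0)
weights = (
    rng.normal(size=(3, 16)),
    np.zeros(16),
    rng.normal(size=(16, 9)),
    np.zeros(9),
)

# Kernels for scale 2.5 on 4 x 6 frames: computed once, reused for every frame.
iy, ix, kernels = make_kernels((4, 6), 2.5, weights)
frame_a = rng.random((4, 6))
frame_b = rng.random((4, 6))
hr_a = upsample(frame_a, iy, ix, kernels)
hr_b = upsample(frame_b, iy, ix, kernels)
print(hr_a.shape, hr_b.shape)
```

In a trained model the MLP weights would be learned, and the same cached kernels would be applied to learned features rather than raw pixels; the caching pattern is what the sketch is meant to show.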

Paper Structure (25 sections, 10 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: Visualization of our multi-scale structural and textural prior derived from the pre-trained VGG network. A warmer color indicates a higher probability that the local patch at a given scale will be perceived as visual texture. Image borrowed from ding2021adists with permission.
  • Figure 2: System diagram of B-AVSR, which reconstructs an arbitrary-scale HR video $\hat{\bm y}$ from an LR video input $\bm x$. B-AVSR is composed of three variants of elementary building blocks: 1) a flow-guided recurrent unit to aggregate features from previous frames, 2) a flow-refined cross-attention unit to select features from future frames (see also Fig. \ref{fig:local}), and 3) a hyper-upsampling unit to prepare SR features and predict SR kernels for HR frame reconstruction. ST-AVSR is built on top of B-AVSR by replacing all instances of $\bm x$ with the multi-scale structural and textural prior $\bm p$ (see the detailed text description in Sec. \ref{subsec:st}).
  • Figure 3: Computational structure of the flow-refined cross-attention unit.
  • Figure 4: Visual comparison of different AVSR methods on the REDS dataset. Zoom in for better distortion visibility.
  • Figure 5: PSNR and LPIPS variations for different scaling factors on Vid4.
  • ...and 8 more figures