Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention

Xingyu Zhou, Leheng Zhang, Xiaorui Zhao, Keze Wang, Leida Li, Shuhang Gu

TL;DR

This work tackles the high computational and memory burden of Transformer-based video super-resolution by introducing MIA-VSR, a feature-level masked processing framework that exploits temporal continuity. It jointly presents a novel Inter&Intra-Frame Attention Block (IIAB) and an adaptive block-wise mask predictor to skip redundant computations, while preserving SR quality. Empirical results on REDS, Vimeo90K, and Vid4 show that MIA-VSR achieves competitive or superior PSNR/SSIM with substantially lower FLOPs and memory usage, including effective light-weight variants. The approach offers a practical path toward deploying high-performance VSR on constrained devices, with clear design principles for efficient temporal processing in recurrent Transformer architectures.

Abstract

Recently, Vision Transformers have achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task. Despite their superior VSR accuracy, the heavy computational burden and large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices. In this paper, we address this issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and inter frame Attention (MIA-VSR). The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block which takes the respective roles of past features and input features into consideration and only exploits previously enhanced features to provide supplementary information. In addition, an adaptive block-wise mask prediction module is developed to skip unimportant computations according to feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves memory and computation efficiency over state-of-the-art methods without trading off PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR.
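
The block-wise skipping idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the paper describes an adaptive (learned) mask prediction module, whereas the sketch below substitutes a fixed threshold on the mean absolute difference between adjacent frames' features. The names `block_skip_mask`, `masked_update`, `block`, `threshold`, and `enhance` are hypothetical, and reusing the previous frame's output in skipped blocks is an assumption about how skipped positions are filled.

```python
def block_skip_mask(prev, curr, block=2, threshold=0.05):
    """Return a 2D list of booleans, one per block.

    True means the block differs enough from the previous frame that it
    should be recomputed. The fixed threshold is an illustrative stand-in
    for a learned mask predictor.
    """
    h, w = len(curr), len(curr[0])
    mask = []
    for by in range(0, h, block):
        row = []
        for bx in range(0, w, block):
            # Mean absolute feature difference inside this block.
            diffs = [
                abs(curr[y][x] - prev[y][x])
                for y in range(by, min(by + block, h))
                for x in range(bx, min(bx + block, w))
            ]
            row.append(sum(diffs) / len(diffs) > threshold)
        mask.append(row)
    return mask


def masked_update(prev_out, curr, mask, enhance, block=2):
    """Recompute only the flagged blocks; reuse prev_out everywhere else.

    `enhance` is a hypothetical per-element function standing in for the
    expensive feature-enhancement computation.
    """
    h, w = len(curr), len(curr[0])
    out = [row[:] for row in prev_out]  # start from the previous result
    for i, mask_row in enumerate(mask):
        for j, recompute in enumerate(mask_row):
            if not recompute:
                continue  # skipped block: keep the previous output
            for y in range(i * block, min((i + 1) * block, h)):
                for x in range(j * block, min((j + 1) * block, w)):
                    out[y][x] = enhance(curr[y][x])
    return out
```

With a 4x4 feature map in which only the top-left element changes between frames, only the top-left 2x2 block is flagged, so `enhance` runs on 4 of the 16 positions. In a real model the savings would come from skipping linear layers on the masked tokens rather than from a per-element function.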

Paper Structure (23 sections, 10 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: PSNR (dB) and FLOPs (G) comparison on the Vid4 [liu2013bayesian] dataset. We compare our MIA-VSR model with state-of-the-art temporal sliding-window and recurrent VSR models, including EDVR [wang2019edvr], BasicVSR++ [chan2022basicvsr++], VRT [liang2022vrt], RVRT [liang2022recurrent] and PSRT [shi2022rethinking]. Our MIA-VSR model outperforms these methods and strikes a balance between performance and compute efficiency.
  • Figure 2: The overall architecture of MIA-VSR. We develop a feature-level masked processing framework which uses the mask prediction module (MPM) to reduce redundant computations by leveraging temporal continuity, and propose a masked intra-frame and inter-frame attention (MIA) block to make more rational use of previously enhanced features in support of the feature enhancement of the current frame. Our MIA-VSR model can be easily extended to the bi-directional second-order grid propagation framework, as in BasicVSR++ [chan2022basicvsr++]. More details of our proposed MIA-VSR can be found in the Method section.
  • Figure 3: Illustration of the inter&intra-frame attention block (IIAB) with the adaptive masked processing module. During inference, the adaptive mask prediction module (b) in the IIAB acts on the Attention module's linear layer which produces the Query, on the projection layer, and on the linear layer in the FFN module, to reduce temporally and spatially redundant computations (a). ${\bm{X}_{m,n}^t}^{\prime}$ and ${\bm{X}_{m,n}^t}^{\prime\prime}$ refer to the processed hidden features in the Attention and FFN modules, respectively.
  • Figure 4: Visualization of predicted masks for a sequence in the REDS dataset.
  • Figure 5: Visual comparison for $4\times$ VSR on the REDS4 and Vid4 datasets.
  • ...and 3 more figures
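
The inter&intra-frame attention described in the abstract and the Figure 3 caption can be sketched roughly as follows. This is one possible reading rather than the authors' code: queries are taken only from the current frame's tokens, while keys and values are taken from both the current tokens (intra-frame) and the previously enhanced tokens (inter-frame), so past features only supply supplementary information. The function name `inter_intra_attention` is hypothetical, and the learned Query/Key/Value projections, multi-head splitting, and windowing are omitted (identity projections are assumed) to keep the example short.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def inter_intra_attention(curr_tokens, prev_tokens):
    """Single-head attention with identity projections (an assumption).

    curr_tokens: list of feature vectors for the current frame.
    prev_tokens: list of previously enhanced feature vectors.
    Returns one output vector per current-frame token.
    """
    # Keys/values: current-frame tokens followed by previous-frame tokens.
    kv = curr_tokens + prev_tokens
    dim = len(curr_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    # Queries come only from the current frame.
    for q in curr_tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in kv]
        weights = softmax(scores)
        out.append(
            [sum(wt * v[d] for wt, v in zip(weights, kv)) for d in range(dim)]
        )
    return out
```

Because no queries are generated for the previous frame's tokens, the output has one vector per current-frame token regardless of how many past tokens are supplied; the attention weights for each query sum to one across the combined intra- and inter-frame keys.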