Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention
Xingyu Zhou, Leheng Zhang, Xiaorui Zhao, Keze Wang, Leida Li, Shuhang Gu
TL;DR
This work tackles the high computational and memory cost of Transformer-based video super-resolution by introducing MIA-VSR, a feature-level masked processing framework that exploits temporal continuity between adjacent frames. It combines an Inter&Intra-Frame Attention Block (IIAB) with an adaptive block-wise mask predictor that skips redundant computation while preserving SR quality. Results on REDS, Vimeo90K, and Vid4 show that MIA-VSR achieves competitive or superior PSNR/SSIM with substantially lower FLOPs and memory usage, including in its lightweight variants. The approach offers a practical path toward deploying high-performance VSR on resource-constrained devices and suggests design principles for efficient temporal processing in recurrent Transformer architectures.
Abstract
Recently, Vision Transformers have achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task. Despite their superior VSR accuracy, their heavy computational burden and large memory footprint hinder the deployment of Transformer-based VSR models on resource-constrained devices. In this paper, we address this issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and Inter Frame Attention (MIA-VSR). The core of MIA-VSR is to leverage feature-level temporal continuity between adjacent frames to reduce redundant computation and make better use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block that accounts for the distinct roles of past features and input features, using previously enhanced features only to provide supplementary information. In addition, we develop an adaptive block-wise mask prediction module that skips unimportant computations according to the feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves memory and computation efficiency over state-of-the-art methods without sacrificing PSNR. The code is available at https://github.com/LabShuHangGU/MIA-VSR.
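
The block-wise skipping idea described in the abstract can be illustrated with a minimal sketch. This is not the released MIA-VSR code: the function names block_mask and masked_update, the fixed threshold, the mean-absolute-difference similarity measure, and the scalar enhance callback are all assumptions made for illustration, whereas the paper describes an adaptive mask prediction module. The sketch only shows the principle that blocks whose features barely change between adjacent frames can reuse the previously enhanced result instead of being recomputed.

```python
def block_mask(prev, curr, block=2, threshold=0.1):
    """Per-block mask: True = recompute, False = reuse the previous result.

    Assumption: similarity is the mean absolute difference between the two
    feature maps, compared against a fixed threshold (hypothetical choice).
    """
    h, w = len(curr), len(curr[0])
    mask = []
    for by in range(0, h, block):
        row = []
        for bx in range(0, w, block):
            diffs = [
                abs(curr[y][x] - prev[y][x])
                for y in range(by, min(by + block, h))
                for x in range(bx, min(bx + block, w))
            ]
            row.append(sum(diffs) / len(diffs) > threshold)
        mask.append(row)
    return mask


def masked_update(prev_out, curr, mask, enhance, block=2):
    """Recompute only the masked blocks; copy the rest from prev_out."""
    h, w = len(curr), len(curr[0])
    out = [row[:] for row in prev_out]  # start from the previous frame's output
    for by, mask_row in enumerate(mask):
        for bx, recompute in enumerate(mask_row):
            if not recompute:
                continue  # skipped block: no computation is spent here
            for y in range(by * block, min((by + 1) * block, h)):
                for x in range(bx * block, min((bx + 1) * block, w)):
                    out[y][x] = enhance(curr[y][x])
    return out


# Toy 4x4 single-channel "features": only the top-left 2x2 block changes.
prev = [[0.0] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
for y in range(2):
    for x in range(2):
        curr[y][x] = 1.0

mask = block_mask(prev, curr)
prev_out = [[5.0] * 4 for _ in range(4)]  # stand-in for earlier enhanced output
out = masked_update(prev_out, curr, mask, enhance=lambda v: v * 2)
```

In this toy case only one of the four blocks is recomputed, so the stand-in enhance function runs on 4 of the 16 positions. The savings in a real model depend on how much of each frame the predictor decides to skip.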
