MAT: Multi-Range Attention Transformer for Efficient Image Super-Resolution

Chengxing Xie, Xiaoming Zhang, Linze Li, Yuqian Fu, Biao Gong, Tianrui Li, Kai Zhang

TL;DR

MAT addresses the computational cost of enlarging the self-attention window in SR by introducing multi-range attention (MA) and sparse multi-range attention (SMA), which operate over regional and dilated (sparse) global neighborhoods. It combines these with a Local Aggregation Block (LAB) and MSConvStar to capture hierarchical features across local, regional, and sparse global scales. Experiments on lightweight and classical SR benchmarks show MAT achieving state-of-the-art performance while reducing parameters and computation, with MAT-light about 3.3x faster than SRFormer-light. The approach demonstrates practical benefits for real-time SR and scalable high-quality restoration.

Abstract

Image super-resolution (SR) has significantly advanced through the adoption of Transformer architectures. However, conventional techniques that enlarge the self-attention window to capture broader context come with inherent drawbacks, especially significantly increased computational demands. Moreover, feature perception within a fixed-size window restricts the effective receptive field (ERF) and the diversity of intermediate features in existing models. We demonstrate that a flexible integration of attention across diverse spatial extents can yield significant performance enhancements. In line with this insight, we introduce the Multi-Range Attention Transformer (MAT) for SR tasks. MAT leverages the computational advantages of the dilation operation, in conjunction with the self-attention mechanism, to facilitate both multi-range attention (MA) and sparse multi-range attention (SMA), enabling efficient capture of both regional and sparse global features. Combined with local feature extraction, MAT adeptly captures dependencies across various spatial ranges, improving the diversity and efficacy of its feature representations. We also introduce the MSConvStar module, which augments the model's ability for multi-range representation learning. Comprehensive experiments show that MAT outperforms existing state-of-the-art SR models with remarkable efficiency (~3.3x faster than SRFormer-light).
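
To make the dilation idea concrete, here is a minimal sketch of how dilated sampling lets a window cover a wider area without adding tokens. It is an illustration under stated assumptions, not the paper's implementation: the function name `dilated_window_groups`, the block-wise partitioning scheme, and the example dilation values are all hypothetical, and the actual grouping in MAT may differ.

```python
def dilated_window_groups(height, width, window, dilation):
    """Partition an H x W grid into groups of window*window coordinates
    whose members are spaced `dilation` pixels apart (assumed scheme)."""
    span = window * dilation  # side length of the area shared by interleaved groups
    if height % span or width % span:
        raise ValueError("height and width must be multiples of window * dilation")
    groups = []
    for block_y in range(0, height, span):
        for block_x in range(0, width, span):
            # Each (off_y, off_x) offset selects one interleaved sub-grid of the block.
            for off_y in range(dilation):
                for off_x in range(dilation):
                    groups.append([
                        (block_y + off_y + i * dilation, block_x + off_x + j * dilation)
                        for i in range(window)
                        for j in range(window)
                    ])
    return groups


# Hypothetical per-head dilations: attention would run inside each group.
for dilation in (1, 2, 4):
    groups = dilated_window_groups(16, 16, window=2, dilation=dilation)
    tokens_per_group = len(groups[0])
    extent = (2 - 1) * dilation + 1  # pixels spanned along one axis
    print(dilation, len(groups), tokens_per_group, extent)
```

In this sketch every group holds `window * window` tokens whatever the dilation, so the number of pairwise attention scores per group stays fixed while the spatial extent grows with the dilation. Assigning different dilations to different heads is one way to read "multi-range" here.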

Paper Structure

This paper contains 18 sections, 10 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: Comparison of trade-offs between model performance and overheads on Urban100 [huang2015single] for $\times 4$ SR. The area of each circle denotes the Multi-Adds of the corresponding model. Our MAT-light achieves the best SR performance with fewer parameters and Multi-Adds.
  • Figure 2: (a) Illustration of image redundancy in natural images with self-similarity. Efficiently utilizing similar and repetitive structures and elements in natural images can aid in the reconstruction of image features. (b) In natural images, features can be observed to form a hierarchical structure across different spatial scales. The single and fixed-size WSA is insufficient to fully leverage such hierarchical features.
  • Figure 3: The overall architecture of Multi-Range Attention Transformer (MAT).
  • Figure 4: Illustration of the window self-attention (WSA), multi-range attention (MA) and sparse multi-range attention (SMA). MA and SMA set different range sizes for different attention heads, enabling the multi-range representation learning.
  • Figure 5: Illustration of FFN [liang2021swinir], ConvFFN [zhou2023srformer], ConvStar and MSConvStar. $\odot$: element-wise multiplication (star).
  • ...and 11 more figures
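
The Figure 5 caption defines the star operation as element-wise multiplication. The following minimal sketch shows that operation on a single feature vector. It is a simplification under stated assumptions: the helper names `linear` and `star` and the example weights are hypothetical, plain linear maps stand in for the convolutions that ConvStar and MSConvStar presumably use, and any activation or multi-scale branching is omitted because this section does not describe it.

```python
def linear(x, weights, bias):
    """Apply y = W x + b to one feature vector stored as a plain list."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def star(x, w1, b1, w2, b2):
    """Star operation: element-wise product of two projections of the same input."""
    branch_a = linear(x, w1, b1)
    branch_b = linear(x, w2, b2)
    return [a * b for a, b in zip(branch_a, branch_b)]


# Hypothetical weights chosen only to keep the arithmetic easy to follow.
x = [1.0, 2.0]
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]   # identity branch -> [1.0, 2.0]
w2, b2 = [[2.0, 0.0], [0.0, 3.0]], [1.0, 1.0]   # scaled branch   -> [3.0, 7.0]
print(star(x, w1, b1, w2, b2))  # [3.0, 14.0]
```

Because the two branches are multiplied rather than added, the output contains products of input features, which is what separates a star-style block from a plain feed-forward layer in this sketch.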