SRFormerV2: Taking a Closer Look at Permuted Self-Attention for Image Super-Resolution

Yupeng Zhou; Zhen Li; Chun-Le Guo; Li Liu; Ming-Ming Cheng; Qibin Hou

TL;DR

The paper introduces Permuted Self-Attention (PSA), an efficient mechanism that enables large-window self-attention for image super-resolution by compressing the key and value (K, V) channels and permuting spatial tokens into the channel dimension. This allows attention over large windows (up to $40\times40$) at modest computational cost, so the SRFormer architecture and its ConvFFN enhancement achieve state-of-the-art results across classical, lightweight, and real-world SR tasks. Building on SRFormer, SRFormerV2 enlarges the window size and channel capacity, incorporates small-window local blocks to fuse global and local information, and attains new state-of-the-art performance with competitive compute and parameter counts. The work shows that PSA can serve as a scalable paradigm for SR backbone design, improving reconstruction quality while preserving efficiency.
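The permutation step can be illustrated with a small pure-Python sketch. It assumes a reduction factor `r` (a hypothetical name), that K and V are first compressed to `C // r**2` channels, and that each `r x r` block of spatial tokens is then concatenated into one token. The exact channel ordering used in the paper is not stated here, so the ordering below is an assumption, as are all names and sizes.

```python
def permute_spatial_to_channel(grid, r):
    """Fold each r x r block of tokens into one token with r*r times the channels.

    grid: S x S nested list; grid[i][j] is the list of channel values of one token.
    Returns an (S // r) x (S // r) nested list of merged tokens.
    """
    size = len(grid)
    if size % r != 0:
        raise ValueError("window size must be divisible by r")
    out = []
    for bi in range(0, size, r):
        row = []
        for bj in range(0, size, r):
            merged = []
            # Concatenate the channels of the r*r tokens in this block
            # (row-major order inside the block; an assumed ordering).
            for di in range(r):
                for dj in range(r):
                    merged.extend(grid[bi + di][bj + dj])
            row.append(merged)
        out.append(row)
    return out


# Hypothetical sizes: window size S, channel count C, reduction factor r.
S, C, r = 4, 8, 2
c_small = C // (r * r)  # K/V channels after compression

# Stand-in for a compressed K (or V) window: token (i, j) holds the value i*S + j.
k_small = [[[float(i * S + j)] * c_small for j in range(S)] for i in range(S)]
k_perm = permute_spatial_to_channel(k_small, r)

num_queries = S * S                        # queries keep full spatial resolution
num_keys = len(k_perm) * len(k_perm[0])    # keys shrink by a factor of r*r
key_channels = len(k_perm[0][0])           # channel count is back to C
attn_map_shape = (num_queries, num_keys)
```

Under these assumptions the queries keep all $S^2$ positions and the keys keep $C$ channels, while the attention map shrinks from $S^2 \times S^2$ to $S^2 \times S^2/r^2$. Spatial values are moved into channels rather than discarded.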

Abstract

Previous works have shown that increasing the window size of Transformer-based image super-resolution models (e.g., SwinIR) can significantly improve model performance. However, the computational overhead also grows considerably as the window size increases. In this paper, we present SRFormer, a simple but novel method that enjoys the benefits of large-window self-attention while introducing even less computational burden. The core of SRFormer is permuted self-attention (PSA), which strikes an appropriate balance between channel and spatial information for self-attention. Without any bells and whistles, SRFormer achieves a PSNR of 33.86 dB on the Urban100 dataset, which is 0.46 dB higher than SwinIR while using fewer parameters and computations. In addition, we scale up the model by further enlarging the window size and the number of channels to explore the potential of Transformer-based models. Experiments show that the scaled model, named SRFormerV2, further improves the results and achieves state-of-the-art performance. We hope our simple and effective approach will be useful for future research in super-resolution model design. The homepage is https://z-yupeng.github.io/SRFormer/.
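To make the window-size trade-off concrete, the sketch below counts multiply-accumulate operations for the two attention matrix products only ($QK^\top$ and the attention-weighted sum over $V$). It is a simplified accounting under stated assumptions, not the paper's complexity formula: projections, softmax, and the FFN are ignored, and the function name, the `kv_reduction` parameter, and the feature-map sizes are all hypothetical.

```python
def attention_macs(height, width, channels, window, kv_reduction=1):
    """Multiply-accumulates for Q @ K^T plus attn @ V, summed over all windows.

    kv_reduction is the assumed factor r by which each spatial side of K and V
    is folded into channels; kv_reduction=1 is plain window self-attention.
    """
    if height % window or width % window:
        raise ValueError("feature map must tile evenly into windows")
    if window % kv_reduction:
        raise ValueError("window must be divisible by kv_reduction")
    num_windows = (height // window) * (width // window)
    queries = window * window
    keys = queries // (kv_reduction * kv_reduction)
    # Both matrix products cost queries * keys * channels multiply-accumulates.
    per_window = 2 * queries * keys * channels
    return num_windows * per_window


# Hypothetical 48 x 48 feature map with 64 channels.
small_window = attention_macs(48, 48, 64, window=8)
large_window = attention_macs(48, 48, 64, window=24)
large_window_psa = attention_macs(48, 48, 64, window=24, kv_reduction=2)
```

In this simplified count, tripling the window side from 8 to 24 multiplies the attention cost by 9, while folding K and V with `kv_reduction=2` divides the large-window cost by 4.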

Paper Structure (18 sections, 2 equations, 9 figures, 7 tables)

This paper contains 18 sections, 2 equations, 9 figures, 7 tables.

Introduction
Related Work
CNN-Based Image Super-Resolution
Vision Transformers
Method
Overall Architecture
Permuted Self-Attention Block
Large-Window Self-Attention Variants
SRFormerV2: Scaling the SRFormer
Experiments
Experimental Setup
Ablation Study
Classical Image Super-Resolution
Lightweight Image Super-Resolution
Real-World Image Super-Resolution
...and 3 more sections

Figures (9)

Figure 1: Performance comparison among SwinIR, HAT, our SRFormer, and SRFormerV2. The radius of each circle represents the number of model parameters. "WS" stands for the attention window size; e.g., "WS: 24" stands for $24\times24$ attention windows. Our SRFormer enjoys large window sizes with even fewer computations but higher PSNR scores.
Figure 2: Overall architecture of SRFormer. The pixel embedding module is a $3\times3$ convolution to map the input image to feature space. The HR image reconstruction module contains a $3\times3$ convolution and a pixel shuffle operation to reconstruct the high-resolution image. The middle feature encoding part has $N$ PAB groups, followed by a $3\times3$ convolution.
Figure 3: Comparison between (a) self-attention and (b) our proposed permuted self-attention. To avoid spatial information loss, we propose to reduce the channel numbers and transfer the spatial information to the channel dimension.
Figure 4: Power spectrum of the intermediate feature maps produced by our SRFormer with FFN and ConvFFN. Lines in darker color correspond to features from deeper layers.
Figure 5: The enhancement roadmap for SRFormerV2, starting from the original SRFormer. Through multifaceted improvements, we scale SRFormer up and achieve significant performance gains. Backslash-hatched blocks indicate enhancements that also improve performance but are not adopted in SRFormerV2. The orange square represents the previous state-of-the-art (SOTA) method, HAT (Chen et al., 2022).
...and 4 more figures
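
The pixel shuffle operation named in the Figure 2 caption can be sketched in pure Python. This follows the common channel-to-space convention (the one used by PyTorch's `PixelShuffle`). Whether SRFormer's reconstruction module uses exactly this ordering is an assumption, and the function name and nested-list layout are illustrative.

```python
def pixel_shuffle(x, scale):
    """Rearrange (C * scale**2, H, W) nested lists into (C, H * scale, W * scale)."""
    in_channels = len(x)
    if in_channels % (scale * scale) != 0:
        raise ValueError("channel count must be divisible by scale**2")
    out_channels = in_channels // (scale * scale)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0.0] * (w * scale) for _ in range(h * scale)]
           for _ in range(out_channels)]
    for c in range(out_channels):
        for i in range(scale):
            for j in range(scale):
                # Input channel c*scale^2 + i*scale + j supplies the (i, j)
                # sub-pixel offset of output channel c.
                src = x[c * scale * scale + i * scale + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * scale + i][col * scale + j] = src[row][col]
    return out


# Four 1 x 1 channels become one 2 x 2 channel at scale 2.
low_res = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
high_res = pixel_shuffle(low_res, 2)
```

Here `high_res` is a single channel `[[1.0, 2.0], [3.0, 4.0]]`: the convolution before the shuffle produces `scale**2` times more channels, and the shuffle trades them for spatial resolution.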
