Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach

Gang Wu, Junjun Jiang, Junpeng Jiang, Xianming Liu

TL;DR

This paper introduces the Convolutional Transformer layer (ConvFormer) and proposes a ConvFormer-based Super-Resolution network (CFSR), an effective and efficient solution for lightweight image super-resolution.

Abstract

Recent progress in single-image super-resolution (SISR) has achieved remarkable performance, yet the computational costs of these methods remain a challenge for deployment on resource-constrained devices. In particular, transformer-based methods, which leverage self-attention mechanisms, have led to significant breakthroughs but also introduce substantial computational costs. To tackle this issue, we introduce the Convolutional Transformer layer (ConvFormer) and propose a ConvFormer-based Super-Resolution network (CFSR), offering an effective and efficient solution for lightweight image super-resolution. The proposed method inherits the advantages of both convolution-based and transformer-based approaches. Specifically, CFSR utilizes large kernel convolutions as a feature mixer to replace the self-attention module, efficiently modeling long-range dependencies and extensive receptive fields with minimal computational overhead. Furthermore, we propose an edge-preserving feed-forward network (EFN) designed to achieve local feature aggregation while effectively preserving high-frequency information. Extensive experiments demonstrate that CFSR strikes an optimal balance between computational cost and performance compared to existing lightweight SR methods. When benchmarked against state-of-the-art methods such as ShuffleMixer, the proposed CFSR achieves a gain of 0.39 dB on the Urban100 dataset for the $2\times$ super-resolution task while requiring 26% and 31% fewer parameters and FLOPs, respectively. The code and pre-trained models are available at https://github.com/Aitical/CFSR.
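
The cost argument in the abstract (a large kernel convolution standing in for self-attention) can be illustrated with a rough multiply-accumulate (MAC) count. The sketch below is not taken from the paper: it assumes global self-attention over all H x W positions and a plain depth-wise convolution, and the feature-map size and channel width are hypothetical values chosen for illustration. The kernel sizes 5, 7, 9, and 11 are the ones listed in the Figure 4 caption.

```python
def self_attention_macs(h, w, c):
    """Rough MAC count for one global self-attention layer.

    Counts the Q/K/V/output projections (4 * N * C^2) plus computing the
    attention matrix and applying it to V (2 * N^2 * C), with N = h * w.
    Softmax, normalization, and bias terms are ignored.
    """
    n = h * w
    return 4 * n * c * c + 2 * n * n * c


def depthwise_conv_macs(h, w, c, k):
    """Rough MAC count for one k x k depth-wise convolution (stride 1, same padding)."""
    return h * w * c * k * k


if __name__ == "__main__":
    # Hypothetical feature-map size and channel width, not values from the paper.
    h, w, c = 64, 64, 48
    attn = self_attention_macs(h, w, c)
    for k in (5, 7, 9, 11):
        conv = depthwise_conv_macs(h, w, c, k)
        print(f"k={k:2d}: depth-wise conv {conv:>12,d} MACs, "
              f"about {attn / conv:.1f}x fewer than global attention")
```

The attention cost grows quadratically with the number of positions, while the depth-wise convolution grows linearly, which is why the gap widens on larger feature maps. Window-based attention, which is common in SR transformers, would narrow the gap, so these counts are only an upper-bound style comparison.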

Paper Structure (15 sections, 15 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Illustration of PSNR, FLOPs, and parameter counts of different SISR models on the Urban100 dataset for the $4\times$ SR task. The proposed CFSR approach achieves superior performance at lower computational cost.
  • Figure 2: Detailed implementation and components of the proposed CFSR. The architecture is built mainly by stacking basic residual blocks, each of which contains several ConvFormer layers. The ConvFormer layer plays a pivotal role and contains the proposed large kernel feature mixer (LK Mixer) and edge-preserving feed-forward network (EFN).
  • Figure 3: Illustration of the edge-preserving depth-wise convolution (EDC). It has a multi-branch structure with pre-defined gradient kernels and, through re-parameterization, is equivalent to a single $3\times3$ depth-wise convolution at inference time.
  • Figure 4: LAM attributions for different kernel sizes. From left to right: the reference input in the first column, and LAM attributions for kernel sizes 5, 7, 9, and 11 in the second and third columns.
  • Figure 5: Visual comparisons for SR($\times$4) methods on Set14, Manga109, and Urban100 datasets (Zoom in for more details).
  • ...and 3 more figures
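
The re-parameterization described in the Figure 3 caption relies on the linearity of convolution: summing the outputs of several $3\times3$ branches gives the same result as a single convolution with the sum of their kernels. The sketch below checks this numerically on one channel, which is enough because a depth-wise convolution treats each channel independently. It is a minimal illustration, not the authors' implementation: the choice of Sobel and Laplacian kernels, the per-branch scale factors, and all function names are assumptions, and bias terms are omitted.

```python
import random

# Fixed gradient kernels. Which kernels the paper actually uses is an assumption.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
FIXED_KERNELS = (SOBEL_X, SOBEL_Y, LAPLACIAN)


def conv3x3(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding and stride 1."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * image[y][x]
            out[i][j] = acc
    return out


def multi_branch(image, learned, scales):
    """Training-time form: a learned branch plus scaled fixed-kernel branches, summed."""
    branches = [conv3x3(image, learned)]
    for s, k in zip(scales, FIXED_KERNELS):
        scaled = [[s * v for v in row] for row in k]
        branches.append(conv3x3(image, scaled))
    h, w = len(image), len(image[0])
    return [[sum(b[i][j] for b in branches) for j in range(w)] for i in range(h)]


def merge_kernels(learned, scales):
    """Inference-time form: fold every branch into a single 3x3 kernel."""
    merged = [row[:] for row in learned]
    for s, k in zip(scales, FIXED_KERNELS):
        for i in range(3):
            for j in range(3):
                merged[i][j] += s * k[i][j]
    return merged


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two 2D lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[rng.random() for _ in range(6)] for _ in range(5)]
    learned = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(3)]
    scales = [0.5, -0.25, 0.1]  # hypothetical per-branch scale factors
    train_out = multi_branch(image, learned, scales)
    infer_out = conv3x3(image, merge_kernels(learned, scales))
    print("max abs difference:", max_abs_diff(train_out, infer_out))
```

The two outputs agree up to floating-point rounding, so the extra branches add no cost at inference time once the kernels are merged.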