
Activating Wider Areas in Image Super-Resolution

Cheng Cheng, Hang Wang, Hongbin Sun

TL;DR

This work unleashes the representation potential of a modern state space model, Vision Mamba (Vim), for SISR. It aims to illuminate the potential applications of state space models in the broader realm of image processing beyond SISR and to encourage further exploration in this direction.

Abstract

The prevalence of convolutional neural networks (CNNs) and vision transformers (ViTs) has markedly advanced single-image super-resolution (SISR). To further boost SR performance, several techniques, such as residual learning and attention mechanisms, have been introduced; their gains can be largely attributed to a wider activated area, that is, the set of input pixels that strongly influence the SR result. However, whether SR performance can be improved further through another versatile vision backbone remains an open question. To address this, we unleash the representation potential of a modern state space model, Vision Mamba (Vim), in the context of SISR. Specifically, we present three recipes for better utilization of Vim-based models: 1) integration into a MetaFormer-style block; 2) pre-training on a larger and broader dataset; 3) employing a complementary attention mechanism. Building on these recipes, we introduce MMA. The resulting network is capable of finding the most relevant and representative input pixels to reconstruct the corresponding high-resolution images. Comprehensive experimental analysis reveals that MMA not only achieves competitive or even superior performance compared to state-of-the-art SISR methods but also maintains relatively low memory and computational overheads (e.g., a +0.5 dB PSNR improvement on the Manga109 dataset with 19.8 M parameters at a scale factor of 2). Furthermore, MMA proves its versatility in lightweight SR applications. Through this work, we aim to illuminate the potential applications of state space models in the broader realm of image processing beyond SISR, encouraging further exploration in this innovative direction.
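The abstract reports gains in PSNR (dB). As a reminder of what that metric measures, here is a minimal sketch of the standard definition, PSNR = 10 log10(MAX^2 / MSE). The function name `psnr`, the flat-list image representation, and the default peak value of 255 are assumptions for illustration; the evaluation protocol actually used (color channel, border cropping) is not stated in this text and is not modeled here.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    Hypothetical helper: images are flattened into plain Python lists.
    """
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ground_truth = [10.0, 20.0, 30.0, 40.0]
    coarse = [14.0, 16.0, 34.0, 36.0]   # every pixel off by 4
    finer = [12.0, 18.0, 32.0, 38.0]    # every pixel off by 2
    print(psnr(ground_truth, coarse), psnr(ground_truth, finer))
```

Because the scale is logarithmic, a +0.5 dB change corresponds to the MSE shrinking by a factor of 10^(-0.05), roughly an 11% reduction, independent of the starting PSNR.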

Paper Structure (20 sections, 9 equations, 7 figures, 3 tables)


Figures (7)

  • Figure 1: LAM comparison of three representative SR methods. The patches for interpretation are marked with red boxes in the original images. LAM emphasizes the pixel range engaged in the SR result reconstruction, quantified by the diffusion index (DI). A higher DI indicates that more pixels are involved. Best viewed by zooming.
  • Figure 2: (a) The network architecture of our MMA and the structure of MMA block; (b) The structure of channel attention (CA) block; (c) The structure of Vim block.
  • Figure 3: Visual comparison ($\times2$) between MMA and state-of-the-art SISR methods on the BSD100 and Urban100 datasets. Best viewed by zooming. The highest PSNR values are marked in bold. More visual results are provided in the supplementary material.
  • Figure 4: LAM comparison ($\times2$) between MMA and state-of-the-art SISR methods on the Urban100 dataset. A higher DI indicates that more pixels are involved. Best viewed by zooming. More LAM results are provided in the supplementary material.
  • Figure 5: Model complexity comparison ($\times2$). PSNR (dB) on Manga109 and the number of parameters (#Params) are reported.
  • ...and 2 more figures
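
The captions above use the diffusion index (DI) to quantify how many pixels an attribution map involves, without defining it. The following is a minimal sketch, assuming the definition commonly used with LAM, DI = (1 - G) × 100, where G is the Gini coefficient of the per-pixel attribution magnitudes. That formula is not stated in this text, and the function names `gini` and `diffusion_index` are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = evenly spread)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or xs[0] < 0 or total <= 0:
        raise ValueError("need non-negative values with a positive sum")
    # Closed form on sorted data: G = 2 * sum(rank * x) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def diffusion_index(attribution):
    """Assumed DI: higher means attribution is spread over more pixels."""
    return (1.0 - gini(attribution)) * 100.0


if __name__ == "__main__":
    concentrated = [0.0, 0.0, 0.0, 1.0]  # one pixel carries all attribution
    spread = [1.0, 1.0, 1.0, 1.0]        # every pixel contributes equally
    print(diffusion_index(concentrated), diffusion_index(spread))
```

Under this assumed definition, a map whose attribution sits on a single pixel out of n scores 100/n, while a perfectly uniform map scores 100, which matches the captions' reading that a higher DI means more pixels are involved.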