MambaIR: A Simple Baseline for Image Restoration with State-Space Model

Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, Shu-Tao Xia

TL;DR

This work tackles the persistent trade-off between global receptive field coverage and computational efficiency in image restoration. It introduces MambaIR, a three-stage baseline built on Residual State-Space Blocks, Vision State-Space Modules, and a 2D Selective Scan Module, augmented with local enhancement and channel attention to mitigate local pixel forgetting and channel redundancy. Through extensive ablations and comparisons on SR and denoising tasks, MambaIR demonstrates strong performance gains over CNN/Transformer baselines (e.g., surpassing SwinIR by up to ~0.45dB in SR) while maintaining linear computational complexity. The approach offers a practical, scalable backbone for image restoration with potential applicability across diverse low-level vision tasks, with code released for reproducibility.

Abstract

Recent years have seen significant advancements in image restoration, largely attributed to the development of modern deep neural networks, such as CNNs and Transformers. However, existing restoration backbones often face the dilemma between global receptive fields and efficient computation, hindering their application in practice. Recently, the Selective Structured State Space Model, especially the improved version Mamba, has shown great potential for long-range dependency modeling with linear complexity, which offers a way to resolve the above dilemma. However, the standard Mamba still faces certain challenges in low-level vision such as local pixel forgetting and channel redundancy. In this work, we introduce a simple but effective baseline, named MambaIR, which introduces both local enhancement and channel attention to improve the vanilla Mamba. In this way, our MambaIR takes advantage of the local pixel similarity and reduces the channel redundancy. Extensive experiments demonstrate the superiority of our method; for example, MambaIR outperforms SwinIR by up to 0.45dB on image SR, using similar computational cost but with a global receptive field. Code is available at https://github.com/csguoh/MambaIR.
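
To make the "linear complexity" claim concrete, here is a minimal sketch of a discretized linear state-space recurrence with a scalar hidden state. It is an illustration under simplifying assumptions, not the Mamba selective scan itself: Mamba makes its parameters input-dependent and uses a more elaborate implementation. The function name `ssm_scan` and the fixed coefficients `a`, `b`, `c` are hypothetical and chosen only for this example.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run h_t = a*h_{t-1} + b*x_t and y_t = c*h_t over a 1D sequence.

    Illustrative only: a, b, c are fixed scalars here (an assumption),
    whereas a selective SSM would derive them from the input.
    The loop does constant work per element, so cost is linear in len(xs).
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x   # update the hidden state
        ys.append(c * h)    # read out the current output
    return ys


# An impulse at position 0 still influences every later output,
# which is how a single pass can carry long-range information.
print(ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5))  # [1.0, 0.5, 0.25, 0.125]
```

Because the state is carried forward step by step, every output can depend on all earlier inputs while the total work stays proportional to the sequence length.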

Paper Structure (17 sections, 9 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: The Effective Receptive Field (ERF) visualization [luo2016understanding, ding2022scaling] for EDSR [lim2017enhanced], RCAN [zhang2018image], SwinIR [liang2021swinir], HAT [chen2023activating], and the proposed MambaIR. A larger ERF is indicated by a more extensively distributed dark area. The proposed MambaIR achieves a significant global effective receptive field.
  • Figure 1: Ablation experiments for different design choices of RSSB.
  • Figure 2: The overall network architecture of our MambaIR, as well as the (a) Residual State-Space Block (RSSB), the (b) Vision State-Space Module (VSSM), and the (c) 2D Selective Scan Module (2D-SSM).
  • Figure 3: (a) Without local enhancement, spatially close pixels (area in the red box) are forgotten in the flattened 1D sequence because they end up far apart. (b) We apply ReLU and global average pooling to the VSSM outputs of the last layer to obtain the channel activation values. Most channels are not activated (i.e., channel redundancy) when channel attention is not used.
  • Figure 4: Qualitative comparison of our MambaIR with CNN and Transformer based methods on classic image SR with scale $\times$4.
  • ...and 3 more figures
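
The two effects described in the Figure 3 caption can be illustrated with a small sketch. It assumes a single row-major scan for the flattening and uses toy feature values; the helper names `flat_index` and `channel_activations` are hypothetical and are not taken from the MambaIR code.

```python
def flat_index(row, col, width):
    """Position of pixel (row, col) in a row-major flattened sequence."""
    return row * width + col


def channel_activations(feature):
    """Apply ReLU, then global average pooling, to each channel.

    `feature` is a list of channels, each an H x W nested list of floats.
    Returns one activation value per channel.
    """
    acts = []
    for channel in feature:
        values = [max(0.0, v) for row in channel for v in row]  # ReLU
        acts.append(sum(values) / len(values))                  # global average
    return acts


width = 64  # assumed image width for the example
# Horizontal neighbours stay adjacent after flattening ...
print(flat_index(10, 11, width) - flat_index(10, 10, width))  # 1
# ... but vertical neighbours end up `width` positions apart.
print(flat_index(11, 10, width) - flat_index(10, 10, width))  # 64

# Toy feature map with two channels (made-up values).
feature = [
    [[-1.0, -2.0], [-0.5, -3.0]],  # all negative: pooled activation is zero
    [[1.0, 3.0], [0.0, 4.0]],      # mostly positive: clearly activated
]
print(channel_activations(feature))  # [0.0, 2.0]
```

A channel whose pooled value is zero or close to it contributes little to the output, which is the sense in which the caption speaks of channels not being activated.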