VmambaIR: Visual State Space Model for Image Restoration

Yuan Shi, Bin Xia, Xiaoyu Jin, Xing Wang, Tianyu Zhao, Xin Xia, Xuefeng Xiao, Wenming Yang

TL;DR

VmambaIR applies State Space Models (SSMs) with linear complexity $O(n)$ to image restoration, capturing high-frequency information across scales with a Unet built from Omni Selective Scan (OSS) blocks. Each OSS block combines an OSS module with an Efficient Feed-Forward Network (EFFN). The omni selective scan mechanism extends Mamba by scanning in six directions (forward and backward along height, width, and channels), which gives full spatial awareness while retaining linear complexity. The architecture achieves state-of-the-art results on image deraining, single-image super-resolution, and real-world super-resolution with substantially fewer parameters and FLOPs (about 26% of the computational cost of prior work on real-world SR). The work shows the potential of linear-complexity State Space Models as a robust, scalable foundation for next-generation low-level vision tasks, and provides a simple, strong baseline that avoids distillation or teacher networks while delivering high-fidelity restorations.

Abstract

Image restoration is a critical task in low-level computer vision, aiming to restore high-quality images from degraded inputs. Various models, such as convolutional neural networks (CNNs), generative adversarial networks (GANs), transformers, and diffusion models (DMs), have been employed to address this problem with significant impact. However, CNNs have limitations in capturing long-range dependencies. DMs require large prior models and computationally intensive denoising steps. Transformers have powerful modeling capabilities but face challenges due to quadratic complexity with input image size. To address these challenges, we propose VmambaIR, which introduces State Space Models (SSMs) with linear complexity into comprehensive image restoration tasks. We utilize a Unet architecture to stack our proposed Omni Selective Scan (OSS) blocks, each consisting of an OSS module and an Efficient Feed-Forward Network (EFFN). Our proposed omni selective scan mechanism overcomes the unidirectional modeling limitation of SSMs by efficiently modeling image information flows in all six directions. Furthermore, we conducted a comprehensive evaluation of VmambaIR across multiple image restoration tasks, including image deraining, single image super-resolution, and real-world image super-resolution. Extensive experimental results demonstrate that VmambaIR achieves state-of-the-art (SOTA) performance with far fewer computational resources and parameters. Our research highlights the potential of state space models as promising alternatives to transformer and CNN architectures as foundational frameworks for next-generation low-level visual tasks.
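
The linear-complexity and unidirectional-modeling points in the abstract can be illustrated with a minimal sketch. It assumes a plain discrete state-space recurrence with a diagonal state and fixed parameters; the names `ssm_scan` and `bidirectional_scan` are hypothetical, and this is not the paper's selective scan (Mamba additionally makes the parameters depend on the input). The loop visits each element once, so the cost grows linearly with sequence length, and a single pass only carries information forward, which is why a second pass over the reversed sequence is added.

```python
import numpy as np


def ssm_scan(x, a, b, c):
    """Run h_t = a * h_{t-1} + b * x_t, y_t = c . h_t over a 1-D sequence.

    x: (L,) input sequence; a, b, c: (N,) diagonal state parameters.
    One pass over x, so the cost is O(L) for a fixed state size N.
    """
    h = np.zeros(len(a), dtype=float)
    y = np.empty(len(x), dtype=float)
    for t, x_t in enumerate(x):
        h = a * h + b * x_t      # state update: depends on past inputs only
        y[t] = np.dot(c, h)      # readout
    return y


def bidirectional_scan(x, a, b, c):
    """Sum a forward scan and a scan over the reversed sequence (flipped back)."""
    forward = ssm_scan(x, a, b, c)
    backward = ssm_scan(x[::-1], a, b, c)[::-1]
    return forward + backward


if __name__ == "__main__":
    a = np.array([0.5])
    b = np.array([1.0])
    c = np.array([1.0])
    x = np.array([1.0, 0.0, 0.0])
    print(ssm_scan(x, a, b, c))            # forward only: the impulse decays to the right
    print(bidirectional_scan(x, a, b, c))  # both directions contribute
```

In the forward-only scan, an output at position t never sees inputs after t. Combining scans in opposite directions is the simplest way to remove that restriction without giving up the single-pass cost.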

Paper Structure (29 sections, 2 equations, 10 figures, 4 tables)

Figures (10)

  • Figure 1: VmambaIR achieves higher accuracy on image restoration tasks at lower computational cost. GFLOPs are computed for an input image size of $64\times64$. On the real-world super-resolution task, VmambaIR achieves higher reconstruction accuracy with only 26$\%$ of the computational cost.
  • Figure 2: Overview of our VmambaIR. The low-quality image undergoes an initial convolutional processing step to extract shallow features. These features are then fed into a Unet architecture, which is constructed using our proposed OSS block, enabling the extraction and reconstruction of features at various scales. The reconstructed features are subsequently refined through multiple iterations of OSS blocks. Finally, the refined features are passed through a tail block, typically involving convolution or gradual upsampling, to reconstruct the final high-quality image.
  • Figure 3: The architecture of one of our core designs, Omni Selective Scan. In the figure, "H Forward Scan", "W Forward Scan", and "C Forward Scan" indicate scanning from the top left to the bottom right, scanning from the bottom left to the top right on the two-dimensional image plane, and scanning the feature channels from front to back, respectively. The term "Backward" denotes the reverse direction of scanning. For the sake of simplicity in representation, operations on the feature dimensions, such as reshape and permute, have been omitted.
  • Figure 4: Comparison between our proposed omni selective scan and self-attention. Omni selective scan models image features from six directions and has linear computational complexity.
  • Figure 5: Visual comparison of single image super-resolution methods. Zoom in for better detail.
  • ...and 5 more figures
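
The six scan directions described in the Figure 3 caption can be sketched as six orderings of one feature tensor. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `six_direction_sequences` is hypothetical, the H scan is taken as row-major order and the W scan as column-major order starting from the top-left corner (the caption describes a different start corner for the W scan), and for the channel scan each channel's flattened spatial map is treated as one token. The reshape and permute steps the caption says are omitted from the figure are exactly what this code does.

```python
import numpy as np


def six_direction_sequences(feat):
    """Turn a (C, H, W) feature map into six 2-D sequences (length, token_dim).

    Assumed orderings (hypothetical, for illustration only):
      h_*: pixels in row-major order, each token holds C channel values
      w_*: pixels in column-major order, each token holds C channel values
      c_*: channels in index order, each token is a flattened H*W map
    """
    c, h, w = feat.shape
    h_fwd = feat.reshape(c, h * w).T                      # (H*W, C)
    w_fwd = feat.transpose(0, 2, 1).reshape(c, h * w).T   # (H*W, C)
    c_fwd = feat.reshape(c, h * w)                        # (C, H*W)
    return {
        "h_forward": h_fwd,
        "h_backward": h_fwd[::-1],   # same path, opposite direction
        "w_forward": w_fwd,
        "w_backward": w_fwd[::-1],
        "c_forward": c_fwd,
        "c_backward": c_fwd[::-1],
    }


if __name__ == "__main__":
    feat = np.arange(2 * 2 * 3).reshape(2, 2, 3)  # C=2, H=2, W=3
    for name, seq in six_direction_sequences(feat).items():
        print(name, seq.shape)
```

Each of the six sequences could then be passed to a one-directional scan and the results mapped back to (C, H, W) and merged. How the paper merges the directions is not stated in this excerpt, so that step is left out.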