Multi-dimensional Visual Prompt Enhanced Image Restoration via Mamba-Transformer Aggregation

Aiwen Jiang, Hourong Chen, Zhiwen Chen, Jihua Ye, Mingwen Wang

TL;DR

The paper addresses robust all-in-one image restoration across multiple degradations by fusing Mamba's linear-complexity spatial modeling with Transformer-style channel attention, complemented by multi-dimensional learned prompts. It introduces MTAIR, a hybrid Mamba-Transformer architecture with a plug-in Spatial-Channel Prompt Block that learns degradation-aware prompts across scales and interacts with features via cross-attention. The approach achieves state-of-the-art performance on denoising, deraining, and dehazing benchmarks while maintaining computational efficiency, and the prompts are designed to be easily integrated into other networks. This work advances practical multi-degradation restoration for real-world vision systems by enabling adaptive, cross-dimensional feature fusion guided by learnable prompts.

Abstract

Recent efforts in image restoration have focused on developing "all-in-one" models that can handle different degradation types and levels within a single model. However, most mainstream Transformer-based models face a dilemma between model capability and computational burden, since the computational complexity of self-attention grows quadratically with image size, and they have inadequacies in capturing long-range dependencies. Most Mamba-based models scan the feature map only along the spatial dimension for global modeling and fail to fully utilize information in the channel dimension. To address these problems, this paper proposes to fully exploit the complementary advantages of Mamba and Transformer without sacrificing computational efficiency. Specifically, the selective scanning mechanism of Mamba is employed for spatial modeling, capturing long-range spatial dependencies with linear complexity. The self-attention mechanism of the Transformer is applied to channel modeling, avoiding the high computational burden that grows quadratically with the image's spatial dimensions. Moreover, to enrich the informative prompts available for effective image restoration, multi-dimensional prompt learning modules are proposed to learn prompt-flows from multi-scale encoder/decoder layers. These help reveal the underlying characteristics of various degradations from both spatial and channel perspectives, thereby enhancing the capability of the "all-in-one" model to solve various restoration tasks. Extensive experimental results on several image restoration benchmark tasks, such as image denoising, dehazing, and deraining, demonstrate that the proposed method achieves new state-of-the-art performance compared with many popular mainstream methods. Related source code and pre-trained parameters will be made public on GitHub: https://github.com/12138-chr/MTAIR.
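
The cost argument for channel-wise self-attention can be illustrated with a small sketch. This is not the authors' implementation: it omits the learned query/key/value projections, uses a fixed 1/sqrt(N) scaling as an assumption, and the names `channel_attention` and `attention_map_entries` are hypothetical. It only shows that attending across channels produces a C x C attention map, whose size does not depend on the number of pixels N = H*W, whereas attending across pixels would need an N x N map.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def channel_attention(q, k, v):
    """Self-attention across channels (hypothetical helper).

    q, k, v are C x N matrices: C channels, N = H*W flattened pixels.
    Returns the C x N output and the C x C attention map.
    """
    n = len(q[0])
    scale = 1.0 / math.sqrt(n)  # assumed scaling, not taken from the paper
    scores = matmul(q, transpose(k))  # C x C, independent of image size
    attn = softmax_rows([[s * scale for s in row] for row in scores])
    return matmul(attn, v), attn


def attention_map_entries(height, width, channels):
    """Entries in a pixel-wise (N x N) vs. channel-wise (C x C) attention map."""
    n = height * width
    return n * n, channels * channels


# Toy feature map with C = 3 channels and N = 4 pixels, used as q, k and v.
features = [
    [1.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 2.0],
]
output, attn_map = channel_attention(features, features, features)
spatial_entries, channel_entries = attention_map_entries(64, 64, 48)
```

For a 64 x 64 feature map with 48 channels (illustrative sizes only), the pixel-wise map would hold 4096 x 4096 entries while the channel-wise map holds 48 x 48, which is the gap the abstract refers to when it assigns self-attention to the channel dimension.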

Paper Structure

This paper contains 23 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Overview of MTAIR. It consists of a multi-stage encoder-decoder network (M-T DHB or TB) and a multi-stage S-C Prompt Block.
  • Figure 2: Overview of the M-T DHB. (a) M-T DN: Mamba-Transformer Double-branch Network. (b) M-T DIM: Mamba-Transformer Dual Interaction Module. (c) Vision State-Space Module. (d) Channel Attention module. (e) S-C module. (f) C-S module.
  • Figure 3: The scanning route consists of four directions: from the top-left to the bottom-right, from the bottom-right to the top-left, from the top-right to the bottom-left, and from the bottom-left to the top-right.
  • Figure 4: (a) Overview of the proposed S-C Prompt Block. (b) PAM: Prompt Attention Module. (c) S-C PIM: Spatial-Channel Prompt Interaction Module.
  • Figure 5: Visual comparisons with SOTA all-in-one models on sample images from Rain100L [yang2017deep], SOTS [li2018benchmarking], and CBSD68 [martin2001database]. The proposed model exhibits better degradation removal.
  • ...and 1 more figure
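
The four-direction scan described for Figure 3 can be sketched as index orderings over a flattened feature map. This is one possible reading, not the authors' code: it assumes the third route runs from the top-right to the bottom-left (the reverse of the fourth route) and that this pair walks column by column, while the first pair walks row by row. The helper names `scan_routes` and `restore_order` are hypothetical.

```python
def scan_routes(height, width):
    """Return four visiting orders over row-major pixel indices (hypothetical helper)."""

    def index(row, col):
        return row * width + col

    # Row by row, starting at the top-left and ending at the bottom-right.
    tl_to_br = [index(r, c) for r in range(height) for c in range(width)]
    # Column by column from the rightmost column (assumed traversal),
    # starting at the top-right and ending at the bottom-left.
    tr_to_bl = [index(r, c) for c in reversed(range(width)) for r in range(height)]
    return {
        "tl_to_br": tl_to_br,
        "br_to_tl": tl_to_br[::-1],
        "tr_to_bl": tr_to_bl,
        "bl_to_tr": tr_to_bl[::-1],
    }


def restore_order(scanned, route):
    """Put a scanned sequence back into row-major order."""
    restored = [None] * len(route)
    for position, pixel in enumerate(route):
        restored[pixel] = scanned[position]
    return restored


# A 2 x 3 map whose row-major pixels are labeled a..f.
pixels = ["a", "b", "c", "d", "e", "f"]
routes = scan_routes(2, 3)
scanned = {name: [pixels[i] for i in order] for name, order in routes.items()}
```

Each route yields a 1-D sequence that a state-space layer could process in linear time; `restore_order` undoes the permutation so that the four outputs can be merged at matching pixel positions.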