An Efficient and Mixed Heterogeneous Model for Image Restoration

Yubin Gu, Yuan Meng, Kaihang Zheng, Xiaoshuai Sun, Jiayi Ji, Weijian Ruan, Liujuan Cao, Rongrong Ji

TL;DR

This work tackles general-purpose image restoration (IR) across diverse degradations by proposing RestorMixer, a three-stage encoder-decoder that fuses CNNs, state-space models (Mamba), and transformer-style attention. The high-resolution stage uses Residual Depth CNN blocks for efficient local feature extraction, while the lower-resolution stages use Enhanced Memory Visual Mamba and Multi-scale Window Self-attention blocks to capture global dependencies and refine features across scales. Trained with deep supervision and a multi-scale, dual-domain loss, RestorMixer achieves leading performance on rain-removal, snow-removal, super-resolution (SR), and mixed-degradation benchmarks while remaining efficient at inference, which demonstrates the benefit of fusing heterogeneous architectures. The approach offers a practical, scalable path toward robust IR systems for real-world settings with resource constraints.

Abstract

Image restoration (IR), a fundamental multimedia data processing task, has a significant impact on downstream visual applications. In recent years, researchers have focused on developing general-purpose IR models capable of handling diverse degradation types, thereby reducing the cost and complexity of model development. Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mamba. CNNs excel at efficient inference, whereas Transformers and Mamba excel at capturing long-range dependencies and modeling global context. While each architecture has demonstrated success in specialized, single-task settings, little effort has been made to integrate heterogeneous architectures effectively so that they jointly address diverse IR challenges. To bridge this gap, we propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion. RestorMixer adopts a three-stage encoder-decoder structure, where each stage is tailored to the resolution and feature characteristics of its input. In the initial high-resolution stage, CNN-based blocks are employed to rapidly extract shallow local features. In the subsequent stages, we integrate a refined multi-directional scanning Mamba module with a multi-scale window-based self-attention mechanism. This hierarchical and adaptive design enables the model to leverage the strengths of CNNs in local feature extraction, Mamba in global context modeling, and attention mechanisms in dynamic feature refinement. Extensive experimental results demonstrate that RestorMixer achieves leading performance across multiple IR tasks while maintaining high inference efficiency. The official code can be accessed at https://github.com/ClimBin/RestorMixer.
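
To make the idea of multi-scale window-based self-attention more concrete, the sketch below shows only the token-grouping step: a feature map is split into non-overlapping windows at more than one window size, and attention would then be computed inside each window. This is an illustrative assumption rather than the authors' implementation; the function name `window_partition`, the non-overlapping layout, and the window sizes 2 and 4 are hypothetical choices that do not come from the paper.

```python
def window_partition(grid, win):
    """Split a 2D grid (list of rows) into non-overlapping win x win windows.

    Each window is returned as a flat list of tokens in row-major order.
    Non-overlapping windows are an assumption made for this sketch.
    """
    h, w = len(grid), len(grid[0])
    if h % win or w % win:
        raise ValueError("grid size must be divisible by the window size")
    windows = []
    for top in range(0, h, win):
        for left in range(0, w, win):
            windows.append([grid[r][c]
                            for r in range(top, top + win)
                            for c in range(left, left + win)])
    return windows


# A 4x4 "feature map" whose tokens are just their row-major indices.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]

# Hypothetical window sizes: attention would be restricted to each window.
multi_scale = {win: window_partition(grid, win) for win in (2, 4)}
```

With a window size of 2, each token would attend to only 4 tokens, while a window size of 4 covers the whole 4x4 map. Mixing scales in this way trades attention cost against the size of the receptive field.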

Paper Structure

This paper contains 22 sections, 20 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of three mainstream architectures in terms of input dependency, global view, and inference speed.
  • Figure 2: Comparison of RestorMixer with various representative methods in terms of performance, inference speed, and number of parameters. All methods are tested at the same input size.
  • Figure 3: Framework of RestorMixer. (a) Overall pipeline. (b) Structure of the M-T Blocks, composed of alternating Enhanced Memory Visual Mamba Blocks (EMVM) and Multi-scale Window Self-Attention (MWSA) Blocks. (c) Residual Depth CNN Block (RDCNN), built with a stack of basic residual convolutional units.
  • Figure 4: Illustration of the M-T Block. It is constructed by alternately stacking EMVM blocks with four-directional scanning and MWSA blocks to jointly capture long-range dependencies and local multi-scale features.
  • Figure 5: Visual comparison with SOTA methods under various rain intensities.
  • ...and 2 more figures
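
The four-directional scanning mentioned in the Figure 4 caption can be illustrated with a small sketch of how a 2D feature map might be unrolled into four 1D token sequences before a sequence model such as Mamba processes it. The four directions used here (row-major and column-major, each forward and reversed) follow common practice in visual Mamba models and are an assumption; the source does not specify the exact scan paths used by EMVM, and the function name `four_direction_scans` is hypothetical.

```python
def four_direction_scans(grid):
    """Unroll a 2D grid (list of rows) into four 1D token sequences.

    Assumed directions: row-major and column-major traversals,
    each read forward and backward.
    """
    h, w = len(grid), len(grid[0])
    row_major = [grid[r][c] for r in range(h) for c in range(w)]
    col_major = [grid[r][c] for c in range(w) for r in range(h)]
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


# A 2x3 "feature map" whose tokens are their row-major indices.
grid = [[0, 1, 2],
        [3, 4, 5]]
scans = four_direction_scans(grid)
```

Because a state-space model reads tokens causally along one sequence, scanning in several directions lets each position receive context from every side of the image once the per-direction outputs are merged.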