U-shaped Vision Mamba for Single Image Dehazing

Zhuoran Zheng, Chen Wu

TL;DR

This work tackles single-image dehazing under resource constraints by combining local CNN features with efficient long-range modeling via State Space Sequence Models (SSMs) in a U-shaped encoder-decoder called UVM-Net. The Bi-SSM module employs dual SSM branches to capture global context across channels while preserving local detail, enabling fast inference (roughly 100 FPS) on hazy images. Experiments on the RESIDE datasets show performance competitive with the state of the art at favorable efficiency, and ablations verify the Bi-SSM's contribution. The method also generalizes to other restoration tasks such as low-light enhancement and deraining, suggesting a versatile backbone for future image restoration work.

Abstract

Transformers are currently the most popular architecture for image dehazing, but their high computational complexity limits their ability to model long-range dependencies on resource-constrained devices. To tackle this challenge, we introduce the U-shaped Vision Mamba (UVM-Net), an efficient single-image dehazing network. Inspired by State Space Sequence Models (SSMs), a recent class of deep sequence models known for handling long sequences, we design a Bi-SSM block that combines the local feature extraction of convolutional layers with the SSM's ability to capture long-range dependencies. Extensive experimental results demonstrate the effectiveness of our method. Our method offers an efficient approach to long-range dependency modeling for image dehazing as well as other image restoration tasks. The code is available at https://github.com/zzr-idam/UVM-Net. Our method takes only 0.009 seconds to infer a 325 × 325 resolution image (roughly 100 FPS), excluding I/O handling time.
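The latency and frame-rate figures quoted in the abstract are related by a simple reciprocal, and a per-image latency that excludes I/O can be measured with a small timing harness. The sketch below is illustrative only: `measure_latency` and `latency_to_fps` are hypothetical helper names, and the workload passed in is a stand-in, not UVM-Net itself.

```python
import time


def latency_to_fps(seconds_per_image):
    """Convert a per-image latency in seconds to frames per second."""
    return 1.0 / seconds_per_image


def measure_latency(fn, warmup=5, runs=50):
    """Average wall-clock time of fn() over `runs` calls.

    Warm-up calls are discarded so one-off setup costs are not counted.
    Only the call itself is timed, so loading and saving images (I/O)
    stays outside the measurement, as in the abstract's figure.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


if __name__ == "__main__":
    # Stand-in workload (an assumption; replace with a model forward pass).
    latency = measure_latency(lambda: sum(i * i for i in range(10_000)))
    print(f"latency: {latency:.6f} s, throughput: {latency_to_fps(latency):.1f} FPS")
    # The abstract's 0.009 s per image corresponds to about 111 FPS,
    # which it reports rounded down as 100 FPS.
    print(f"{latency_to_fps(0.009):.1f} FPS")
```

When timing a model on a GPU, the device typically needs to be synchronized before each clock reading, since kernel launches return before the work has finished.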

Paper Structure (10 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Overview of the UVM-Net architecture. UVM-Net employs an encoder-decoder framework with UVM-Net blocks in the encoder and convolution blocks in the decoder, together with skip connections. In a UVM-Net block, the feature maps first pass through a convolution, the unfolded pixels are then modeled by the SSM, and the result is finally reshaped to the size of the input.
  • Figure 2: Qualitative comparison of image dehazing methods on the SOTS mix set. The first row shows outdoor images and the second row indoor images; the third and fourth rows are real-world images. The first column shows the hazy inputs and the last column the corresponding ground truth.
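
The data flow described in the Figure 1 caption (convolution, unfolding pixels into a sequence, an SSM pass, reshaping back to the input size) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a plain linear SSM with fixed diagonal parameters rather than Mamba's selective scan, a single scan direction in raster order, and the hypothetical names `conv3x3`, `ssm_scan`, and `uvm_block_sketch`.

```python
import numpy as np


def conv3x3(x, w):
    """Zero-padded 3x3 convolution. x: (C, H, W), w: (C_out, C, 3, 3)."""
    _, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            # Add the contribution of kernel tap (i, j) to every output pixel.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out


def ssm_scan(u, a, b, c):
    """Linear state space recurrence applied independently per channel.

    u: (L, D) input sequence; a, b, c: (D, N) diagonal SSM parameters.
    h_t = a * h_{t-1} + b * u_t,   y_t = sum over the state dim of c * h_t
    """
    length, dim = u.shape
    h = np.zeros_like(a)
    y = np.empty((length, dim))
    for t in range(length):
        h = a * h + b * u[t][:, None]
        y[t] = (c * h).sum(axis=1)
    return y


def uvm_block_sketch(x, w, a, b, c):
    """Conv -> unfold pixels -> SSM -> reshape back to (C, H, W)."""
    _, h, wd = x.shape
    f = conv3x3(x, w)                       # local features
    ch = f.shape[0]
    seq = f.reshape(ch, h * wd).T           # (H*W, C): pixels in raster order
    y = ssm_scan(seq, a, b, c)              # long-range mixing along the sequence
    return y.T.reshape(ch, h, wd)           # restore the spatial layout


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((2, 4, 4))
    w = rng.standard_normal((2, 2, 3, 3)) * 0.1
    a = np.full((2, 3), 0.5)                # decay below 1 keeps the state bounded
    b = np.ones((2, 3))
    c = np.ones((2, 3))
    print(uvm_block_sketch(x, w, a, b, c).shape)
```

Because the recurrence carries a state across the whole pixel sequence, each output position can depend on every earlier pixel at a cost linear in the sequence length, which is the efficiency argument made for SSMs over attention.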