Deformable Mamba for Wide Field of View Segmentation

Jie Hu, Junwei Zheng, Jiale Wei, Jiaming Zhang, Rainer Stiefelhagen

TL;DR

This work tackles wide-FoV segmentation by introducing a distortion-aware decoder built on the Mamba architecture. The Deformable Mamba Decoder uses four DMF blocks with a single DS2D-based quadri-directional scan and a DCNv2 deformable conv branch to fuse multi-scale encoder features, delivering distortion-sensitive representations while preserving linear time complexity. It remains compatible with CNN-, Transformer-, and Mamba-based backbones, achieving state-of-the-art results across five wide-FoV benchmarks and substantial efficiency gains in the decoder. The approach broadens practical deployment for panoramic and fisheye scenes in both real and synthetic data, and suggests future work involving multi-modal LLMs to further enhance semantic understanding in wide-FoV contexts.

Abstract

Recent advancements in the Mamba architecture, with its linear computational complexity, make it a promising alternative to transformer architectures, which suffer from quadratic complexity. While existing works primarily focus on adapting Mamba as vision encoders, the critical role of task-specific Mamba decoders remains under-explored, particularly for distortion-prone dense prediction tasks. This paper addresses two interconnected challenges: (1) the design of a Mamba-based decoder that seamlessly adapts to various architectures (e.g., CNN-, Transformer-, and Mamba-based backbones), and (2) the performance degradation in decoders lacking distortion-aware capability when processing wide-FoV images (e.g., 180° fisheye and 360° panoramic settings). We propose the Deformable Mamba Decoder, an efficient distortion-aware decoder that integrates Mamba's computational efficiency with adaptive distortion awareness. Comprehensive experiments on five wide-FoV segmentation benchmarks validate its effectiveness. Notably, our decoder achieves a +2.5% performance improvement on the 360° Stanford2D3D segmentation benchmark while reducing parameters by 72% and FLOPs by 97% compared with widely used decoder heads.

Paper Structure

This paper contains 15 sections, 3 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Our deformable Mamba (a) achieves better results across wide-FoV semantic segmentation while (b) maintaining parameter and computational efficiency.
  • Figure 2: Comparison of camera imaging with narrow-FoV (left) and wide-FoV (right) cameras. Narrow-FoV cameras maintain geometric fidelity but offer limited coverage, whereas wide-FoV cameras offer expansive scene capture while introducing substantial geometric distortions.
  • Figure 3: Overview of the Deformable Mamba (DMamba) framework. Given wide-FoV images (180° or 360°), the features extracted by an encoder are fused by the proposed Deformable Mamba Decoder, which is constructed from four Deformable Mamba Fusion (DMF) modules.
  • Figure 4: For an embedded 2D sequence, Mamba [gu2023mamba] employs a uni-directional scan from start to end (1). Vim [vim] utilizes a bi-directional scan combining (1) and (2), while VMamba [liu2024vmamba] adopts a quadri-directional scan incorporating (1), (2), (3), and (4).
  • Figure 5: Visualization of the wide-FoV segmentation results. From left to right: one 360° and two 180° segmentation.
  • ...and 1 more figures
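
The scan patterns named in the Figure 4 caption can be illustrated with a small pure-Python sketch. It is not the paper's DS2D module. It only shows a plain quadri-directional scan over a flattened H x W grid, under the assumption that directions (3) and (4) are a column-major traversal and its reverse. The function names `scan_orders`, `cross_scan`, and `cross_merge` are hypothetical, and the sequence model that would normally run between scanning and merging is left out.

```python
from typing import List, Sequence


def scan_orders(height: int, width: int) -> List[List[int]]:
    """Return four traversal orders over an H x W grid as flat row-major indices.

    Assumed mapping to the caption: (1) row-major, (2) reversed row-major,
    (3) column-major, (4) reversed column-major.
    """
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]


def cross_scan(tokens: Sequence[float], height: int, width: int) -> List[List[float]]:
    """Unfold a flattened grid into four 1D sequences, one per scan direction."""
    if len(tokens) != height * width:
        raise ValueError("tokens must contain height * width entries")
    return [[tokens[i] for i in order] for order in scan_orders(height, width)]


def cross_merge(sequences: Sequence[Sequence[float]], height: int, width: int) -> List[float]:
    """Map each sequence back to grid positions and sum the four directions."""
    merged = [0.0] * (height * width)
    for order, seq in zip(scan_orders(height, width), sequences):
        for position, grid_index in enumerate(order):
            merged[grid_index] += seq[position]
    return merged


if __name__ == "__main__":
    grid = [0, 1, 2, 3, 4, 5]  # a 2 x 3 grid, flattened row by row
    for direction, seq in enumerate(cross_scan(grid, 2, 3), start=1):
        print(direction, seq)
```

A uni-directional scan uses only the first sequence and a bi-directional scan uses the first two. Because every direction visits each position exactly once, the cost of scanning and merging grows linearly with the number of tokens.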