Deformable Mamba for Wide Field of View Segmentation
Jie Hu, Junwei Zheng, Jiale Wei, Jiaming Zhang, Rainer Stiefelhagen
TL;DR
This work tackles wide-FoV segmentation by introducing a distortion-aware decoder built on the Mamba architecture. The Deformable Mamba Decoder uses four DMF blocks, each combining a single DS2D-based quadri-directional scan with a DCNv2 deformable convolution branch, to fuse multi-scale encoder features. This yields distortion-aware representations while preserving linear time complexity. The decoder is compatible with CNN-, Transformer-, and Mamba-based backbones, achieves state-of-the-art results on five wide-FoV benchmarks, and is far cheaper than widely used decoder heads (72% fewer parameters and 97% fewer FLOPs). The approach broadens practical deployment for panoramic and fisheye scenes in both real and synthetic data, and the authors suggest future work using multi-modal LLMs to further enhance semantic understanding in wide-FoV contexts.
Abstract
Recent advancements in the Mamba architecture, with its linear computational complexity, have made it a promising alternative to transformer architectures, which suffer from quadratic complexity. While existing works primarily focus on adapting Mamba as a vision encoder, the critical role of task-specific Mamba decoders remains under-explored, particularly for distortion-prone dense prediction tasks. This paper addresses two interconnected challenges: (1) the design of a Mamba-based decoder that seamlessly adapts to various architectures (e.g., CNN-, Transformer-, and Mamba-based backbones), and (2) the performance degradation of decoders that lack distortion awareness when processing wide-FoV images (e.g., 180° fisheye and 360° panoramic settings). We propose the Deformable Mamba Decoder, an efficient distortion-aware decoder that integrates Mamba's computational efficiency with adaptive distortion awareness. Comprehensive experiments on five wide-FoV segmentation benchmarks validate its effectiveness. Notably, our decoder achieves a +2.5% performance improvement on the 360° Stanford2D3D segmentation benchmark while reducing parameters by 72% and FLOPs by 97% compared to widely used decoder heads.
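To make the linear-versus-quadratic contrast concrete, the following is a minimal sketch in plain Python, not the paper's decoder. It assumes a toy scalar state-space recurrence with fixed coefficients (Mamba's actual scan uses input-dependent parameters), and all function names are hypothetical. It illustrates only that a recurrent scan touches each token once, whereas pairwise attention compares every token with every other token.

```python
def linear_scan(xs, a=0.9, b=1.0, c=1.0):
    """Toy scalar recurrence (assumed for illustration only):
    h_t = a * h_{t-1} + b * x_t,  y_t = c * h_t.
    One state update per token, so cost grows linearly with len(xs)."""
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def scan_steps(num_tokens):
    """Number of state updates a single scan performs."""
    return num_tokens


def attention_pairs(num_tokens):
    """Number of query-key pairs in full self-attention."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    # Doubling the token count doubles scan work but quadruples attention work.
    for n in (1024, 2048, 4096):
        print(n, scan_steps(n), attention_pairs(n))
```

Under these assumptions, doubling the number of tokens (for example, by moving to a wider field of view at the same resolution) doubles the scan cost but quadruples the number of attention pairs.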
