Fusion-Mamba for Cross-modality Object Detection

Wenhao Dong, Haodong Zhu, Shaohui Lin, Xiaoyan Luo, Yunhang Shen, Xuhui Liu, Juan Zhang, Guodong Guo, Baochang Zhang

TL;DR

This work addresses cross-modality object detection by tackling the modality disparities between RGB and IR data. It introduces Fusion-Mamba, a gated, Mamba-based block that fuses features in a hidden state space using two modules: State Space Channel Swapping (SSCS) for shallow fusion and Dual State Space Fusion (DSSF) for deep fusion. The approach achieves state-of-the-art $m$AP on the LLVIP, $M^3$FD, and FLIR-Aligned datasets, with improvements of up to $5.9\%$ in $m$AP and $4.9\%$ in $m$AP$_{50}$, while running faster at inference than Transformer-based methods thanks to its linear $O(N)$ complexity. Combining hidden-state fusion with gated cross-modal interaction improves representation consistency and robustness under challenging conditions, which suggests that Fusion-Mamba could extend to broader cross-modal tasks and real-time applications.
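
To make the linear-complexity claim concrete, the following is a minimal sketch of a discrete state-space recurrence, the kind of scan that Mamba-style models build on. It is not the paper's implementation: the function name `ssm_scan`, the diagonal state matrix, and the fixed parameters `A`, `B`, and `C` are illustrative assumptions (a selective SSM makes these input-dependent). The point is only that each token triggers one constant-cost state update, so the total cost grows as $O(N)$ in the sequence length.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run a discrete linear state-space recurrence over a 1D sequence.

    h_t = A * h_{t-1} + B * x_t   (A is diagonal, stored as a vector)
    y_t = C . h_t

    A, B, C are fixed here for simplicity (an assumption of this sketch);
    in a selective SSM such as Mamba they depend on the input.
    """
    h = np.zeros_like(A, dtype=float)   # hidden state, shape (D,)
    y = np.empty(len(x))
    for t, x_t in enumerate(x):         # one constant-cost update per token
        h = A * h + B * x_t
        y[t] = C @ h
    return y

# Impulse response of a one-dimensional state that decays by 0.5 per step.
y = ssm_scan(np.array([1.0, 0.0, 0.0]),
             A=np.array([0.5]), B=np.array([1.0]), C=np.array([1.0]))
print(y)  # [1.   0.5  0.25]
```

Doubling the sequence length doubles the number of loop iterations, whereas pairwise self-attention would quadruple the number of token interactions.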

Abstract

Fusing complementary information from different modalities effectively improves object detection performance, making it more useful and robust for a wider range of applications. Existing fusion strategies combine different types of images or merge different backbone features through elaborate neural network modules. However, these methods neglect that modality disparities affect cross-modality fusion performance: modalities captured with different camera focal lengths, placements, and angles are hard to fuse. In this paper, we investigate cross-modality fusion by associating cross-modal features in a hidden state space based on an improved Mamba with a gating mechanism. We design a Fusion-Mamba block (FMB) to map cross-modal features into a hidden state space for interaction, thereby reducing disparities between cross-modal features and enhancing the representation consistency of fused features. FMB contains two modules: the State Space Channel Swapping (SSCS) module facilitates shallow feature fusion, and the Dual State Space Fusion (DSSF) module enables deep fusion in a hidden state space. In extensive experiments on public datasets, our proposed approach outperforms the state-of-the-art methods in $m$AP by 5.9% on the $M^3$FD dataset and 4.9% on the FLIR-Aligned dataset, demonstrating superior object detection performance. To the best of our knowledge, this is the first work to explore the potential of Mamba for cross-modal fusion and to establish a new baseline for cross-modality object detection.
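
The shallow-fusion idea behind channel swapping can be illustrated with a small sketch. The abstract does not specify which channels SSCS exchanges or how many, so the helper `swap_channels`, its `ratio` parameter, and the choice to swap the leading channels are all assumptions made for illustration; the state-space processing that the actual module applies is omitted entirely.

```python
import numpy as np

def swap_channels(f_rgb, f_ir, ratio=0.5):
    """Exchange the leading fraction of channels between two (C, H, W) maps.

    Hypothetical helper: the split point and the choice of leading channels
    are assumptions of this sketch, not details taken from the paper.
    """
    if f_rgb.shape != f_ir.shape:
        raise ValueError("both feature maps must have the same shape")
    k = int(f_rgb.shape[0] * ratio)     # number of channels to exchange
    rgb_out = np.concatenate([f_ir[:k], f_rgb[k:]], axis=0)
    ir_out = np.concatenate([f_rgb[:k], f_ir[k:]], axis=0)
    return rgb_out, ir_out

# Toy features: RGB is all zeros, IR is all ones, so the swap is easy to see.
f_rgb = np.zeros((4, 2, 2))
f_ir = np.ones((4, 2, 2))
rgb_mixed, ir_mixed = swap_channels(f_rgb, f_ir)
print(rgb_mixed[:, 0, 0])  # [1. 1. 0. 0.]
print(ir_mixed[:, 0, 0])   # [0. 0. 1. 1.]
```

After the exchange each branch carries some channels from the other modality, which is one simple way to let later layers see both sources without adding parameters.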

Paper Structure

This paper contains 16 sections, 13 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: Heatmap visualization. (a) and (b) show the initial RGB and IR input images. (c) and (d) show heatmaps generated from each single modality using YOLOv8. (e) shows the heatmap of YOLO-MS with a CNN-based fusion module. (f) and (g) show heatmaps of ICAFusion and CFT with a transformer-based fusion module. (h) shows the heatmap of our FMB, which achieves better localization.
  • Figure 2: The architecture of the proposed Fusion-Mamba method. The detection network comprises a dual-stream feature extraction network and three Fusion-Mamba blocks (FMB), with the same neck and head as YOLOv8. The top shows our detection framework: $\phi_i$ and $\varphi_i$ are the convolutional modules of the RGB and IR branches, which generate the features $F_{R_i}$ and $F_{IR_i}$, respectively. $\hat{F}_{R_i}$ and $\hat{F}_{IR_i}$ are the feature maps enhanced by our FMB. $P_3$, $P_4$, and $P_5$ are the sums of the enhanced feature maps at the last three stages and serve as the feature pyramid inputs to the neck. The bottom shows the design details of our FMB.
  • Figure 3: Illustration of the 2D Selective Scan (SS2D) on an RGB image. Initially, the image undergoes scan expansion, resulting in four distinct feature sequences. Subsequently, each of these sequences is independently processed through the S6 block. Finally, the outputs of the S6 block are combined through scan merging to generate the final 2D feature map.
  • Figure 4: Illustration of the neck and head following YOLOv8.
  • Figure 5: Heatmap visualization of various cross-modality object detection methods on LLVIP, $M^3$FD and FLIR-Aligned datasets.
  • ...and 6 more figures
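
The scan expansion and scan merging steps described in the Figure 3 caption can be sketched as follows. The caption only states that four sequences are produced, so the four directions used here (row-major, column-major, and the reverse of each) are an assumption, and the S6 block is replaced by an identity placeholder. The function names `scan_expand` and `scan_merge` are hypothetical.

```python
import numpy as np

def scan_expand(feat):
    """Unfold a (C, H, W) map into four (C, H*W) sequences.

    Assumed directions: row-major, column-major, and the reverse of each.
    """
    c, h, w = feat.shape
    row = feat.reshape(c, h * w)                       # left-to-right, top-to-bottom
    col = feat.transpose(0, 2, 1).reshape(c, h * w)    # top-to-bottom, left-to-right
    return [row, col, row[:, ::-1], col[:, ::-1]]

def scan_merge(seqs, h, w):
    """Map each sequence back to its 2D layout and sum the four maps."""
    row, col, row_rev, col_rev = seqs
    c = row.shape[0]
    out = row.reshape(c, h, w).astype(float)           # astype returns a fresh array
    out += col.reshape(c, w, h).transpose(0, 2, 1)
    out += row_rev[:, ::-1].reshape(c, h, w)
    out += col_rev[:, ::-1].reshape(c, w, h).transpose(0, 2, 1)
    return out

feat = np.arange(24, dtype=float).reshape(2, 3, 4)
seqs = scan_expand(feat)
processed = [s for s in seqs]        # identity stand-in for the S6 block
merged = scan_merge(processed, 3, 4)
print(np.array_equal(merged, 4 * feat))  # True
```

With the identity placeholder, merging returns four times the input, which confirms that every sequence is restored to the correct spatial position before summation.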