RemoteDet-Mamba: A Hybrid Mamba-CNN Network for Multi-modal Object Detection in Remote Sensing Images

Kejun Ren, Xin Wu, Lianming Xu, Li Wang

TL;DR

Multimodal object detection in UAV remote sensing imagery is hampered by small, densely distributed targets and low inter-class discriminability. The paper introduces RemoteDet-Mamba, a hybrid CNN-Mamba architecture that pairs a Siamese CNN encoder, which extracts modality-specific local features, with a Cross-modal Fusion Mamba (CFM) module, which uses four-directional patch-level SS2D scanning to perform global cross-modal fusion in linear time. The total loss is $L_{total} = \alpha L_{box} + \beta L_{obj} + \gamma L_{cls} + \delta L_{theta}$ with $\alpha=0.05$, $\beta=1.0$, $\gamma=0.5$, $\delta=0.5$, where $L_{box}$ is the CIoU loss, $L_{cls}$ and $L_{obj}$ use Smooth BCE, and $L_{theta}$ uses CSL (Circular Smooth Label) with a Gaussian window. On the DroneVehicle dataset, RemoteDet-Mamba reaches ~81.8% mAP with ~71.3M parameters at ~24 FPS, outperforming the compared state-of-the-art methods at low computational overhead, which suggests it is practical for real-time multimodal remote sensing applications.
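To make the loss weighting concrete, the following is a minimal sketch of how the four terms could be combined and how a Gaussian-window circular smooth label might be built for an angle bin. The weights are the ones quoted above; the function names, the 180-bin angle discretization, the window radius, and the choice of sigma are assumptions made for illustration and are not taken from the paper.

```python
import math

# Loss weights as quoted in the TL;DR (alpha, beta, gamma, delta).
WEIGHTS = {"box": 0.05, "obj": 1.0, "cls": 0.5, "theta": 0.5}


def total_loss(l_box, l_obj, l_cls, l_theta, weights=WEIGHTS):
    """Weighted sum L_total = a*L_box + b*L_obj + c*L_cls + d*L_theta."""
    return (weights["box"] * l_box
            + weights["obj"] * l_obj
            + weights["cls"] * l_cls
            + weights["theta"] * l_theta)


def csl_gaussian_label(angle_bin, num_bins=180, radius=6, sigma=None):
    """Build a circular smooth label for one angle bin.

    Assumed setup (hypothetical, for illustration): angles are discretized
    into `num_bins` classes, and a Gaussian window of half-width `radius`
    is centered on the true bin, wrapping around at the boundary so that
    bin 0 and bin num_bins-1 are treated as neighbors.
    """
    if sigma is None:
        sigma = float(radius)  # assumption: sigma tied to the window radius
    label = [0.0] * num_bins
    for offset in range(-radius, radius + 1):
        idx = (angle_bin + offset) % num_bins  # circular wrap-around
        label[idx] = math.exp(-(offset ** 2) / (2.0 * sigma ** 2))
    return label


if __name__ == "__main__":
    # With every term equal to 1, the total is the sum of the weights.
    print(total_loss(1.0, 1.0, 1.0, 1.0))
    soft = csl_gaussian_label(0)
    # The neighbors of bin 0 include bin 179 because the label is circular.
    print(soft[0], soft[1], soft[179])
```

The wrap-around is the point of the circular label: an angle prediction one bin away across the boundary is penalized as lightly as any other adjacent bin, rather than as a maximally wrong class.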

Abstract

Unmanned Aerial Vehicle (UAV) remote sensing, with its advantages of rapid information acquisition and low cost, has been widely applied in scenarios such as emergency response. However, due to the long imaging distance and complex imaging mechanisms, targets in remote sensing images often face challenges such as small object size, dense distribution, and low inter-class discriminability. To address these issues, this paper proposes a multi-modal remote sensing object detection network called RemoteDet-Mamba, which is based on a patch-level four-direction selective scanning fusion strategy. This method simultaneously learns unimodal local features and fuses cross-modal patch-level global semantic information, thereby enhancing the distinguishability of small objects and improving inter-class discrimination. Furthermore, the designed lightweight fusion mechanism effectively decouples densely packed targets while reducing computational complexity. Experimental results on the DroneVehicle dataset demonstrate that RemoteDet-Mamba achieves superior detection performance compared to current mainstream methods, while maintaining low parameter count and computational overhead, showing promising potential for practical applications.
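The patch-level four-direction scanning mentioned above can be illustrated with a small sketch of the traversal orders alone. It assumes the four directions are row-major, column-major, and the reverse of each, as is common in SS2D-style scanning; the function names are hypothetical, and the selective state-space update and the way CFM combines the two modalities are deliberately omitted because the text here does not specify them.

```python
def four_direction_scans(height, width):
    """Return four traversal orders over a height x width patch grid.

    Each order is a list of flat patch indices (row * width + col).
    Assumed directions: row-major, column-major, and both reversed.
    """
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


def merge_scans(per_direction_outputs, scans):
    """Scatter each direction's sequence outputs back to grid positions and sum.

    per_direction_outputs[d][t] is the value produced at step t of scan d;
    it belongs to patch scans[d][t]. Summation is one simple merge choice.
    """
    merged = [0.0] * len(scans[0])
    for order, outputs in zip(scans, per_direction_outputs):
        for step, patch_idx in enumerate(order):
            merged[patch_idx] += outputs[step]
    return merged


if __name__ == "__main__":
    scans = four_direction_scans(2, 3)
    for order in scans:
        print(order)
    # Identity "model": each step outputs its own patch index, so after
    # merging four directions every patch holds 4 * its index.
    outputs = [[float(i) for i in order] for order in scans]
    print(merge_scans(outputs, scans))
```

Each scan visits every patch exactly once, so the total sequence length processed is 4·H·W, which is where the linear cost in the number of patches comes from.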

Paper Structure

This paper contains 9 sections, 13 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: GFLOPS comparison with varying input sizes (H = W); C2Former omitted beyond 1100×1100 due to quadratic memory growth.
  • Figure 2: The architecture of the RemoteDet-Mamba framework. The top portion is the outline of the RemoteDet-Mamba. The bottom section provides a detailed view of our CFM module.
  • Figure 3: Visualization of ground-truth boxes under different GT forms.
  • Figure 4: Detection results in (a-1)–(d-1), (a-2)–(d-2), and (a-3)–(d-3) correspond to RGB-only, TIR-only, and our RemoteDet-Mamba, respectively, on the DroneVehicle dataset.