RemoteDet-Mamba: A Hybrid Mamba-CNN Network for Multi-modal Object Detection in Remote Sensing Images
Kejun Ren, Xin Wu, Lianming Xu, Li Wang
TL;DR
Multimodal UAV remote sensing object detection is challenged by small, densely distributed targets and low inter-class discriminability. The paper introduces RemoteDet-Mamba, a hybrid CNN-Mamba architecture that pairs a Siamese CNN encoder, which extracts modality-specific local features, with a Cross-modal Fusion Mamba (CFM) module, which uses four-directional patch-level SS2D scanning to fuse the modalities globally in linear time. The training loss is $L_{total} = \alpha L_{box} + \beta L_{obj} + \gamma L_{cls} + \delta L_{theta}$ with $\alpha=0.05$, $\beta=1.0$, $\gamma=0.5$, $\delta=0.5$, where $L_{box}$ is the CIoU loss, $L_{cls}$ and $L_{obj}$ use Smooth BCE, and $L_{theta}$ is the angle loss, computed with CSL using a Gaussian window. On the DroneVehicle dataset, RemoteDet-Mamba achieves ~81.8% mAP with ~71.3M parameters at ~24 FPS, surpassing the compared state-of-the-art methods while keeping computational overhead low, which suggests it is practical for real-time multimodal remote sensing applications.
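To make the loss weighting concrete, here is a minimal Python sketch, not the authors' implementation. `total_loss` applies the weights quoted above to four loss terms that are assumed to be computed elsewhere. `csl_target` builds a circular smooth label with a Gaussian window for one angle bin; its defaults (180 bins, a window radius of 6 bins, `sigma=6.0`) are illustrative assumptions and are not taken from the summary above.

```python
import math


def total_loss(l_box, l_obj, l_cls, l_theta,
               alpha=0.05, beta=1.0, gamma=0.5, delta=0.5):
    """Weighted sum of the four loss terms, using the weights quoted above."""
    return alpha * l_box + beta * l_obj + gamma * l_cls + delta * l_theta


def csl_target(angle_bin, num_bins=180, radius=6, sigma=6.0):
    """Circular smooth label with a Gaussian window.

    num_bins, radius and sigma are hypothetical defaults chosen for
    illustration only.
    """
    target = []
    for i in range(num_bins):
        diff = abs(i - angle_bin)
        # Circular distance: bin 0 and bin num_bins - 1 are neighbours.
        dist = min(diff, num_bins - diff)
        if dist <= radius:
            target.append(math.exp(-dist ** 2 / (2 * sigma ** 2)))
        else:
            target.append(0.0)
    return target
```

Because the window wraps around, `csl_target(0)` assigns the same weight to bins 1 and 179, so an angle prediction just across the periodic boundary is penalised no more than one just inside it.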
Abstract
Unmanned Aerial Vehicle (UAV) remote sensing, with its advantages of rapid information acquisition and low cost, has been widely applied in scenarios such as emergency response. However, because of the long imaging distance and complex imaging mechanisms, targets in remote sensing images are often small, densely distributed, and hard to tell apart across classes. To address these issues, this paper proposes RemoteDet-Mamba, a multi-modal remote sensing object detection network based on a patch-level four-direction selective scanning fusion strategy. The method learns unimodal local features and fuses cross-modal, patch-level global semantic information at the same time, which makes small objects easier to distinguish and improves inter-class discrimination. Furthermore, the lightweight fusion mechanism effectively separates densely packed targets while reducing computational complexity. Experimental results on the DroneVehicle dataset show that RemoteDet-Mamba outperforms current mainstream methods in detection performance while keeping the parameter count and computational overhead low, indicating promising potential for practical applications.
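The four-direction patch-level scanning mentioned above can be illustrated with a small Python sketch. It assumes the usual SS2D convention of row-major, column-major, and the two reversed orders; the abstract does not spell out the exact ordering, and the function name `four_direction_scans` is hypothetical. Only the sequence construction is shown, not the selective state-space update or the cross-modal fusion itself.

```python
def four_direction_scans(grid):
    """Flatten an H x W grid of patch tokens into four 1-D scan orders.

    grid is a list of H rows, each a list of W tokens. The ordering
    (row-major, column-major, and their reverses) is an assumption
    based on the common SS2D convention.
    """
    height, width = len(grid), len(grid[0])
    row_major = [grid[r][c] for r in range(height) for c in range(width)]
    col_major = [grid[r][c] for c in range(width) for r in range(height)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]


# A 2 x 3 grid of patch labels stands in for patch feature vectors.
patches = [["a", "b", "c"],
           ["d", "e", "f"]]
for order in four_direction_scans(patches):
    print("".join(order))
# prints: abcdef, adbecf, fedcba, fcebda (one per line)
```

Each sequence has length H x W, so processing all four with a linear-time sequence model keeps the overall cost linear in the number of patches, which is consistent with the efficiency claim in the abstract.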
