DMM: Disparity-guided Multispectral Mamba for Oriented Object Detection in Remote Sensing

Minghang Zhou, Tianyu Li, Chaofan Qiao, Dongyu Xie, Guoqing Wang, Ningjuan Ruan, Lin Mei, Yang Yang

TL;DR

DMM tackles inter-modal and intra-modal discrepancies in multispectral oriented object detection by combining a Disparity-guided Cross-modal Fusion Mamba (DCFM) module with a Multi-scale Target-aware Attention (MTA) module and a Target-Prior Aware (TPA) auxiliary task within a flexible dual-stream detector. The DCFM module uses RGB-IR disparity to steer fusion, while the MTA module enhances RGB features by focusing on target regions across scales, guided by TPA losses that leverage single-modal supervision. Extensive experiments on DroneVehicle and VEDAI show state-of-the-art performance with improved efficiency over transformer-based fusion approaches, and ablations confirm the contribution of each module. The approach achieves strong generalization and scales to high-resolution remote sensing data, indicating practical impact for robust multispectral oriented object detection.

Abstract

Multispectral oriented object detection faces challenges due to both inter-modal and intra-modal discrepancies. Recent studies often rely on transformer-based models to address these issues and achieve cross-modal fusion detection. However, the quadratic computational complexity of transformers limits their performance. Inspired by the efficiency and lower complexity of Mamba in long sequence tasks, we propose Disparity-guided Multispectral Mamba (DMM), a multispectral oriented object detection framework comprised of a Disparity-guided Cross-modal Fusion Mamba (DCFM) module, a Multi-scale Target-aware Attention (MTA) module, and a Target-Prior Aware (TPA) auxiliary task. The DCFM module leverages disparity information between modalities to adaptively merge features from RGB and IR images, mitigating inter-modal conflicts. The MTA module aims to enhance feature representation by focusing on relevant target regions within the RGB modality, addressing intra-modal variations. The TPA auxiliary task utilizes single-modal labels to guide the optimization of the MTA module, ensuring it focuses on targets and their local context. Extensive experiments on the DroneVehicle and VEDAI datasets demonstrate the effectiveness of our method, which outperforms state-of-the-art methods while maintaining computational efficiency. Code will be available at https://github.com/Another-0/DMM.
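
To make the complexity argument above concrete, the following minimal sketch compares how rough operation counts grow with sequence length for pairwise self-attention and for a linear-time recurrent scan of the kind Mamba relies on. The cost formulas are textbook approximations, and the channel width (96), state size (16), and feature-map sizes are illustrative assumptions, not values taken from the paper.

```python
def tokens_for_feature_map(height: int, width: int) -> int:
    """Number of tokens when every spatial position becomes one token."""
    return height * width


def attention_cost(num_tokens: int, dim: int) -> int:
    """Approximate multiply-adds for the QK^T and attention-times-V products.

    Both products touch every token pair, so the cost is quadratic in num_tokens.
    """
    return 2 * num_tokens * num_tokens * dim


def scan_cost(num_tokens: int, dim: int, state_size: int) -> int:
    """Approximate multiply-adds for a recurrent scan with one state update
    per token, so the cost is linear in num_tokens."""
    return num_tokens * dim * state_size


# Assumed values, for illustration only.
DIM = 96
STATE_SIZE = 16

small = tokens_for_feature_map(32, 32)
large = tokens_for_feature_map(64, 64)  # twice the side length, four times the tokens

attention_growth = attention_cost(large, DIM) // attention_cost(small, DIM)
scan_growth = scan_cost(large, DIM, STATE_SIZE) // scan_cost(small, DIM, STATE_SIZE)

print("attention cost grows by a factor of", attention_growth)
print("scan cost grows by a factor of", scan_growth)
```

Doubling the side of a feature map quadruples the token count, so under these approximations the attention cost grows 16-fold while the scan cost grows only 4-fold. This gap widens further on the high-resolution imagery common in remote sensing.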

Paper Structure (14 sections, 9 equations, 9 figures, 4 tables)

Figures (9)

  • Figure 1: Modality disparities between RGB and IR images. Bounding boxes of different colors represent different categories. Bounding boxes from the RGB modality are drawn with solid lines, and those from the IR modality with dashed lines. (a) shows examples of inter-modal disparities. Target mismatch arises from the varying visibility of targets across modalities. Category conflict indicates that these differences cause confusion in manual annotation. Due to calibration errors of the capturing devices, paired images are not perfectly aligned. Thermal crossover in infrared imaging produces ghost shadows in the IR images. (b) shows the intra-modal disparities in RGB images. Uneven illumination can generate a lot of misleading information. C2Former [yuan2024c] incorrectly focuses on these anomalous regions, whereas our method suppresses background noise and pays more attention to the regions of interest with targets.
  • Figure 2: The overview of our proposed DMM method. The input dual-modal features are first projected into a high-dimensional space through convolution, followed by feature extraction using VSS blocks. Each VSS block is cascaded with a downsampling layer to reduce the feature map size. The features of different sizes generated by the VSS blocks in the upper stream and lower stream are fed into the MTA module and the DCFM module, respectively. The output of the MTA is fed into the TPA head to assess the quality of the MTA-enhanced features. The output of the DCFM is directed to the subsequent detection head. Given the critical role of the FPN structure in both one-stage and two-stage algorithms, it is included as part of the detection head architecture. On the far right, we present the structure of the VSS block within the backbone and the SS2D mechanism at the lowest module level, derived from the V9 architecture of VMamba.
  • Figure 3: The structure of our proposed DCFM module. The DCFM module projects RGB and IR features to higher dimensions and combines them using the Disparity-guided Selective Scan Module (DSSM). The Channel Attention Block (CAB) enhances feature representation, and the DSSM refines and merges features.
  • Figure 4: The structure of our proposed MTA Module. The outputs are fed to both the DCFM module for high-precision fusion and the TPA module for loss calculation.
  • Figure 5: Visualization of prediction results on the DroneVehicle dataset, with a confidence threshold set to 0.5. The detection boxes in RGB and IR represent ground truths. The comparison of detection results within the blue dashed circles indicates that our method demonstrates superior visual performance for each category.
  • ...and 4 more figures
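
The multi-scale, dual-stream layout described in the Figure 2 and Figure 4 captions can be sketched as simple shape bookkeeping. This is an illustration only: the stem stride of 4, the 2x downsampling per stage, the four stages, and the 512x640 input are assumptions chosen for the example, and the helper names are hypothetical rather than taken from the DMM code.

```python
from typing import Dict, List, Tuple


def stage_sizes(height: int, width: int,
                stem_stride: int = 4, num_stages: int = 4) -> List[Tuple[int, int]]:
    """Feature-map size after each stage, assuming the stem reduces the input
    by stem_stride and every later stage halves the resolution again."""
    sizes = []
    stride = stem_stride
    for _ in range(num_stages):
        sizes.append((height // stride, width // stride))
        stride *= 2
    return sizes


def dual_stream_plan(height: int, width: int) -> List[Dict[str, object]]:
    """Per-stage routing as read from the captions: RGB features pass through
    MTA, the MTA output goes to both the TPA head and DCFM, and DCFM fuses it
    with the IR features of the same stage."""
    plan = []
    for index, (h, w) in enumerate(stage_sizes(height, width), start=1):
        plan.append({
            "stage": index,
            "size": (h, w),
            "tokens": h * w,  # sequence length if each position is one token
            "mta_input": "rgb",
            "tpa_input": "mta_output",
            "dcfm_inputs": ("mta_output", "ir"),
        })
    return plan


# Hypothetical input resolution, used only to show the bookkeeping.
for entry in dual_stream_plan(512, 640):
    print(entry["stage"], entry["size"], entry["tokens"])
```

Under these assumptions the shallowest stage carries the longest token sequence, which is where a fusion module with linear rather than quadratic cost matters most.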