MM-DETR: An Efficient Multimodal Detection Transformer with Mamba-Driven Dual-Granularity Fusion and Frequency-Aware Modality Adapters

Jianhong Han, Yupei Wang, Yuan Zhang, Liang Chen

TL;DR

This work tackles multimodal remote sensing object detection by addressing the high fusion cost and parameter burden of dual-stream backbones. It introduces MM-DETR, featuring a Mamba-based Dual-granularity Fusion Encoder (MDF-Encoder) that enables channel-wise, linear-complexity cross-modal interaction and a region-aware modality-completion branch for fine-grained fusion. A Lightweight Frequency-aware Modality Adapter (LFM-Adapter) replaces dual backbones with a shared backbone while capturing modality-specific cues through spatial and frequency experts balanced by a pixel-wise router. Extensive experiments on four public datasets demonstrate strong accuracy with an efficient design, confirming the method’s robustness and practical deployment potential.

Abstract

Multimodal remote sensing object detection aims to achieve more accurate and robust perception under challenging conditions by fusing complementary information from different modalities. However, existing approaches that rely on attention-based or deformable-convolution fusion blocks still struggle to balance performance and lightweight design. Beyond fusion complexity, extracting modality features with shared backbones yields suboptimal representations due to insufficient modality-specific modeling, whereas dual-stream architectures nearly double the parameter count, ultimately limiting practical deployment. To this end, we propose MM-DETR, a lightweight and efficient framework for multimodal object detection. Specifically, we design a Mamba-based dual-granularity fusion encoder that reformulates global interaction as channel-wise dynamic gating and leverages a 1D selective scan for efficient cross-modal modeling with linear complexity. Following this design, we further reinterpret multimodal fusion as a modality completion problem. A region-aware 2D selective scanning completion branch is introduced to recover modality-specific cues, supporting fine-grained fusion along a bidirectional pyramid pathway with minimal overhead. To further reduce parameter redundancy while retaining strong feature extraction capability, a lightweight frequency-aware modality adapter is inserted into the shared backbone. This adapter employs a spatial-frequency co-expert structure to capture modality-specific cues, while a pixel-wise router dynamically balances expert contributions for efficient spatial-frequency fusion. Extensive experiments conducted on four multimodal benchmark datasets demonstrate the effectiveness and generalization capability of the proposed method.
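
To make the idea of channel-wise gating with a linear-cost 1D scan more concrete, the following is a minimal sketch and not the paper's encoder. It assumes each modality has already been reduced to one descriptor per channel (for example by global pooling), and it replaces the Mamba selective scan with a plain input-dependent gated recurrence. The function name `gated_channel_scan` and the parameters `w_gate` and `w_in` are hypothetical, introduced only for illustration.

```python
import math


def sigmoid(v):
    """Logistic function mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def gated_channel_scan(rgb, ir, w_gate=1.0, w_in=1.0):
    """Fuse two per-channel descriptors with one left-to-right scan.

    rgb, ir: lists of floats, one value per channel (assumed to be
    pooled features; this is an illustrative simplification).
    Returns one fused value per channel. The loop visits each channel
    once, so the cost is linear in the number of channels.
    """
    if len(rgb) != len(ir):
        raise ValueError("both modalities need the same channel count")

    state = 0.0
    fused = []
    for x_rgb, x_ir in zip(rgb, ir):
        # Input-dependent gate: the IR value decides how much of the
        # running state is kept and how much is replaced by RGB input.
        keep = sigmoid(w_gate * x_ir)
        state = keep * state + (1.0 - keep) * (w_in * x_rgb)
        fused.append(state)
    return fused


if __name__ == "__main__":
    print(gated_channel_scan([2.0, 4.0], [0.0, 0.0]))
```

With zero IR descriptors the gate is 0.5 at every step, so the example call yields `[1.0, 2.5]`. A real selective scan would use learned, per-step state-space parameters rather than a single scalar gate. The point here is only that a data-dependent recurrence gives cross-modal interaction in one pass, without the quadratic cost of pairwise attention.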

Paper Structure

This paper contains 37 sections, 21 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Overall architecture of the proposed MM-DETR. Given paired RGB and IR images, a shared RT-DETR backbone equipped with lightweight frequency-aware modality adapters extracts modality-specific multi-scale features for both streams. These features are then fed into the Mamba-based Dual-granularity Fusion Encoder (MDF-Encoder). The MDF-Encoder first applies the commonality-enhancing interaction module to perform global cross-modal interaction, producing enhanced modality-shared representations. Subsequently, the modality-completion pyramid fusion module performs complementary fusion along a bidirectional pyramid, where the dashed arrows indicate an optional RGB-side completion path. Finally, the fused multi-scale tokens are passed to the transformer decoder and detection head to generate the final predictions.
  • Figure 2: Detailed architecture of the proposed Mamba-based Dual-Granularity Fusion Encoder (MDF-Encoder). At each pyramid level, the Commonality-Enhancing Interaction (CEI) module first performs channel-wise cross-modal interaction to reinforce modality-shared representations, producing the fused features. As illustrated by the dashed paths, an additional modality-completion branch is incorporated, where the Region-aware SS2D module enhances modality-specific residual cues from the IR stream and processes them through a lightweight gated mechanism to generate completion features. These completion cues are subsequently injected into both the top–down (FPN) and bottom–up (PAN) pathways via dedicated fusion blocks, enabling fine-grained complementary fusion across scales.
  • Figure 3: Illustration of the proposed Lightweight Frequency-Aware Modality Adapter (LFM-Adapter). Left: the spatial expert structure. Right: the frequency expert structure. Middle: a pixel-wise router predicts adaptive weights to fuse the outputs of the spatial, low-frequency, and high-frequency experts.
  • Figure 4: Parameter sensitivity analysis of adapter projector dimension. Results on multiple multimodal benchmarks demonstrate that the proposed LFM-Adapter remains robust across different projector dimensions.
  • Figure 5: Comparison of our detection results with those of state-of-the-art methods. Different object categories are marked using distinct colors, and the confidence threshold for visualization is set to 0.3. “Single-modal” denotes the base detector trained using only single-modality data, whereas the other labels correspond to their respective methods. “GT” indicates the ground-truth annotations.
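
The pixel-wise routing described in the Figure 3 caption can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' LFM-Adapter: the caption says only that the router predicts adaptive weights, so the per-pixel softmax normalization is an assumption, the router logits are taken as given rather than predicted by a learned layer, and the names `softmax` and `route_pixelwise` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_pixelwise(experts, router_logits):
    """Fuse K expert maps with per-pixel weights.

    experts: list of K feature maps, each an H x W nested list
        (here: spatial, low-frequency, high-frequency outputs).
    router_logits: H x W x K nested list of unnormalized scores
        (assumed given; a real router would predict them).
    Returns an H x W map where each pixel is a convex combination
    of the K expert values at that pixel.
    """
    height, width = len(experts[0]), len(experts[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            weights = softmax(router_logits[i][j])
            fused[i][j] = sum(
                w * expert[i][j] for w, expert in zip(weights, experts)
            )
    return fused


if __name__ == "__main__":
    spatial = [[1.0, 1.0]]
    low_freq = [[2.0, 2.0]]
    high_freq = [[3.0, 3.0]]
    # First pixel: equal scores. Second pixel: strongly favors "spatial".
    logits = [[[0.0, 0.0, 0.0], [100.0, 0.0, 0.0]]]
    print(route_pixelwise([spatial, low_freq, high_freq], logits))
```

In the example, the first pixel averages the three experts equally, while the second is dominated by the spatial expert. Because the weights are computed independently at each location, different regions of the same image can lean on different experts.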