DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion

Junjie Guo, Chenqiang Gao, Fangcen Liu, Deyu Meng, Xinbo Gao

TL;DR

This work tackles robust infrared-visible object detection when the complementarity between modalities varies dynamically and the two images are misaligned. It introduces DAMSDet, which combines Modality Competitive Query Selection (MCQS), assigning each initial query to the dominant modality for its object, with a Multispectral Deformable Cross-attention (MDCA) mechanism that fuses cross-modal features across multiple semantic levels within a cascade DETR framework. Experiments on four public datasets show substantial improvements over state-of-the-art methods, and ablation studies validate the effectiveness of both MCQS and MDCA. The approach improves full-day detection performance and robustness to misalignment, with practical implications for multispectral sensing applications.

Abstract

Infrared-visible object detection aims to achieve robust, even full-day, object detection by fusing the complementary information in infrared and visible images. However, the complementary characteristics of the two modalities vary dynamically, and modality misalignment is common; both make fusion difficult. In this paper, we propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to address these two challenges simultaneously. Specifically, we propose a Modality Competitive Query Selection strategy that provides useful prior information by dynamically selecting the basic salient modality feature representation for each object. To effectively mine complementary information and adapt to misalignment, we propose a Multispectral Deformable Cross-attention module that adaptively samples and aggregates multi-semantic-level features of infrared and visible images for each object. In addition, we adopt the cascade structure of DETR to better mine complementary information. Experiments on four public datasets covering different scenes demonstrate significant improvements over other state-of-the-art methods. The code will be released at https://github.com/gjj45/DAMSDet.
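
The competitive selection idea can be illustrated with a minimal sketch. This is not the released implementation: it assumes each modality's encoder yields a set of token features with a scalar score per token, and it simply takes the top-k tokens from the pooled set so that the modality with the stronger response for an object supplies the query. The function name, argument names, and array shapes are hypothetical.

```python
import numpy as np

def competitive_query_selection(ir_feats, vis_feats, ir_scores, vis_scores, k):
    """Pick the k highest-scoring encoder tokens across BOTH modalities.

    ir_feats, vis_feats: (N, D) arrays of encoded tokens (hypothetical shapes).
    ir_scores, vis_scores: (N,) arrays of per-token scores.
    Returns (selected features of shape (k, D),
             modality flag per query with 0 = infrared and 1 = visible,
             indices into the pooled token set).
    """
    # Pool tokens from both modalities so they compete in a single ranking.
    feats = np.concatenate([ir_feats, vis_feats], axis=0)     # (2N, D)
    scores = np.concatenate([ir_scores, vis_scores], axis=0)  # (2N,)
    # Stable descending sort so ties resolve deterministically.
    order = np.argsort(-scores, kind="stable")[:k]
    # Indices at or beyond N belong to the visible half of the pooled set.
    modality = (order >= ir_feats.shape[0]).astype(int)
    return feats[order], modality, order
```

In this sketch an object that is salient only in the infrared image contributes an infrared query, and vice versa; how the scores are actually produced is left outside the example.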

Paper Structure (14 sections, 4 equations, 6 figures, 5 tables)

This paper contains 14 sections, 4 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Illustrations of two typical challenges in infrared-visible object detection. (a) Three pedestrians exhibit different complementary characteristics. In this example, the objects in the visible image provide unhelpful interfering information (red), partially complementary information (blue), and fully complementary information (green). (b) An example of the misalignment problem, in which the ground-truth boxes of the same objects in the infrared and visible images are clearly displaced from each other. Such misalignment is common in infrared-visible image pairs. We propose a Multispectral Transformer Decoder with a Multispectral Deformable Cross-attention module to address these two challenges simultaneously.
  • Figure 2: Overview of DAMSDet. Our DAMSDet comprises four main components: two modality-specific CNN backbones to extract features, two modality-specific Efficient Encoders [lv2023detrs] to encode features, a Modality Competitive Query Selection module to select initial object queries, and a Multispectral Transformer Decoder to mine complementary information and refine queries.
  • Figure 3: Visualization of Modality Competitive Query Selection results. Red points indicate high-score queries selected in the corresponding modality image, while blue points represent lower-scoring queries. The red boxes indicate the objects represented by high-score queries.
  • Figure 4: The structure of the Multispectral Transformer Decoder (the DeNoising Training Group is omitted in the figure) and the Multispectral Deformable Cross-attention module.
  • Figure 5: Visualization of feature sampling at different semantic levels in different decoder layers. Point colors indicate the semantic level sampled: blue, green, and red represent sampling points on low-level, middle-level, and high-level semantic feature maps, respectively. Brighter and larger points indicate relatively high attention weights. (a) The green boxes in the visible image represent aligned bounding boxes, showing that the sampling points in each modality are concentrated on the correct instance locations. (b) Objects occluded by background and smoke tend to focus predominantly on the infrared modality in subsequent decoder layers. (c) Objects in good illumination conditions and those less distinguishable in the infrared modality tend to focus predominantly on the visible modality in subsequent decoder layers.
  • ...and 1 more figures
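
The attention-weighted aggregation visualized in Figure 5 can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's module: it assumes the features have already been sampled at the predicted points (the offset prediction and bilinear sampling steps are omitted), and it assumes the attention weights are normalized jointly over modalities, semantic levels, and sampling points, so that one modality can dominate a given object. The function name and tensor shapes are hypothetical.

```python
import numpy as np

def fuse_sampled_features(values, logits):
    """Aggregate sampled features for one query with a joint softmax.

    values: (M, L, P, D) features sampled at P points on L levels of M modalities.
    logits: (M, L, P) unnormalized attention logits for the same samples.
    Returns the fused (D,) feature and the (M, L, P) attention weights.
    """
    flat = logits.reshape(-1)
    # Subtract the max before exponentiating for numerical stability.
    weights = np.exp(flat - flat.max())
    weights /= weights.sum()
    weights = weights.reshape(logits.shape)
    # Weighted sum over modalities, levels, and points.
    fused = (weights[..., None] * values).sum(axis=(0, 1, 2))
    return fused, weights
```

Because the normalization in this sketch is shared across modalities, summing the weights over the level and point axes gives a per-modality share of attention, which is one way to read the "predominantly infrared" and "predominantly visible" cases described in the caption.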