On Modality Incomplete Infrared-Visible Object Detection: An Architecture Compatibility Perspective

Shuo Yang, Yinghui Xing, Shizhou Zhang, Zhilong Niu

TL;DR

This work tackles modality-incomplete infrared–visible object detection by reframing robustness as an architecture-compatibility problem. It introduces Scarf-DETR, a DETR-based detector equipped with a plug-and-play Scarf Neck. The neck is built around Modality-Agnostic Deformable Attention (MADA), which fuses features across modalities when both are present and enhances the available modality's features when one is missing, during both training and inference. A pseudo modality dropout strategy keeps training diverse when modalities may be missing. The authors also create the MI benchmark suite (FLIR-MI, M$^3$FD-MI, LLVIP-MI) to evaluate detection performance when the dominant or the secondary modality is missing and across partial modality mixes. Empirically, Scarf-DETR delivers substantial gains in modality-incomplete scenarios (e.g., LLVIP VIS-only improvements up to 55.5% mAP) while remaining competitive on complete-modality tasks across multiple datasets, demonstrating robust cross-modality adaptability with a simple, transferable neck design.

Abstract

Infrared and visible object detection (IVOD) is essential for numerous around-the-clock applications. Despite notable advances, current IVOD models suffer marked performance drops when confronted with incomplete modality data, particularly when the dominant modality is missing. In this paper, we thoroughly investigate the modality-incomplete IVOD problem from an architecture compatibility perspective. Specifically, we propose a plug-and-play Scarf Neck module for DETR variants, which introduces a modality-agnostic deformable attention mechanism that enables the IVOD detector to flexibly adapt to either a single modality or both modalities during training and inference. When training Scarf-DETR, we design a pseudo modality dropout strategy to fully utilize the multi-modality information, making the detector compatible with and robust in both single-modality and dual-modality working modes. Moreover, we introduce a comprehensive benchmark for the modality-incomplete IVOD task that thoroughly assesses situations where the absent modality is either dominant or secondary. Our proposed Scarf-DETR not only performs excellently in missing-modality scenarios but also achieves superior performance on the standard modality-complete IVOD benchmarks. Our code will be available at https://github.com/YinghuiXing/Scarf-DETR.

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Illustration of the modality-incomplete issue in IVOD and the need to address it. Existing multi-modal detectors fail to detect targets when a modality is missing, while Scarf-DETR is compatible with both complete and incomplete modality scenarios.
  • Figure 2: Overview of Scarf Neck. (a) In the complete-modality scenario, the Scarf Neck updates features of both modalities by considering intra-modal enhancement and inter-modal feature interaction. (b) In the modality-missing scenario, the Scarf Neck focuses on intra-modal feature enhancement. (c) Overview of our plug-and-play Scarf Neck.
  • Figure 3: Details of Modality-Agnostic Deformable Attention. In the multi-modality case, we obtain $\{\Delta \boldsymbol{p}^{v}_{q,\cdot}, \boldsymbol{A}^{v}_{q,\cdot}\}$ and $\{\Delta \boldsymbol{p}^{t}_{q,\cdot}, \boldsymbol{A}^{t}_{q,\cdot}\}$ to update visible features and infrared features respectively. If one modality is missing, we obtain $\{\Delta\boldsymbol{p}^S_{q,1},\boldsymbol{A}^S_{q,1}\}$ and $\{\Delta\boldsymbol{p}^S_{q,2},\boldsymbol{A}^S_{q,2}\}$ to update the available features.
  • Figure 4: Illustration of joint training with (a) full modality paired data, (b) vanilla modality dropout strategy, and (c) our proposed pseudo modality dropout strategy.
  • Figure 5: Visualization comparison in three scenarios. In each case, the first row shows results from the model trained without dropout, and the second row shows results from the model trained with a 60% pseudo dropout ratio. The Scarf Neck combined with the pseudo modality dropout strategy greatly improves the model's performance. Best viewed in color and with zoom for clarity.
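
Figure 4(b) refers to a vanilla modality dropout baseline, which this summary does not spell out. As a rough illustration only, the sketch below shows the generic form of that idea: with some probability, one modality of a paired sample is discarded before it reaches the detector. The function name, the `None` placeholder for a missing modality, and the uniform choice of which modality to drop are all assumptions made for this example. It does not reproduce the authors' pseudo modality dropout, whose details are not given here.

```python
import random
from typing import Optional, Tuple, TypeVar

T = TypeVar("T")  # any image-like object (array, tensor, file path, ...)


def vanilla_modality_dropout(
    visible: T,
    infrared: T,
    drop_ratio: float,
    rng: random.Random,
) -> Tuple[Optional[T], Optional[T]]:
    """Hypothetical helper: return (visible, infrared) with at most one set to None."""
    if not 0.0 <= drop_ratio <= 1.0:
        raise ValueError("drop_ratio must lie in [0, 1]")
    # With probability 1 - drop_ratio, keep the full pair untouched.
    if rng.random() >= drop_ratio:
        return visible, infrared
    # Otherwise drop exactly one modality; the 50/50 split is an assumption.
    if rng.random() < 0.5:
        return None, infrared
    return visible, None


if __name__ == "__main__":
    rng = random.Random(0)
    batch = [vanilla_modality_dropout("vis.png", "ir.png", 0.6, rng) for _ in range(5)]
    for vis, ir in batch:
        print(vis, ir)
```

With this kind of baseline, a dropped modality's information is simply unused for that training step, which is the limitation the pseudo variant in Figure 4(c) is described as addressing.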