On Modality Incomplete Infrared-Visible Object Detection: An Architecture Compatibility Perspective
Shuo Yang, Yinghui Xing, Shizhou Zhang, Zhilong Niu
TL;DR
This work tackles modality-incomplete infrared–visible object detection by reframing robustness as an architecture-compatibility problem. It introduces Scarf-DETR, a DETR-style detector equipped with a plug-and-play Scarf Neck. The neck is built around Modality-Agnostic Deformable Attention (MADA), which flexibly fuses or enhances features from either modality during training and inference. A pseudo modality dropout strategy keeps training diverse when modalities may be missing. The authors also create the MI benchmark suite (FLIR-MI, M$^3$FD-MI, LLVIP-MI) to evaluate detection performance when the dominant or the secondary modality is missing, and across partial modality mixes. Empirically, Scarf-DETR delivers substantial gains in modality-incomplete scenarios (e.g., improvements of up to 55.5% mAP in the LLVIP VIS-only setting) while remaining competitive on complete-modality tasks across multiple datasets, which indicates that a simple, transferable neck design can provide robust cross-modality adaptability.
Abstract
Infrared and visible object detection (IVOD) is essential for numerous around-the-clock applications. Despite notable advances, current IVOD models suffer marked performance drops when modality data is incomplete, particularly when the dominant modality is missing. In this paper, we thoroughly investigate the modality-incomplete IVOD problem from an architecture compatibility perspective. Specifically, we propose a plug-and-play Scarf Neck module for DETR variants, which introduces a modality-agnostic deformable attention mechanism that enables the IVOD detector to adapt flexibly to single-modality or dual-modality input during both training and inference. To train Scarf-DETR, we design a pseudo modality dropout strategy that fully utilizes the multi-modality information, making the detector compatible with and robust in both the single-modality and dual-modality working modes. Moreover, we introduce a comprehensive benchmark for the modality-incomplete IVOD task that thoroughly assesses situations in which the absent modality is either dominant or secondary. Our proposed Scarf-DETR not only performs well in missing-modality scenarios but also achieves superior performance on the standard modality-complete IVOD benchmarks. Our code will be available at https://github.com/YinghuiXing/Scarf-DETR.
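
The abstract does not spell out how the pseudo modality dropout strategy works. As a rough illustration of the general idea behind modality dropout only, and not the authors' pseudo variant, the sketch below randomly hides at most one modality per training sample so that a detector would see infrared-only, visible-only, and paired inputs. The function name `modality_dropout`, the parameter `p_drop`, and the use of `None` to mark a missing modality are all assumptions made for this example.

```python
import random
from typing import Optional, Sequence, Tuple

# Hypothetical stand-in for a per-sample feature or image tensor.
Feature = Sequence[float]


def modality_dropout(
    ir: Feature,
    vis: Feature,
    p_drop: float = 0.3,
    rng: Optional[random.Random] = None,
) -> Tuple[Optional[Feature], Optional[Feature]]:
    """Generic modality dropout (illustrative, not the paper's method).

    With probability p_drop, exactly one modality (chosen uniformly) is
    replaced by None; otherwise both are returned unchanged. At least one
    modality always survives.
    """
    if not 0.0 <= p_drop <= 1.0:
        raise ValueError("p_drop must be in [0, 1]")
    source = rng if rng is not None else random
    # random() lies in [0, 1), so p_drop=0 never drops and p_drop=1 always drops.
    if source.random() >= p_drop:
        return ir, vis
    if source.random() < 0.5:
        return None, vis  # infrared hidden, visible kept
    return ir, None  # visible hidden, infrared kept


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(5):
        ir, vis = modality_dropout([0.1, 0.2], [0.3, 0.4], p_drop=0.5, rng=rng)
        print("IR present:", ir is not None, "| VIS present:", vis is not None)
```

A downstream neck that accepts `None` for either input, as a modality-agnostic design would have to, could then be trained on all three input configurations from a single paired dataset.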
