AlignFreeNet: Is Cross-Modal Pre-Alignment Necessary? An End-to-End Alignment-Free Lightweight Network for Visible-Infrared Object Detection
Dingkun Zhu, Haote Zhang, Lipeng Gu, Wuzhou Quan, Fu Lee Wang, Honghui Fan, Jiali Tang, Haoran Xie, Xiaoping Zhang, Mingqiang Wei
TL;DR
This paper tackles cross-modal misalignment in visible-infrared object detection by proposing AlignFreeNet, an alignment-free fusion framework. It introduces two core modules, Variation-guided Cross-Modal Compensation (VCC) and Frequency-guided Cross-modal Fusion (FCF), which use Haar wavelet transforms and 2D selective scanning (SS2D) to enhance per-modality representations and fuse features robustly without explicit alignment. On three challenging datasets (DVTOD, M3FD, DroneVehicle), AlignFreeNet achieves state-of-the-art performance under severe mixed misalignments while remaining lightweight. The work shows that alignment-free fusion, guided by modality variations and frequency-domain gating, can outperform traditional alignment-based strategies in visible-infrared object detection.
Abstract
Cross-modal misalignments, such as spatial offsets, resolution discrepancies, and semantic deficiencies, frequently occur in visible-infrared object detection (VI-OD). To mitigate them, existing methods typically adopt an alignment-based fusion paradigm, in which an explicit pixel- or feature-level alignment module is inserted before cross-modal fusion. However, pixel-level alignment struggles to cope with severe or mixed misalignments, whereas feature-level alignment often introduces undesirable noise into the fused representations under such conditions, ultimately limiting detection performance. In this paper, we propose a novel alignment-free network (AlignFreeNet) for VI-OD. Unlike prior methods, AlignFreeNet abandons explicit alignment altogether and instead adopts an alignment-free fusion paradigm. Specifically, AlignFreeNet comprises two core modules: variation-guided cross-modal compensation (VCC) and frequency-guided cross-modal fusion (FCF). VCC adaptively feeds compensation information derived from cross-modal discrepancies back into each modality, enhancing the visible and infrared representations without the noise caused by explicit alignment. FCF achieves robust cross-modal fusion by suppressing task-irrelevant redundancy via frequency-domain gating, which mitigates the noise introduced during fusion. Moreover, VCC and FCF jointly exploit low- and high-frequency cues to preserve foreground contours in the fused representations, mitigating the cross-modal blending caused by severe mixed misalignments. Extensive evaluations on DVTOD, M3FD, and DroneVehicle demonstrate that AlignFreeNet achieves state-of-the-art performance under severe mixed misalignment conditions, highlighting its robustness and generalization.
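To make the "low- and high-frequency cues" concrete, the following is a minimal sketch of a single-level 2D Haar wavelet transform, the kind of decomposition the TL;DR says the modules build on. It is an illustration only, not the authors' implementation: the function names `haar_dwt2` and `haar_idwt2`, the sub-band names, the orthonormal scaling, and the use of plain Python lists are all assumptions made here, and the sketch does not model VCC, FCF, or any gating.

```python
def haar_dwt2(x):
    """Single-level 2D Haar transform of a list-of-lists image (even height and width).

    Returns four half-resolution sub-bands (hypothetical names):
      low    - local averages (low-frequency content)
      high_w - differences across columns (responds to vertical edges)
      high_h - differences across rows (responds to horizontal edges)
      high_d - diagonal differences
    """
    rows, cols = len(x), len(x[0])
    if rows % 2 or cols % 2:
        raise ValueError("height and width must be even")
    low, high_w, high_h, high_d = [], [], [], []
    for i in range(0, rows, 2):
        l_row, w_row, h_row, d_row = [], [], [], []
        for j in range(0, cols, 2):
            # The four pixels of one non-overlapping 2x2 block.
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            l_row.append((a + b + c + d) / 2)
            w_row.append((a - b + c - d) / 2)
            h_row.append((a + b - c - d) / 2)
            d_row.append((a - b - c + d) / 2)
        low.append(l_row)
        high_w.append(w_row)
        high_h.append(h_row)
        high_d.append(d_row)
    return low, high_w, high_h, high_d


def haar_idwt2(low, high_w, high_h, high_d):
    """Invert haar_dwt2 exactly, rebuilding the full-resolution image."""
    out = []
    for l_row, w_row, h_row, d_row in zip(low, high_w, high_h, high_d):
        top, bottom = [], []
        for l, w, h, d in zip(l_row, w_row, h_row, d_row):
            top += [(l + w + h + d) / 2, (l - w + h - d) / 2]
            bottom += [(l + w - h - d) / 2, (l - w - h + d) / 2]
        out += [top, bottom]
    return out


# Left 2x2 block is flat; right 2x2 block contains a vertical edge.
image = [[4, 4, 0, 8],
         [4, 4, 0, 8]]
low, high_w, high_h, high_d = haar_dwt2(image)
print(low)     # [[8.0, 8.0]]
print(high_w)  # [[0.0, -8.0]]
```

In this toy input the flat block produces only a low-frequency response, while the block containing an edge also produces a non-zero high-frequency coefficient, which is why high-frequency sub-bands are a natural carrier for contour information. Because the transform is exactly invertible, splitting features into sub-bands loses no information on its own.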
