AlignFreeNet: Is Cross-Modal Pre-Alignment Necessary? An End-to-End Alignment-Free Lightweight Network for Visible-Infrared Object Detection

Dingkun Zhu, Haote Zhang, Lipeng Gu, Wuzhou Quan, Fu Lee Wang, Honghui Fan, Jiali Tang, Haoran Xie, Xiaoping Zhang, Mingqiang Wei

TL;DR

This paper tackles cross-modal misalignment in visible-infrared object detection by proposing AlignFreeNet, an alignment-free fusion framework. It introduces two core modules, Variation-guided Cross-Modal Compensation (VCC) and Frequency-guided Cross-Modal Fusion (FCF), which use Haar wavelet transforms and SS2D to enhance per-modality representations and robustly fuse features without explicit alignment. Across three challenging datasets (DVTOD, M3FD, DroneVehicle), AlignFreeNet achieves state-of-the-art performance under severe mixed misalignments while remaining lightweight. The work demonstrates that alignment-free fusion, guided by modality variations and frequency-domain gating, can outperform traditional alignment-based strategies in multimodal perception tasks.

Abstract

Cross-modal misalignments, such as spatial offsets, resolution discrepancies, and semantic deficiencies, frequently occur in visible-infrared object detection (VI-OD). To mitigate this, existing methods are typically adapted into an alignment-based fusion paradigm, in which an explicit pixel- or feature-level alignment module is inserted before cross-modal fusion. However, pixel-level alignment struggles to cope with severe or mixed misalignments, whereas feature-level alignment often introduces undesirable noise into fused representations under such conditions, ultimately limiting detection performance. In this paper, we propose a novel alignment-free network (AlignFreeNet) for VI-OD. Differing from prior methods, AlignFreeNet abandons any explicit alignment and instead adopts an alignment-free fusion paradigm. Specifically, AlignFreeNet comprises two core modules: variation-guided cross-modal compensation (VCC) and frequency-guided cross-modal fusion (FCF). VCC adaptively feeds the compensated information derived from cross-modal discrepancies back into each modality, enhancing visible and infrared representations without the noise caused by explicit alignment. FCF achieves robust cross-modal fusion by suppressing task-irrelevant redundancy via frequency-domain gating, effectively mitigating noise introduced in the process. Moreover, VCC and FCF jointly exploit low- and high-frequency cues to preserve foreground contours in fused representations, effectively mitigating cross-modal blending caused by severe mixed misalignments. Extensive evaluations on DVTOD, M3FD, and DroneVehicle demonstrate that our AlignFreeNet achieves state-of-the-art performance under severe mixed misalignment conditions, highlighting its robustness and generalization.
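The low- and high-frequency cues mentioned above can be made concrete with a single-level 2D Haar wavelet transform, the transform named in the TL;DR. The following is only a minimal NumPy sketch of that generic transform, not the authors' VCC or FCF modules: the function names `haar_dwt2` and `haar_idwt2`, the orthonormal scaling, and the sub-band naming are assumptions made for illustration.

```python
import numpy as np


def haar_dwt2(x):
    """Single-level 2D Haar transform of a 2D array with even height and width.

    Returns (ll, lh, hl, hh). The sub-band naming is one common convention
    and is an assumption here, not taken from the paper.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a + b - c - d) / 2.0  # top row minus bottom row: horizontal edges
    hl = (a - b + c - d) / 2.0  # left column minus right column: vertical edges
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2; reconstructs the original array exactly."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x


# Toy "feature map" with a vertical step edge between columns 0 and 1.
feature = np.zeros((4, 4))
feature[:, 1:] = 1.0

ll, lh, hl, hh = haar_dwt2(feature)
restored = haar_idwt2(ll, lh, hl, hh)
```

In this toy input the vertical edge appears only in the `hl` sub-band, while `ll` keeps a half-resolution summary of the map. Contour information of this kind lives in the high-frequency sub-bands, which is why a method that processes them separately can, in principle, keep object boundaries from being blurred during fusion.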

Paper Structure

This paper contains 21 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Illustration of cross-modal misalignments. Common misalignments include: (a) spatial offsets, caused by variations in acquisition conditions such as capture angle and timing; (b) resolution discrepancies, arising from differences in sensor resolutions and focal lengths; and (c) semantic deficiencies, stemming from inherent inconsistencies between visible and infrared spectra.
  • Figure 2: Illustration of our method compared with the three existing fusion paradigms. Existing fusion methods typically adopt an alignment-based paradigm, in which pixels or features are explicitly aligned before cross-modal fusion. In contrast, our method leverages modality-variation- and wavelet-guided cues to achieve an alignment-free fusion in the mid-to-late fusion stages, effectively addressing mixed types of misalignment. This design eliminates the dependence on explicit alignment modules while substantially enhancing robustness.
  • Figure 3: The overall structure of our proposed alignment-free network (AlignFreeNet). The backbone contains three separate VCC and FCF modules for feature fusion. In VCC, we conduct the major cross-modality interaction of our network based on modality variation; the specific algorithm is provided in Section C. In FCF, we use the variation-enhanced features from VCC as the basis for selectively fusing features through a frequency-domain gating mechanism. Outputs from VCC are sent back to the backbone, and outputs from FCF serve as inputs to the detection head.
  • Figure 4: Comparison of our model with and without wavelet transform processing. Our wavelet structure prevents misalignment from causing targets to merge into unrelated background or into other targets.
  • Figure 5: Visual comparison of our method with CFT, CMA-Det, and ICAFusion on the DVTOD dataset. Blue triangles mark missed targets; green triangles mark false detections.
  • ...and 2 more figures