FreDFT: Frequency Domain Fusion Transformer for Visible-Infrared Object Detection

Wencong Wu, Xiuwei Zhang, Hanlin Yin, Shun Dai, Hongxi Zhang, Yanning Zhang

TL;DR

FreDFT addresses modality imbalance in visible–infrared object detection by introducing a frequency-domain fusion transformer. It combines a local feature enhancement module (LFEM), a cross-modal global modeling module (CGMM), and a frequency-domain feature aggregation module (FDFAM) featuring a multimodal frequency domain attention (MFDA) and a frequency-domain feed-forward layer (FDFFL) to fuse cross-modal information. The MFDA replaces spatial-domain attention with frequency-domain correlations, while the FDFFL provides multi-scale frequency representations, enabling robust cross-modal fusion. Across FLIR, LLVIP, and M$^3$FD datasets, FreDFT achieves state-of-the-art results, demonstrating the practical viability of frequency-domain transformers for multispectral detection.
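
To make "frequency-domain correlations" concrete, here is a minimal NumPy sketch of the general idea: by the correlation theorem, the circular cross-correlation of two feature maps can be computed as an element-wise product in the Fourier domain instead of as a dense spatial attention matrix. This is an illustration only, not the authors' MFDA; the function names, the `(C, H, W)` layout, and the choice of which modality acts as query or key are all assumptions made for the example.

```python
import numpy as np


def frequency_cross_correlation(query: np.ndarray, key: np.ndarray) -> np.ndarray:
    """Per-channel circular cross-correlation of two (C, H, W) maps via FFT.

    Hypothetical helper for illustration; not taken from the FreDFT code.
    """
    q_freq = np.fft.fft2(query, axes=(-2, -1))
    k_freq = np.fft.fft2(key, axes=(-2, -1))
    # Correlation theorem: corr = IFFT( FFT(q) * conj(FFT(k)) )
    corr = np.fft.ifft2(q_freq * np.conj(k_freq), axes=(-2, -1))
    return corr.real  # inputs are real, so the imaginary part is round-off


def direct_cross_correlation(query: np.ndarray, key: np.ndarray) -> np.ndarray:
    """Reference implementation using explicit circular shifts (slow)."""
    _, h, w = query.shape
    out = np.zeros(query.shape, dtype=float)
    for dy in range(h):
        for dx in range(w):
            shifted = np.roll(key, shift=(dy, dx), axis=(-2, -1))
            out[:, dy, dx] = (query * shifted).sum(axis=(-2, -1))
    return out


rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((2, 4, 5))  # stand-in for visible features
ir_feat = rng.standard_normal((2, 4, 5))   # stand-in for infrared features

fast = frequency_cross_correlation(rgb_feat, ir_feat)
slow = direct_cross_correlation(rgb_feat, ir_feat)
print(np.allclose(fast, slow))
```

The FFT route costs O(HW log HW) per channel, whereas the explicit shift loop costs O((HW)^2), which is the usual motivation for moving correlation-style operations into the frequency domain.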

Abstract

Visible-infrared object detection has gained considerable attention due to its detection performance in low-light, fog, and rain conditions. However, the visible and infrared modalities, captured by different sensors, suffer from an information imbalance problem in complex scenarios, which can cause inadequate cross-modal fusion and degraded detection performance. Furthermore, most existing methods use transformers in the spatial domain to capture complementary features, ignoring the advantages of frequency domain transformers for mining complementary information. To address these weaknesses, we propose a frequency domain fusion transformer, called FreDFT, for visible-infrared object detection. The proposed approach employs a novel multimodal frequency domain attention (MFDA) to mine complementary information between modalities, and a frequency domain feed-forward layer (FDFFL) based on a mixed-scale frequency feature fusion strategy is designed to better enhance multimodal features. To eliminate the imbalance of multimodal information, a cross-modal global modeling module (CGMM) is constructed to perform pixel-wise inter-modal feature interaction in both a spatial and a channel-wise manner. Moreover, a local feature enhancement module (LFEM) is developed to strengthen multimodal local feature representation and promote multimodal feature fusion by using various convolution layers and applying a channel shuffle. Extensive experimental results verify that the proposed FreDFT achieves excellent performance on multiple public datasets compared with other state-of-the-art methods. The code for FreDFT is available at https://github.com/WenCongWu/FreDFT.
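
The channel shuffle mentioned for the LFEM is a standard operation that can be sketched in a few lines. The abstract states only that the LFEM uses convolution layers and a channel shuffle; the example below assumes, purely for illustration, that visible and infrared channels are concatenated and shuffled with two groups so that channels from the two modalities interleave. The function name and tensor layout are hypothetical.

```python
import numpy as np


def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Interleave the channels of a (C, H, W) feature map across `groups`.

    Generic channel shuffle; how FreDFT applies it inside the LFEM is not
    specified in the abstract, so treat this as an assumption.
    """
    c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    # (C, H, W) -> (groups, C // groups, H, W) -> swap group axes -> flatten
    x = x.reshape(groups, c // groups, h, w)
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)


# Channels 0-2 stand in for visible features, channels 3-5 for infrared ones.
stacked = np.arange(6, dtype=float).reshape(6, 1, 1)
shuffled = channel_shuffle(stacked, groups=2)
print(shuffled[:, 0, 0].tolist())  # [0.0, 3.0, 1.0, 4.0, 2.0, 5.0]
```

With two groups, each visible channel ends up next to an infrared channel, so a subsequent grouped or local convolution sees both modalities.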

Paper Structure

This paper contains 24 sections, 5 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The architecture of the proposed RGB-IR object detection framework. FreDFT denotes our proposed frequency domain fusion transformer, which effectively merges multimodal features from the dual backbone network; the fused features are fed into the neck and detection head to generate the prediction results.
  • Figure 2: The structure of the proposed frequency domain fusion transformer (FreDFT). The LFEM, CGMM, and FDFAM denote the local feature enhancement module, cross-modal global modeling module, and frequency domain feature aggregation module, respectively. $E_{RGB}^L$ and $E_{IR}^L$ are the outputs of the LFEM, and $E_{RGB}^G$ and $E_{IR}^G$ are the outputs of the CGMM. $X_f$ is the fused feature.
  • Figure 3: The structure of the designed local feature enhancement module (LFEM).
  • Figure 4: The structure of the designed cross-modal global modeling module (CGMM).
  • Figure 5: The structure of the designed frequency domain feature aggregation module (FDFAM). FFT, IFFT, and FDFFL represent the fast Fourier transform, inverse fast Fourier transform, and frequency domain feed-forward layer, respectively.
  • ...and 5 more figures