$\mathbf{C}^2$Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection

Maoxun Yuan, Xingxing Wei

TL;DR

The paper tackles two problems in RGB-IR object detection, modality miscalibration and fusion imprecision, by introducing $\mathrm{C}^2$Former, a transformer-based module that combines Inter-modality Cross-Attention (ICA) with Adaptive Feature Sampling (AFS) to calibrate and complement cross-modal features. By plugging $\mathrm{C}^2$Former into both a single-stage and a two-stage detector, the authors report consistent gains on the DroneVehicle (aerial vehicle) and KAIST (multispectral pedestrian) datasets, along with state-of-the-art multispectral fusion performance. Key contributions include calibrating cross-modal features through cross-attention, reducing the computational cost of global attention with adaptive sampling, and validating the design through ablations, visualizations, and comparisons with prior multispectral detectors. The work targets robust, all-day RGB-IR detection with practical applicability in surveillance and autonomous systems, and the code is publicly released for reproducibility.

Abstract

Object detection on visible (RGB) and infrared (IR) images, as an emerging solution for robust around-the-clock detection, has received extensive attention in recent years. By combining RGB and IR information, object detectors have become more reliable and robust in practical applications. However, existing methods still suffer from modality miscalibration and fusion imprecision. Because transformers are well suited to modeling pairwise correlations between different features, in this paper we propose a novel Calibrated and Complementary Transformer, called $\mathrm{C}^2$Former, to address these two problems simultaneously. In $\mathrm{C}^2$Former, we design an Inter-modality Cross-Attention (ICA) module to obtain calibrated and complementary features by learning the cross-attention relationship between the RGB and IR modalities. To reduce the computational cost of computing global attention in ICA, an Adaptive Feature Sampling (AFS) module is introduced to decrease the dimension of the feature maps. Because $\mathrm{C}^2$Former operates in the feature domain, it can be embedded into existing RGB-IR object detectors via the backbone network. Thus, one single-stage and one two-stage object detector, both incorporating our $\mathrm{C}^2$Former, are constructed to evaluate its effectiveness and versatility. With extensive experiments on the DroneVehicle and KAIST RGB-IR datasets, we verify that our method can fully utilize the RGB-IR complementary information and achieve robust detection results. The code is available at https://github.com/yuanmaoxun/Calibrated-and-Complementary-Transformer-for-RGB-Infrared-Object-Detection.git.
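
To make the two ideas in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes plain scaled dot-product cross-attention without learned projections, and it uses fixed average pooling as a stand-in for AFS (the paper's sampling is adaptive). The names `avg_pool` and `cross_attention` and all shapes are hypothetical choices for illustration.

```python
import numpy as np

def avg_pool(feat, stride):
    """Downsample a (C, H, W) map by non-overlapping average pooling.
    Assumption: a fixed stand-in for the adaptive sampling done by AFS."""
    c, h, w = feat.shape
    assert h % stride == 0 and w % stride == 0
    return feat.reshape(c, h // stride, stride, w // stride, stride).mean(axis=(2, 4))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feat, context_feat):
    """Every position of query_feat attends over all positions of context_feat.
    Both inputs have shape (C, H, W). Learned Q/K/V projections are omitted."""
    c, h, w = query_feat.shape
    q = query_feat.reshape(c, -1).T          # (HW, C) queries from one modality
    k = context_feat.reshape(c, -1).T        # (HW, C) keys from the other modality
    v = k                                    # values share the key features here
    attn = softmax(q @ k.T / np.sqrt(c))     # (HW, HW) global attention map
    out = (attn @ v).T.reshape(c, h, w)      # context features re-gathered per query position
    return out, attn

rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 16, 16))       # toy RGB feature map
ir = rng.standard_normal((8, 16, 16))        # toy IR feature map

# Sample first, then attend: RGB queries gather complementary IR features.
rgb_s, ir_s = avg_pool(rgb, 2), avg_pool(ir, 2)
ir_for_rgb, attn = cross_attention(rgb_s, ir_s)

full_pairs = (16 * 16) ** 2                  # attention entries without sampling
sampled_pairs = attn.size                    # attention entries after 2x sampling
```

Because global attention compares every query position with every key position, halving each spatial side cuts the attention map by a factor of 16 in this toy setting, which is the motivation the abstract gives for placing a sampling step before ICA.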

Paper Structure

This paper contains 18 sections, 11 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: Two major problems of RGB-IR object detection. (a) An example of modality miscalibration between the RGB and infrared modalities. The yellow and red boxes represent annotations of the same objects in the IR images and the RGB images, respectively. (b) An example of fusion imprecision in the output of MBNet (Zhou et al., 2020). The fused features in the red boxes are even worse than the infrared features, which shows the difficulty of feature fusion.
  • Figure 2: An illustrative result of the ICA module. The attention values are well aligned with the reference RGB features and enhance the object regions compared with the original IR features, making them more complementary to the RGB features.
  • Figure 3: The process of the ICA module. $\bigoplus$ and $\bigodot$ indicate the addition and dot product operations, respectively. In Modality Normalization, we first normalize the feature distribution of one modality and then inject the mean and variance predicted by a CNN into the normalized features to transform the feature distribution.
  • Figure 4: The structure of the AFS module, which is used to obtain sampled and coarsely aligned features.
  • Figure 5: The framework of the $\mathrm{C}^2$Former-based detectors. $\mathrm{C}^2$Former consists of AFS and ICA: the input features are first reduced in dimension by the preceding AFS module, and the ICA module then outputs the aligned and fused features. The outputs of $\mathrm{C}^2$Former are added to the opposite ResNet-50 backbone branch. For clarity, the FPN structure is not shown. As with the last-layer operation, we add the features output by each stage of the two modalities to obtain multi-scale features, which are then fed into the FPN.
  • ...and 7 more figures
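
The Modality Normalization step described in the Figure 3 caption (normalize one modality's features, then inject predicted statistics) can be sketched as follows. This is a hedged illustration only: the caption does not specify the predictor network or its input, so the sketch assumes the statistics come from the other modality and replaces the CNN with per-channel statistics of that modality. It also scales by a standard deviation rather than a variance. The function name `modality_normalize` is hypothetical.

```python
import numpy as np

def modality_normalize(feat, pred_mean, pred_std, eps=1e-5):
    """Normalize a (C, H, W) feature map per channel, then re-scale and shift it.
    pred_mean and pred_std must broadcast against feat, e.g. shape (C, 1, 1)."""
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = feat.std(axis=(1, 2), keepdims=True)
    normalized = (feat - mu) / (sigma + eps)   # roughly zero mean, unit std per channel
    return normalized * pred_std + pred_mean   # inject the target statistics

rng = np.random.default_rng(0)
ir = 3.0 + 2.0 * rng.standard_normal((4, 8, 8))     # toy IR features
rgb = -1.0 + 0.5 * rng.standard_normal((4, 8, 8))   # toy RGB features

# Assumption: stand-in for the predictor CNN, using RGB per-channel statistics.
pred_mean = rgb.mean(axis=(1, 2), keepdims=True)
pred_std = rgb.std(axis=(1, 2), keepdims=True)

ir_as_rgb = modality_normalize(ir, pred_mean, pred_std)
```

After this transformation the IR features keep their spatial pattern but carry RGB-like per-channel statistics, which is one plausible reading of "transforming the feature distribution" before the cross-modal dot product.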