Learning A Robust RGB-Thermal Detector for Extreme Modality Imbalance

Chao Tian, Chao Yang, Guoqing Zhu, Qiang Wang, Zhenyu He

TL;DR

The paper tackles extreme modality imbalance in RGB-T object detection by introducing a base-and-auxiliary detector framework guided by a quality-aware modality interaction module and a pseudo degradation training strategy. A consistency loss between the base (EMA-updated) and auxiliary detectors stabilizes learning under degraded samples, enabling robust performance when one modality is missing or corrupted. Empirical results on KAIST and FLIR show substantial robustness improvements and convergence benefits, reducing Miss Rate under challenging conditions and outperforming strong baselines across multiple settings. The approach is extensible to other two-stream RGB-T detectors, offering practical impact for autonomous systems operating under adverse sensing conditions, though future work should address model efficiency for deployment.

Abstract

RGB-Thermal (RGB-T) object detection utilizes thermal infrared (TIR) images to complement RGB data, improving robustness in challenging conditions. Traditional RGB-T detectors assume balanced training data, where both modalities contribute equally. However, in real-world scenarios, modality degradation (due to environmental factors or technical issues) can lead to extreme modality imbalance, causing out-of-distribution (OOD) issues during testing and disrupting model convergence during training. This paper addresses these challenges by proposing a novel base-and-auxiliary detector architecture. We introduce a modality interaction module to adaptively weigh modalities based on their quality and handle imbalanced samples effectively. Additionally, we leverage modality pseudo-degradation to simulate real-world imbalances in training data. The base detector, trained on high-quality pairs, provides a consistency constraint for the auxiliary detector, which receives degraded samples. This framework enhances model robustness, ensuring reliable performance even under severe modality degradation. Experimental results demonstrate the effectiveness of our method in handling extreme modality imbalances (decreasing the Miss Rate by 55%) and improving performance across various baseline detectors.
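The base-and-auxiliary training idea can be illustrated with a minimal sketch. It assumes a mean-teacher style setup in which the base detector's parameters follow the auxiliary detector's through an exponential moving average, and it uses a mean squared error between sigmoid scores as the consistency term. The decay value, the loss form, the weighting factor, and all function names are assumptions made for illustration, not details confirmed by the paper.

```python
import math


def ema_update(base_params, aux_params, decay=0.99):
    """Move each base parameter toward the auxiliary one (EMA).

    Assumption: the base detector tracks the auxiliary detector's weights;
    the decay value is illustrative only.
    """
    return [decay * b + (1.0 - decay) * a for b, a in zip(base_params, aux_params)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def consistency_loss(base_logits, aux_logits):
    """Mean squared difference between the two detectors' sigmoid scores.

    Assumption: MSE is a stand-in; the paper only states that the
    consistency of logits is constrained.
    """
    diffs = [(sigmoid(b) - sigmoid(a)) ** 2 for b, a in zip(base_logits, aux_logits)]
    return sum(diffs) / len(diffs)


def total_loss(detection_loss, base_logits, aux_logits, weight=1.0):
    """Detection loss of the auxiliary detector plus the weighted consistency term."""
    return detection_loss + weight * consistency_loss(base_logits, aux_logits)


if __name__ == "__main__":
    # Base sees the clean pair, auxiliary sees a degraded pair (toy logits).
    base_logits = [2.0, -1.0, 0.5]
    aux_logits = [1.2, -0.8, -0.3]
    print(total_loss(0.7, base_logits, aux_logits))
    print(ema_update([1.0, 2.0], [0.0, 4.0], decay=0.9))
```

In this sketch only the auxiliary detector would receive gradients; the base detector changes solely through `ema_update`, which is one common way to keep a reference model stable while the trained model sees degraded inputs.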

Paper Structure

This paper contains 14 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Illustration of modality degradation. Modality degradation can occur due to electrical failure or other unexpected causes. It causes out-of-distribution issues at test time and disturbs convergence during training.
  • Figure 2: Illustration of our proposed architecture. The green area is the base detector, for which the detailed backbone architecture is shown. The proposed interaction module evaluates the quality of each modality and reweights them before fusion. The auxiliary detector (red) has the same network architecture as the base detector and is updated by supervised training. The balanced original samples are fed to the base detector, while the degraded samples (which may sometimes be the original ones) are fed to the auxiliary detector. During supervised training, in addition to the detection loss of the auxiliary detector, the consistency of logits between the base and auxiliary detectors is constrained. "Stem" denotes the front modules of each stream, and the CSP layer is a standard module in CSPDarknet [yolov4].
  • Figure 3: Qualitative comparison on the KAIST and FLIR benchmarks, including a reproduced vanilla YOLOX-RGBT, CPFM [tian2023cross], and ours under different conditions. All of these detectors predict perfectly when both modalities are available. However, our method outperforms the other detectors when faced with data that is imbalanced on a local or global scale.
  • Figure 4: $\mu$ and $\sigma$ are solved by a linear scaling principle. The scaling maps the range [-2.5, 2.5] to [0, 255], where the integral probability is 0.995.
  • Figure 5: Vanilla modal augmentation disturbs the convergence of the model. Our method effectively improves the robustness of training on the KAIST benchmark.
  • ...and 1 more figure
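The linear scaling described in the Figure 4 caption can be worked through in a few lines. The sketch below solves for $\mu$ and $\sigma$ so that a standard-normal value of -2.5 lands on pixel value 0 and 2.5 lands on 255. Using the resulting Gaussian to draw clipped pixel values for a pseudo-degraded image is an assumption made here for illustration, and the function names are hypothetical.

```python
import random


def solve_mu_sigma(z_low=-2.5, z_high=2.5, pix_low=0.0, pix_high=255.0):
    """Solve pixel = mu + sigma * z so that z_low -> pix_low and z_high -> pix_high."""
    sigma = (pix_high - pix_low) / (z_high - z_low)
    mu = pix_low - sigma * z_low
    return mu, sigma


def sample_noise_pixel(rng, mu, sigma):
    """Draw one Gaussian pixel value and clip it to the valid 8-bit range.

    Assumption: sampling and clipping like this is one plausible use of
    mu and sigma, not a procedure stated in the caption.
    """
    value = rng.gauss(mu, sigma)
    return int(min(255, max(0, round(value))))


if __name__ == "__main__":
    mu, sigma = solve_mu_sigma()
    print((mu, sigma))  # (127.5, 51.0)
    rng = random.Random(0)
    print([sample_noise_pixel(rng, mu, sigma) for _ in range(5)])
```

With these values, samples that fall outside 2.5 standard deviations of the mean are the only ones that need clipping.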