Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks

Chen Zhou, Peng Cheng, Junfeng Fang, Yifan Zhang, Yibo Yan, Xiaojun Jia, Yanyan Xu, Kun Wang, Xiaochun Cao

TL;DR

This work proposes the first fair and reproducible benchmark specifically designed to evaluate training "techniques" for multispectral object detection. It systematically classifies existing methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations.

Abstract

Multispectral object detection, which uses RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires not only effective feature extraction from both modalities and robust fusion strategies, but also the ability to address spectral discrepancies, spatial misalignment, and environmental dependencies between RGB and TIR images. These challenges significantly hinder the generalization of multispectral detection systems across diverse scenarios. Although numerous studies have attempted to overcome these limitations, it remains difficult to separate the performance gains attributable to the multispectral detection systems themselves from the impact of the accompanying "optimization techniques". Worse still, despite the rapid emergence of high-performing single-modality detection models, specialized training techniques that can effectively adapt these models to multispectral detection tasks are still lacking. The absence of a standardized benchmark with fair and consistent experimental setups also poses a significant barrier to evaluating new approaches. To this end, we propose the first fair and reproducible benchmark specifically designed to evaluate training "techniques"; it systematically classifies existing multispectral object detection methods, investigates their sensitivity to hyper-parameters, and standardizes the core configurations. A comprehensive evaluation is conducted across multiple representative multispectral object detection datasets, using various backbone networks and detection frameworks. Additionally, we introduce an efficient and easily deployable multispectral object detection framework that can seamlessly adapt high-performing single-modality models into dual-modality models, integrating our advanced training techniques.

Paper Structure

This paper contains 16 sections, 5 figures, 6 tables.

Figures (5)

  • Figure 2: The performance of general geometric and pixel-level augmentations (using different backbones) on the KAIST, FLIR, and DroneVehicle datasets. The left figure shows the results of various geometric augmentations, where B denotes the baseline, R random rotation, S multi-scale scaling, C random cropping, F random flipping, and T random translation. The right figure shows the results of general pixel-level augmentations, where B denotes the baseline, BL random blurring, NI noise injection, S random sharpening, O random occlusion, and CJ color jittering. $\triangle$ represents the mean performance difference between each method and the baseline.
  • Figure 4: The performance metrics of different augmentation strategies on small sample sets. $\triangle$ represents the mean performance difference between each method and the baseline. In this figure, B denotes the baseline, S Stitcher [chen2021dynamicscaletrainingobject], F Fastmosaic [kumar2020yolov3], R Region Resampling, and M Small-Object Magnification.
  • Figure 5: Comparison of registration results using LoFTR and SuperFusion under different viewpoints and lighting conditions. The first and second rows present the RGB and TIR channel images, respectively. The third and fourth rows showcase the registration outcomes of the LoFTR and SuperFusion methods. Regions with significant registration discrepancies are highlighted.
  • Figure 6: Visualization of intermediate point registration results using the LoFTR method in sparse and dense sample scenarios.
  • Figure 8: Ablation experiment results on the KAIST, FLIR, and DroneVehicle datasets. The experimental configurations strictly adhere to the setups outlined in the "Best Technique Combination".
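
The $\triangle$ reported in Figures 2 and 4 is described as the mean performance difference between a method and the baseline. The captions do not say which settings are averaged over (datasets, backbones, or both), so the following is only a minimal sketch that assumes one score per setting. The function name `mean_delta` and all numbers are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def mean_delta(baseline_scores, method_scores):
    """Return the mean of per-setting differences (method minus baseline).

    Both arguments map a setting name (assumed here to be a dataset)
    to a single scalar score for that setting.
    """
    if baseline_scores.keys() != method_scores.keys():
        raise ValueError("baseline and method must cover the same settings")
    return mean(method_scores[k] - baseline_scores[k] for k in baseline_scores)


# Hypothetical placeholder scores, NOT results from the paper.
baseline = {"KAIST": 50.0, "FLIR": 60.0, "DroneVehicle": 70.0}
with_flip = {"KAIST": 51.0, "FLIR": 62.0, "DroneVehicle": 70.0}

delta = mean_delta(baseline, with_flip)
print(f"{delta:+.2f}")  # prints +1.00
```

The sign convention depends on the metric: for a score where higher is better, a positive value is an improvement, whereas for an error-style metric where lower is better, a negative value is the improvement.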