Rethinking Early-Fusion Strategies for Improved Multispectral Object Detection

Xue Zhang, Si-Yuan Cao, Fang Wang, Runmin Zhang, Zhe Wu, Xiaohan Zhang, Xiaokai Bai, Hui-Liang Shen

TL;DR

This paper revisits the causes of the performance gap between single-branch and two-branch structures, reveals the information interference problem in the naive early-fusion strategy adopted by previous single-branch structures, and proposes corresponding solutions.

Abstract

Most recent multispectral object detectors employ a two-branch structure to extract features from RGB and thermal images. While the two-branch structure achieves better performance than a single-branch structure, it overlooks inference efficiency. This conflict is becoming increasingly acute, as recent works pursue higher performance alone rather than both performance and efficiency. In this paper, we address this issue by improving the performance of efficient single-branch structures. We revisit the causes of the performance gap between these two structures. For the first time, we reveal the information interference problem in the naive early-fusion strategy adopted by previous single-branch structures. We also find that the domain gap between multispectral images and the weak feature representation of the single-branch structure are key obstacles to performance. Focusing on these three problems, we propose corresponding solutions: a novel shape-priority early-fusion strategy, a weakly supervised learning method, and a core knowledge distillation technique. Experiments demonstrate that single-branch networks equipped with these three contributions achieve significant performance enhancements while retaining high efficiency. Our code will be available at https://github.com/XueZ-phd/Efficient-RGB-T-Early-Fusion-Detection.
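To make the term concrete, the sketch below shows one common form of plain early fusion: the RGB and thermal images are stacked along the channel axis so that a single backbone sees one four-channel input. This is an illustrative assumption, not the authors' implementation. The abstract does not specify how the naive strategy combines the two inputs, and the function name `plain_early_fusion` and the nested-list image layout are hypothetical.

```python
from typing import List

# Hypothetical image layout: height x width x channels as nested lists.
Image = List[List[List[float]]]


def plain_early_fusion(rgb: Image, thermal: Image) -> Image:
    """Concatenate RGB (H x W x 3) and thermal (H x W x 1) pixels channel-wise.

    Assumption: the two images are already spatially aligned, so the pixel at
    (row, col) in one image corresponds to the same pixel in the other.
    """
    same_height = len(rgb) == len(thermal)
    same_width = all(len(r) == len(t) for r, t in zip(rgb, thermal))
    if not (same_height and same_width):
        raise ValueError("RGB and thermal images must have the same height and width")
    # Each fused pixel is the RGB channels followed by the thermal channel.
    return [
        [rgb_px + thermal_px for rgb_px, thermal_px in zip(rgb_row, thermal_row)]
        for rgb_row, thermal_row in zip(rgb, thermal)
    ]


# Toy 2 x 3 images with constant values.
rgb = [[[0.1, 0.2, 0.3] for _ in range(3)] for _ in range(2)]
thermal = [[[0.9] for _ in range(3)] for _ in range(2)]
fused = plain_early_fusion(rgb, thermal)
```

A single-branch detector would then run one backbone on `fused`, whereas a two-branch detector runs separate backbones on `rgb` and `thermal` and merges their features later.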

Paper Structure (17 sections, 19 equations, 16 figures, 8 tables)

Figures (16)

  • Figure 1: Multispectral object detection and fusion strategies. (a) In Scene-1, objects are easier to detect in the thermal image. (b) In Scene-2, objects are easier to detect in the RGB image. (c) Early-fusion strategy. (d) Medium-fusion strategy. (e) Late-fusion strategy. (f) Detection results of different strategies on the M3FD dataset [liu2022target]. YOLOv5 [yolov5] is adopted as the baseline in this experiment. The area of each circle denotes the number of parameters.
  • Figure 1: Inference efficiency and detection performance on the M3FD dataset [liu2022target]. The inference time is evaluated on an edge device, the NVIDIA AGX Orin. The best results in the mAP and mAP50 columns are highlighted in bold and marked in red, while the second-best results are underlined and marked in green. All detection results are obtained from three independent runs, and their mean and standard deviation are reported.
  • Figure 2: Overview of our method. We adopt the single-branch structure as the baseline model and develop three key modules: shape-priority early-fusion (ShaPE), weakly supervised auxiliary learning, and core knowledge distillation. The ShaPE module is used in both the training and inference phases, while the other two modules are removed in the inference phase.
  • Figure 3: Pilot studies conducted on the M3FD dataset [liu2022target]. We use three detectors as baselines: RetinaNet [lin2017focal], GFL [gfl], and YOLOv5 [yolov5]. Each bar and its error bar represent the mean and standard deviation of the results obtained by these three detectors. 'RGB' represents detectors that take only RGB images as inputs, while 'T' represents detectors that take only thermal images as inputs. 'PlainRGB-T' denotes detectors that use the plain early-fusion strategy. The 'All' column shows the mAP50 for all classes, and the other columns show the AP50 for specific classes. Red lines mark cases where the plain RGB-T early-fusion strategy obtains worse results than detectors that use single-modality inputs.
  • Figure 4: Illustration of the fused feature map generation process for the plain early-fusion strategy and our ShaPE module. (a) RGB image. (b) Thermal image. (c) Fused feature map generated using the plain early-fusion strategy, with a close-up marked by a white circle. (d) and (e) are gradient images of the RGB and thermal images, respectively. (f) Boosted reference gradient image. (g) and (h) are self-gating masks of the RGB and thermal images, respectively. (i) Fused feature map generated by our ShaPE module.
  • ...and 11 more figures