TFDet: Target-Aware Fusion for RGB-T Pedestrian Detection

Xue Zhang, Xiaohan Zhang, Jiangtao Wang, Jiacheng Ying, Zehua Sheng, Heng Yu, Chunguang Li, Hui-Liang Shen

TL;DR

TFDet targets false positives (FPs) in multispectral RGB-T pedestrian detection with a target-aware fusion framework. Fusion proceeds in two stages: a Feature Fusion Module (FFM) exploits parallel- and cross-channel similarities between the multispectral features, and a Feature Refinement Module (FRM) learns to separate targets from background using a box-level mask and a correlation-maximum loss. The loss supervises a segmentation-based target mask and, at the same time, pushes the fused features to correlate strongly with the predicted mask, which raises target contrast and reduces false positives. TFDet achieves state-of-the-art results on the KAIST and LLVIP pedestrian benchmarks, extends to multi-class detection on FLIR and M3FD, and keeps inference speed comparable to previous approaches, which makes it practical for low-light pedestrian detection in road-safety scenarios.
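
To make the idea of correlating fused features with a target mask concrete, the following is a minimal sketch and not the paper's loss. It assumes a Pearson correlation between a flattened binary mask and each flattened feature channel, and it omits the segmentation term that supervises the mask itself. The names `pearson` and `correlation_loss` are hypothetical and chosen only for illustration.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(u)
    mu_u = sum(u) / n
    mu_v = sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)

def correlation_loss(mask, channels):
    """Assumed form: 1 minus the mean mask-to-channel correlation.

    Minimizing this value maximizes the correlation, so channels that are
    bright on the target and dark on the background score lower.
    """
    corr = [pearson(mask, ch) for ch in channels]
    return 1.0 - sum(corr) / len(corr)

# Toy data: a flattened 1-D "image" whose two middle pixels are the target.
mask = [0, 0, 1, 1, 0, 0]
contrasty = [[0.1, 0.1, 0.9, 0.9, 0.1, 0.1]]  # high response on the target only
noisy = [[0.9, 0.1, 0.5, 0.2, 0.8, 0.1]]      # strong background responses

loss_contrasty = correlation_loss(mask, contrasty)
loss_noisy = correlation_loss(mask, noisy)
```

In this toy case the high-contrast channel receives a lower loss than the noisy one, which is the behavior a correlation-maximizing objective is meant to reward.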

Abstract

Pedestrian detection plays a critical role in computer vision because it contributes to traffic safety. Existing methods that rely solely on RGB images degrade under low-light conditions due to the lack of useful information. To address this issue, recent multispectral detection approaches incorporate thermal images as complementary information and achieve better performance. Nevertheless, few approaches focus on the negative effects of false positives caused by noisy fused feature maps. In contrast, we comprehensively analyze the impact of false positives on detection performance and find that enhancing feature contrast can significantly reduce them. In this paper, we propose a novel target-aware fusion strategy for multispectral pedestrian detection, named TFDet. TFDet achieves state-of-the-art performance on two multispectral pedestrian benchmarks, KAIST and LLVIP. It also extends easily to multi-class object detection, outperforming the previous best approaches on two multispectral object detection benchmarks, FLIR and M3FD. Importantly, TFDet has inference efficiency comparable to previous approaches and detects well even under low-light conditions, a significant advance for road safety.

Paper Structure (20 sections, 15 equations, 14 figures, 8 tables)

Figures (14)

  • Figure 1: Visualization of features and detection results generated by a pair of multispectral images using different fusion strategies. (a) The previous fusion strategy generates noisy features and induces many false positives in the background regions. (b) Our target-aware fusion strategy generates discriminative features and effectively reduces false positives. The green arrows in the left column mark the location of pedestrians. The green boxes and red boxes in the right column indicate the ground-truth and predicted bounding boxes, respectively.
  • Figure 2: A pilot study for analyzing the impact of FPs on detection performance. From right to left, the detector's performance gradually improves as FPs are removed based on their (a) confidence scores and (b) IoU ratios with ground-truth boxes.
  • Figure 3: Comparison of MR (%) on the KAIST dataset. We use Faster R-CNN with VGG-16 as the baseline detector. The best result on each subset is highlighted in bold and marked in red, while the second-best result is underlined and marked in green. The six subsets fall into two groups: (1) Scene - All-Day, Day, and Night; and (2) Distance - Near, Medium, and Far. The Scene subsets are evaluated on the reasonable set, in which pedestrians are unoccluded or partially occluded and taller than 55 pixels. In the Distance subsets, only non-occluded pedestrians are evaluated, with height ranges of [115, +$\infty$), [45, 115), and [1, 45) pixels for Near, Medium, and Far, respectively.
  • Figure 4: Schematic diagram of the feature contrast enhancement foundation. In the mask $\mathbf{m}$, the white region denotes the foreground region, while the dark region denotes the background region. In the backward process, the gradients are only used to update the initial fused feature $\mathbf{F}_x$. Note that in this figure $\mathbf{m}$, $\mathbf{f}^1_x$, and $\mathbf{f}^c_x$ are reshaped to vectors for illustration.
  • Figure 5: Illustration of the TFDet architecture. It consists of three components: backbone, neck, and head. The backbone extracts features from paired multispectral images. The neck fuses these multispectral features using our target-aware fusion strategy. The head generates boxes and corresponding scores based on the fused feature. A more detailed structure of the feature fusion module (FFM) is shown in a separate figure (label fig:ffm). The correlation-maximum loss function is only used during the training phase of our model.
  • ...and 9 more figures
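
The KAIST subset rules stated in the Figure 3 caption can be sketched as two small helpers. This is an illustrative sketch, not the benchmark's official evaluation code: the function names and the occlusion labels `"none"`, `"partial"`, and `"heavy"` are assumptions, and only the pixel thresholds come from the caption.

```python
def distance_subset(height_px):
    """Map a pedestrian box height in pixels to a Distance subset.

    Intervals follow the caption: Near [115, +inf), Medium [45, 115), Far [1, 45).
    """
    if height_px >= 115:
        return "Near"
    if height_px >= 45:
        return "Medium"
    if height_px >= 1:
        return "Far"
    return None  # below the smallest interval given in the caption

def in_reasonable_set(height_px, occlusion):
    """Scene-subset filter: unoccluded or partially occluded, taller than 55 px.

    The occlusion label strings are hypothetical placeholders.
    """
    return occlusion in ("none", "partial") and height_px > 55

# Usage: bucket a few example heights.
subsets = [distance_subset(h) for h in (200, 115, 80, 45, 30)]
```

Note that the two groups use different filters: the Distance subsets accept only non-occluded pedestrians, so a real evaluation script would apply that occlusion check before calling `distance_subset`.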