RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision

Jinzhong Wang, Xuetao Tian, Shun Dai, Tao Zhuo, Haorui Zeng, Hongjuan Liu, Jiaqi Liu, Xiuwei Zhang, Yanning Zhang

TL;DR

This work targets robust RGB-T object detection under diverse conditions by introducing a lightweight Group Shuffled Multi-receptive Attention (GSMA) module for efficient multi-scale feature fusion and a Multi-modal Supervision (MS) strategy that supervises RGB, thermal, and fusion branches with per-modal annotations. Integrated into a YOLOv5-based framework named SAMS-YOLO, GSMA leverages multi-scale receptive fields and a parameter-free group shuffle to fuse RGB-T features, while MS provides more accurate supervision and auxiliary segmentation during training. The approach achieves state-of-the-art results on KAIST and DroneVehicle with competitive inference speed, demonstrating improved detection of small, night-time, and occluded objects. Collectively, the method advances practical RGB-T detection by balancing fusion quality and efficiency, with strong implications for autonomous driving and surveillance systems.

Abstract

Multispectral object detection, which uses both visible (RGB) and thermal infrared (T) modalities, has attracted significant attention for its robust performance across diverse weather and lighting conditions. However, effectively exploiting the complementarity between the RGB and thermal modalities while maintaining efficiency remains a critical challenge. In this paper, a simple Group Shuffled Multi-receptive Attention (GSMA) module is proposed to extract and combine multi-scale RGB and thermal features. The extracted multi-modal features are then directly integrated with a multi-level path aggregation neck, which significantly improves both fusion quality and efficiency. Meanwhile, multi-modal object detection often adopts union annotations for both modalities. This kind of supervision is insufficient and unfair, since objects observed in one modality may not be visible in the other. To solve this issue, Multi-modal Supervision (MS) is proposed to supervise RGB-T object detection more fully. Comprehensive experiments on two challenging benchmarks, KAIST and DroneVehicle, demonstrate that the proposed model achieves state-of-the-art accuracy while maintaining competitive efficiency.
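
To make the annotation issue concrete, here is a minimal sketch of how union labels differ from per-modality labels. The object names, box coordinates, and helper functions are hypothetical. Assigning the union labels to the fusion branch and keeping the RGB box when both modalities annotate the same object are assumptions made for illustration, not details confirmed by the paper.

```python
# Hypothetical toy annotations: object id -> bounding box (x1, y1, x2, y2).
# person_b is visible only in the thermal image (e.g. a night-time scene).
rgb_annotations = {"person_a": (10, 20, 40, 90)}
thermal_annotations = {
    "person_a": (12, 21, 42, 91),
    "person_b": (100, 30, 125, 95),
}


def union_annotations(rgb, thermal):
    """Return every object seen in either modality.

    Assumption: when both modalities annotate an object, the RGB box is kept.
    """
    merged = dict(thermal)
    merged.update(rgb)
    return merged


def branch_targets(rgb, thermal):
    """Give each branch its own label set instead of the union for all."""
    return {
        "rgb": rgb,
        "thermal": thermal,
        "fusion": union_annotations(rgb, thermal),  # assumed mapping
    }


targets = branch_targets(rgb_annotations, thermal_annotations)

# Objects that a union-only scheme would force the RGB branch to detect
# even though they are not visible in the RGB image.
unseen_by_rgb = sorted(set(targets["fusion"]) - set(targets["rgb"]))
print(unseen_by_rgb)  # ['person_b']
```

Under union-only supervision, the RGB branch would be penalized for missing `person_b`, which it cannot see. Per-modality targets avoid that penalty.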

Paper Structure

This paper contains 16 sections, 3 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Architecture of the proposed SAMS-YOLO. The multi-modal supervision strategy is applied to the RGB, thermal, and fusion branches. During training, the RGB, thermal, and union annotations are used as supervision to calculate detection loss. During inference, a decision-level fusion is applied to fuse the RGB, thermal, and fusion branch results.
  • Figure 2: The structure of Group Shuffled Multi-receptive Attention module. (a) shows the data flow structure of the GSMA. (b) shows the SPC structure in (a).
  • Figure 3: The structure of the Group Shuffle.
  • Figure 4: Illustration of the modal misalignment problem in the DroneVehicle training set. (a) and (b) depict the original and aligned RGB-T image pairs, where the top images are thermal images and the bottom images are visible images. Yellow boxes indicate annotations on the thermal images, while red boxes indicate annotations on the visible images. Both modal annotations are visualized on the visible image.
  • Figure 5: Impacts of RGB-T feature concatenation and group shuffle. The top and third rows depict the feature map values along the channel dimension, while the second and fourth rows display RGB-T images and corresponding heatmaps. The top two rows showcase day-time scenes, whereas the bottom two rows depict night-time scenes. Notably, the response of RGB features diminishes during night-time. As observed in the feature maps and heatmaps in the bottom right corner, compared to simple concatenation, the group shuffle operation achieves more comprehensive multi-modal feature mixing. Through the GSMA module and multi-path aggregation fusion, the network exhibits heightened attention towards pedestrian areas.
  • ...and 1 more figure
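
The contrast between simple concatenation and group shuffle in Figure 5 can be illustrated with a small sketch. It assumes the group shuffle behaves like a ShuffleNet-style channel shuffle (reshape into groups, transpose, flatten). The exact grouping used in GSMA is not given in this summary, so the function name, channel labels, and group count below are hypothetical.

```python
def group_shuffle(channels, groups):
    """Interleave channels drawn from `groups` equal-sized contiguous groups.

    Equivalent to viewing the channel axis as (groups, per_group),
    transposing to (per_group, groups), and flattening. The operation
    only reorders channels, so it has no learnable parameters.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


# Simple concatenation: 4 RGB channels followed by 4 thermal channels.
concatenated = ["R0", "R1", "R2", "R3", "T0", "T1", "T2", "T3"]

shuffled = group_shuffle(concatenated, groups=2)
print(shuffled)  # ['R0', 'T0', 'R1', 'T1', 'R2', 'T2', 'R3', 'T3']
```

After the shuffle, any contiguous block of channels contains both RGB and thermal features, so a subsequent grouped operation sees both modalities rather than only one.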