RGB-T Object Detection via Group Shuffled Multi-receptive Attention and Multi-modal Supervision
Jinzhong Wang, Xuetao Tian, Shun Dai, Tao Zhuo, Haorui Zeng, Hongjuan Liu, Jiaqi Liu, Xiuwei Zhang, Yanning Zhang
TL;DR
This work targets robust RGB-T object detection under diverse conditions by introducing a lightweight Group Shuffled Multi-receptive Attention (GSMA) module for efficient multi-scale feature fusion and a Multi-modal Supervision (MS) strategy that supervises RGB, thermal, and fusion branches with per-modal annotations. Integrated into a YOLOv5-based framework named SAMS-YOLO, GSMA leverages multi-scale receptive fields and a parameter-free group shuffle to fuse RGB-T features, while MS provides more accurate supervision and auxiliary segmentation during training. The approach achieves state-of-the-art results on KAIST and DroneVehicle with competitive inference speed, demonstrating improved detection of small, night-time, and occluded objects. Collectively, the method advances practical RGB-T detection by balancing fusion quality and efficiency, with strong implications for autonomous driving and surveillance systems.
Abstract
Multispectral object detection, which uses both visible (RGB) and thermal infrared (T) modalities, has attracted significant attention for its robust performance across diverse weather and lighting conditions. However, effectively exploiting the complementarity between the RGB and thermal modalities while maintaining efficiency remains a critical challenge. In this paper, a simple Group Shuffled Multi-receptive Attention (GSMA) module is proposed to extract and combine multi-scale RGB and thermal features. The extracted multi-modal features are then directly integrated with a multi-level path aggregation neck, which significantly improves both the fusion effect and efficiency. Meanwhile, multi-modal object detection often adopts union annotations shared by both modalities. This kind of supervision is insufficient and unfair, since objects observed in one modality may not be visible in the other. To solve this issue, Multi-modal Supervision (MS) is proposed to sufficiently supervise RGB-T object detection. Comprehensive experiments on two challenging benchmarks, KAIST and DroneVehicle, demonstrate that the proposed model achieves state-of-the-art accuracy while maintaining competitive efficiency.
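The abstract does not spell out how the group shuffle in GSMA is implemented. As a minimal sketch, assuming it resembles the standard parameter-free channel shuffle (reshape into groups, transpose, flatten) applied to concatenated RGB and thermal channels, the reordering can be illustrated on channel labels instead of tensors. The function name `group_shuffle` and the labels are hypothetical and serve only as illustration.

```python
def group_shuffle(channels, groups):
    """Interleave channels across groups without any learnable parameters.

    Assumption: this mirrors the common channel-shuffle operation
    (reshape to (groups, n), transpose, flatten); the paper's exact
    GSMA shuffle may differ.
    """
    if groups <= 0 or len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # The channel at position i of group g lands at output index i * groups + g.
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


# Hypothetical layout: RGB channels first, thermal channels second.
rgb = ["r0", "r1", "r2"]
thermal = ["t0", "t1", "t2"]
mixed = group_shuffle(rgb + thermal, groups=2)
print(mixed)
```

After the shuffle, every neighboring pair of channels contains one RGB and one thermal channel, so a subsequent grouped operation sees both modalities without the shuffle itself adding any weights.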
