Heterogeneous Graph Transformer for Multiple Tiny Object Tracking in RGB-T Videos
Qingyu Xu, Longguang Wang, Weidong Sheng, Yingqian Wang, Chao Xiao, Chao Ma, Wei An
TL;DR
The paper tackles robust multi-object tracking of tiny objects in RGB-T videos by formulating joint detection and tracking with a Heterogeneous Graph Transformer (HGT). It introduces HGT-Track, which embeds modality-specific features, constructs a sparse heterogeneous graph across detections and past tracklets, and uses an HGT encoder/decoder to fuse spatial-temporal cues from the visible and thermal modalities; a cross-modal re-detection module (ReDet) further stabilizes trajectories. A new benchmark, VT-Tiny-MOT, is proposed to evaluate RGB-T tiny-object MOT, with 115 paired sequences and 1.2M annotations across seven categories, emphasizing small targets, occlusions, and modality mismatch. On this benchmark, the method outperforms other state-of-the-art trackers in MOTA and IDF1, supporting the effectiveness of cross-modal graph fusion and re-detection for persistent tracking under challenging conditions. The authors state that the code and dataset will be released for further research.
Abstract
Tracking multiple tiny objects is highly challenging due to their weak appearance and limited features. Existing multi-object tracking algorithms generally focus on single-modality scenes and overlook the complementary characteristics of tiny objects captured by multiple remote sensors. To enhance tracking performance by integrating complementary information from multiple sources, we propose a novel framework called HGT-Track (Heterogeneous Graph Transformer based Multi-Tiny-Object Tracking). Specifically, we first employ a Transformer-based encoder to embed images from different modalities. Subsequently, we use a Heterogeneous Graph Transformer to aggregate spatial and temporal information from multiple modalities and generate detection and tracking features. Additionally, we introduce a target re-detection module (ReDet) that ensures tracklet continuity by maintaining consistency across modalities. Furthermore, this paper introduces VT-Tiny-MOT (Visible-Thermal Tiny Multi-Object Tracking), the first benchmark for RGB-T fused multiple tiny object tracking. Extensive experiments on VT-Tiny-MOT demonstrate the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of MOTA (Multiple-Object Tracking Accuracy) and ID-F1 score. The code and dataset will be made available at https://github.com/xuqingyu26/HGTMT.
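For readers unfamiliar with the two reported metrics, the sketch below computes MOTA and ID-F1 from aggregate error counts. It follows the widely used CLEAR MOT and identity-metric definitions rather than anything specific to this paper; the function names and the example counts are hypothetical, and the counts are assumed to be already summed over all frames of a sequence.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames.

    Standard CLEAR MOT definition; can be negative when errors exceed GT.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """ID-F1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN).

    IDTP, IDFP, IDFN are identity-level true positives, false positives,
    and false negatives under the best one-to-one trajectory matching.
    """
    denominator = 2 * idtp + idfp + idfn
    if denominator == 0:
        return 0.0  # convention chosen here for the empty case (assumption)
    return 2 * idtp / denominator


# Hypothetical counts, for illustration only (not results from the paper).
example_mota = mota(false_negatives=10, false_positives=5,
                    id_switches=5, num_gt=100)
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
```

MOTA penalizes detection errors and identity switches together, whereas ID-F1 measures how consistently each ground-truth trajectory keeps a single predicted identity, which is why the two are typically reported side by side.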
