Heterogeneous Graph Transformer for Multiple Tiny Object Tracking in RGB-T Videos

Qingyu Xu, Longguang Wang, Weidong Sheng, Yingqian Wang, Chao Xiao, Chao Ma, Wei An

TL;DR

The paper tackles robust multi-target tracking of tiny objects in RGB-T videos by formulating joint detection and tracking with a Heterogeneous Graph Transformer (HGT). It introduces HGT-Track, which embeds modality-specific features, constructs a sparse heterogeneous graph across detections and past tracklets, and uses an HGT encoder/decoder to fuse spatial-temporal cues from visible and thermal modalities; a ReDet-based cross-modal re-detection module further stabilizes trajectories. A new VT-Tiny-MOT benchmark is proposed to evaluate RGB-T tiny MOT, with 115 paired sequences and 1.2M annotations across seven categories, emphasizing small targets, occlusions, and modality mismatch. The results demonstrate state-of-the-art performance in MOTA and IDF1, validating the effectiveness of cross-modal graph fusion and re-detection for persistent tracking under challenging conditions, and the work provides a comprehensive dataset and code for further research.

Abstract

Tracking multiple tiny objects is highly challenging due to their weak appearance and limited features. Existing multi-object tracking algorithms generally focus on single-modality scenes and overlook the complementary characteristics of tiny objects captured by multiple remote sensors. To enhance tracking performance by integrating complementary information from multiple sources, we propose a novel framework called HGT-Track (Heterogeneous Graph Transformer based Multi-Tiny-Object Tracking). Specifically, we first employ a Transformer-based encoder to embed images from different modalities. Subsequently, we utilize a Heterogeneous Graph Transformer to aggregate spatial and temporal information from multiple modalities to generate detection and tracking features. Additionally, we introduce a target re-detection module (ReDet) to ensure tracklet continuity by maintaining consistency across different modalities. Furthermore, this paper introduces the first benchmark, VT-Tiny-MOT (Visible-Thermal Tiny Multi-Object Tracking), for RGB-T fused multiple tiny object tracking. Extensive experiments are conducted on VT-Tiny-MOT, and the results demonstrate the effectiveness of our method. Compared to other state-of-the-art methods, our method achieves better performance in terms of MOTA (Multiple-Object Tracking Accuracy) and ID-F1 score. The code and dataset will be made available at https://github.com/xuqingyu26/HGTMT.
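For reference, the two metrics named in the abstract have standard definitions in the MOT literature: MOTA is one minus the ratio of accumulated errors (false negatives, false positives, identity switches) to the number of ground-truth objects, and ID-F1 is the F1 score over identity-matched detections. The snippet below is a minimal sketch of those definitions only; the function names and the example counts are hypothetical and are not taken from the paper or its code.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Standard CLEAR-MOT accuracy: 1 - (FN + FP + IDSW) / GT.

    All counts are summed over every frame of the evaluated sequences.
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Standard identity F1: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


# Hypothetical counts, chosen only to illustrate the arithmetic.
example_mota = mota(false_negatives=10, false_positives=5,
                    id_switches=5, num_gt=100)
example_idf1 = idf1(idtp=80, idfp=10, idfn=30)
```

Note that MOTA can be negative when the number of errors exceeds the number of ground-truth objects, whereas ID-F1 always lies in [0, 1].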

Paper Structure

This paper contains 40 sections, 17 equations, 8 figures, 6 tables, 1 algorithm.

Figures (8)

  • Figure 1: Examples of the 7 kinds of targets captured under 8 main scenarios in the VT-Tiny-MOT dataset, along with the annotated MOT challenges (the list of challenge attributes is reported in Table \ref{lay}). The challenges involved are MM (MisMatch), LI (Low Illumination), ETO (Extremely Tiny Object), CM (Camera Motion), OCC (OCClusion), TC (Thermal Crossover) and FM (Fast Move).
  • Figure 2: The overall framework of the Heterogeneous Graph Transformer based tracking method, which consists of four parts. (a) Given paired visible and thermal images at times $k-1$ and $k$, we generate modality-specific features through the embedding layer. Then, we build a heterogeneous graph $G^a$ that accounts for the target differences between the two modalities. (b) Next, we use the modified HGT module for information integration, where the encoder outputs the detection feature and the decoder outputs the tracking feature. (c) A linear regression layer generates the tracking offset, while target detections are obtained by applying a top-k post-process to the detection feature. (d) Finally, tracklets are generated and further refined through cross-modal detection matching and the ReDet module.
  • Figure 3: The structure of the Heterogeneous Graph Transformer encoder is illustrated in (a), and the details of the Heterogeneous Graph Transformer (HGT) are shown in (b). Multi-source information is aggregated to detection queries through HGT by setting the detection queries as target nodes and other types of nodes as source nodes. Then, the information is gradually integrated into $\tilde{D}$ from multiple stages of detection queries through the aggregation module.
  • Figure 4: Illustration of the heterogeneous transformer decoder for the generation of the tracking features $\tilde{T}_k^v$ and $\tilde{T}_k^t$. Given the input of four types of nodes, namely ${D_k^{v/t}}$ and ${T_{k-1}^{v/t}}$, we set ${T_{k-1}^{v/t}}$ as the target nodes. The Heterogeneous Graph Transformer (HGT) integrates information from ${D_k^{v/t}}$ to generate ${\tilde{T}_{k-1}^{v/t}}$. Subsequently, the tracking features ${\tilde{T}_{k}^{v/t}}$ are generated using deformable attention strengthening.
  • Figure 5: An example of the search region used to re-detect a lost target. If the target is lost by the visible camera but can still be detected by the thermal camera, we initiate a re-detection procedure within a search region $SR$, shown in gray, at time $k$, as depicted in (a). This search region encompasses the union of the target's bounding box at time $k-1$ in the visible camera and its bounding box at time $k$ in the thermal camera. In (b), we present the heatmap generated using ReDet to locate the lost target. Comparing it with the original HGT detection heatmap shown in (c), we can observe that even in a close-proximity scenario, another target can be accurately detected in the single tracking heatmap.
  • ...and 3 more figures
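The search region $SR$ described in the Figure 5 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples in a shared pixel coordinate frame for both modalities, the union is approximated by its smallest enclosing axis-aligned rectangle, and the function name `search_region` and the optional clipping to the image size are hypothetical.

```python
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def search_region(prev_visible_box: Box,
                  cur_thermal_box: Box,
                  image_size: Optional[Tuple[float, float]] = None) -> Box:
    """Enclosing rectangle of the visible box at k-1 and the thermal box at k.

    Assumption: the two boxes are already expressed in the same coordinate
    frame; any registration between the modalities is out of scope here.
    """
    x1 = min(prev_visible_box[0], cur_thermal_box[0])
    y1 = min(prev_visible_box[1], cur_thermal_box[1])
    x2 = max(prev_visible_box[2], cur_thermal_box[2])
    y2 = max(prev_visible_box[3], cur_thermal_box[3])

    if image_size is not None:
        # Optionally clip the region to the image bounds (width, height).
        width, height = image_size
        x1, y1 = max(0.0, x1), max(0.0, y1)
        x2, y2 = min(width, x2), min(height, y2)

    return (x1, y1, x2, y2)


# Hypothetical boxes: target last seen in the visible frame at k-1,
# still detected in the thermal frame at k.
sr = search_region((10, 10, 20, 20), (15, 12, 30, 25))
```

Restricting re-detection to this region keeps the search local to where the target was last seen in one modality and where it is currently seen in the other, which is the behavior the caption describes.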