Visible-Thermal Multiple Object Tracking: Large-scale Video Dataset and Progressive Fusion Approach

Yabin Zhu, Qianwu Wang, Chenglong Li, Jin Tang, Zhixiang Huang

TL;DR

This paper introduces VT-MOT, a large-scale visible-thermal video benchmark for multi-object tracking with 582 sequence pairs, 401k frame pairs, and 3.99 million dense annotations, collected across UAV, surveillance, and handheld platforms. It provides frame-by-frame spatio-temporal alignment to enable robust cross-modal fusion and proposes PFTrack, a progressive fusion tracking framework that combines temporal cross-attention with multimodal cross-attention to exploit complementary visible and thermal cues. Extensive experiments demonstrate PFTrack’s superiority over state-of-the-art MOT methods on VT-MOT, highlighting notable gains in MOTA, IDF1, and HOTA, and validate the benefits of both temporal and modality fusion through ablations. The dataset and method collectively advance visible-thermal MOT, offering practical impact for robust tracking in low-light, adverse weather, and long-range scenarios, while outlining future directions for efficiency, large-model integration, and cross-modal alignment research.

Abstract

The complementary benefits of visible and thermal infrared data are widely utilized in various computer vision tasks, such as visual tracking, semantic segmentation, and object detection, but are rarely explored in Multiple Object Tracking (MOT). In this work, we contribute a large-scale Visible-Thermal video benchmark for MOT, called VT-MOT. VT-MOT has the following main advantages. 1) The data is large-scale and highly diverse. VT-MOT includes 582 video sequence pairs and 401k frame pairs from surveillance, drone, and handheld platforms. 2) The cross-modal alignment is highly accurate. We invite several professionals to perform both spatial and temporal alignment frame by frame. 3) The annotation is dense and high-quality. VT-MOT has 3.99 million annotation boxes, annotated and double-checked by professionals, covering challenges such as heavy occlusion and object re-acquisition (objects disappear and reappear). To provide a strong baseline, we design a simple yet effective tracking framework that progressively fuses temporal information and the complementary information of the two modalities for robust visible-thermal MOT. Comprehensive experiments are conducted on VT-MOT, and the results demonstrate the superiority and effectiveness of the proposed method compared with state-of-the-art methods. From the evaluation results and analysis, we specify several potential future directions for visible-thermal MOT. The project is released at https://github.com/wqw123wqw/PFTrack.
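
The summary describes the baseline as a progressive fusion: temporal cross-attention first, then multimodal cross-attention between the visible and thermal streams. The sketch below only illustrates that ordering and is not the authors' PFTrack implementation. It assumes single-head scaled dot-product attention with no learned projections, a residual connection, and a final average of the two streams; the function names `cross_attention` and `progressive_fusion` and the toy feature vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention with a residual add.

    Assumption: no learned projections; `context` serves as both keys and values.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, context)) for j in range(dim)]
        outputs.append([a + b for a, b in zip(q, attended)])
    return outputs


def progressive_fusion(vis_prev, vis_cur, th_prev, th_cur):
    """Hypothetical two-stage fusion: temporal first, then cross-modal."""
    # Stage 1: each modality's current frame attends to its previous frame.
    vis_temporal = cross_attention(vis_cur, vis_prev)
    th_temporal = cross_attention(th_cur, th_prev)
    # Stage 2: each modality attends to the other modality's temporal features.
    vis_fused = cross_attention(vis_temporal, th_temporal)
    th_fused = cross_attention(th_temporal, vis_temporal)
    # Assumption: merge the two streams by simple averaging.
    return [[(a + b) / 2 for a, b in zip(v, t)] for v, t in zip(vis_fused, th_fused)]


# Toy features: 3 tokens of dimension 2 per frame and modality.
vis_prev = [[0.1, 0.2], [0.3, 0.1], [0.0, 0.5]]
vis_cur = [[0.2, 0.2], [0.4, 0.0], [0.1, 0.4]]
th_prev = [[0.5, 0.1], [0.2, 0.2], [0.3, 0.3]]
th_cur = [[0.4, 0.1], [0.1, 0.3], [0.2, 0.2]]
fused = progressive_fusion(vis_prev, vis_cur, th_prev, th_cur)
```

Here `fused` holds one vector per token of the current frame, with the same dimension as the inputs. A real tracker would use learned query, key, and value projections and feed the fused features to a detection and association head.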

Paper Structure (24 sections, 3 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Comparison of our dataset with mainstream multiple object tracking datasets in terms of the number of frames and annotated bounding boxes. The data volume units for frames and annotated bounding boxes are 1k and 10k, respectively. Here, BDD100K refers to the MOT subset of BDD100K.
  • Figure 2: Some sample frames in VT-MOT.
  • Figure 3: The number and percentage of IDs and boxes for each category in the entire VT-MOT dataset.
  • Figure 4: Registration samples.
  • Figure 5: The scale distribution of bounding boxes in our dataset. The horizontal coordinate represents the square root of the area of the bounding box. The vertical coordinate indicates the number of boxes in each scale sub-interval.
  • ...and 3 more figures
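
The scale statistic in the Figure 5 caption (square root of the box area, counted per scale sub-interval) can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the `(x, y, width, height)` box format, the bin width of 20 pixels, and the function name `scale_histogram` are all assumptions.

```python
import math
from collections import Counter


def scale_histogram(boxes, bin_width=20):
    """Count boxes per scale bin, where scale = sqrt(width * height).

    Each key is the lower edge of a bin of size `bin_width` (assumed value).
    """
    counts = Counter()
    for _x, _y, width, height in boxes:
        scale = math.sqrt(width * height)
        bin_start = int(scale // bin_width) * bin_width
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical boxes in (x, y, width, height) pixel format.
example_boxes = [(0, 0, 10, 10), (5, 5, 30, 30), (2, 2, 16, 25), (0, 0, 100, 64)]
histogram = scale_histogram(example_boxes)
```

For the four example boxes the scales are 10, 30, 20, and 80, so `histogram` maps bin 0 to one box, bin 20 to two boxes, and bin 80 to one box.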