RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking

Yanqiu Yu, Zhifan Jin, Sijia Chen, Tongfei Chu, En Yu, Liman Liu, Wenbing Tao

TL;DR

This paper introduces RT-RMOT, a new RGB-Thermal RMOT task that fuses RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking, and proposes RTrack, a framework built upon a multimodal large language model that integrates RGB, thermal, and textual features.

Abstract

Referring Multi-Object Tracking (RMOT) has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions, such as nighttime, smoke, and other challenging scenarios. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset in the RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model's potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design a Structured Output Reward and a Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.
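
The abstract does not spell out how CAS is computed. As a rough illustration only, the Python sketch below assumes a group-relative advantage (each reward minus the group mean, divided by the group standard deviation) that is then clipped to a fixed range, so a single outlier reward cannot produce an arbitrarily large update. The function name `clipped_group_advantages`, the `clip` bound of 2.0, and the normalization itself are assumptions made for this example, not the paper's formulation.

```python
from statistics import mean, pstdev


def clipped_group_advantages(rewards, clip=2.0, eps=1e-6):
    """Group-normalize rewards, then clip each advantage to [-clip, clip].

    Hypothetical sketch: `clip` and `eps` are illustrative values,
    not settings taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    advantages = []
    for r in rewards:
        a = (r - mu) / (sigma + eps)  # relative reward within the group
        advantages.append(max(-clip, min(clip, a)))  # bound the magnitude
    return advantages


if __name__ == "__main__":
    # One response in the group receives a much larger reward than the rest.
    group_rewards = [0.0] * 9 + [10.0]
    print(clipped_group_advantages(group_rewards))
```

Without the clipping step, the outlier in this group would have a normalized advantage of about 3; with it, the value is capped at the chosen bound.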

Paper Structure (25 sections, 10 equations, 7 figures, 6 tables, 1 algorithm)

Figures (7)

  • Figure 1: The difference between RMOT and RT-RMOT. The RMOT model takes RGB images as input, but pedestrian positions cannot be reliably obtained from RGB alone, often leading to tracking failure. The RT-RMOT model leverages thermal (T) images to obtain pedestrian contours and locations, and combines them with crosswalk regions provided by RGB images, enabling precise identification and tracking of the people described by language.
  • Figure 2: Data annotation process. The annotation process consists of three steps: (1) Target Pre-selection and Annotation; (2) GPT-Assisted Generation of Attribute Descriptions; (3) Human Verification and Refinement. First, visually similar targets are selected, and bounding boxes are annotated across the entire sequence. Then, annotated key frames and task instructions are provided to GPT to analyze the target’s category, scene, appearance, motion, and spatial position. Finally, GPT-generated attributes are integrated into initial descriptions, which are refined through review by multiple annotators to produce the final language descriptions.
  • Figure 3: Visualization results on the RefRT dataset. (a) The word cloud of the RefRT dataset contains a rich set of keywords. (b) The RefRT dataset covers diverse scenes, targets and attributes. (c) After normalizing the three-dimensional data of the language descriptions in the RefRT dataset, the broad distribution characteristics of the dataset are revealed.
  • Figure 4: Overall pipeline of RTrack. The RTrack framework consists of three modules: (1) Large-Model Perception Module: detects the target in the current frame by inference from the language description. (2) Trajectory Prediction Module: predicts the target box in the current frame based on the historical trajectory. (3) Identity Association Module: matches the trajectory box and the detection box to generate the target ID. The GSPO algorithm includes three components: (a) CAS strategy: constrains relative rewards to avoid gradient explosion. (b) Structured Output Reward: requires the model to follow the output format and limits the output length. (c) Comprehensive Detection Reward: encourages complete and accurate outputs based on the IoU value.
  • Figure 5: Qualitative zero-shot results of RTrack on the test set of the RT-RMOT task.
  • ...and 2 more figures
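
The Identity Association Module in Figure 4 matches trajectory boxes against detection boxes to assign target IDs, and the Comprehensive Detection Reward is likewise based on IoU. The Python sketch below illustrates that idea with a plain IoU computation and a greedy matcher. It is a minimal example under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, the function names `iou` and `associate` and the 0.5 threshold are hypothetical, and greedy matching is a simplification swapped in here; the caption does not say which matching rule the paper uses.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_thresh=0.5):
    """Greedily assign a track ID to each detection.

    tracks: dict mapping track ID -> predicted box for the current frame.
    detections: list of detected boxes for the current frame.
    Returns a dict mapping detection index -> track ID; detections that
    match no track receive fresh IDs. The threshold is an assumed value.
    """
    # Score every (track, detection) pair and visit the best pairs first.
    pairs = sorted(
        ((iou(tbox, dbox), tid, di)
         for tid, tbox in tracks.items()
         for di, dbox in enumerate(detections)),
        reverse=True,
    )
    assigned, used_tracks = {}, set()
    for score, tid, di in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap too little to be the same target
        if tid in used_tracks or di in assigned:
            continue
        assigned[di] = tid
        used_tracks.add(tid)
    # Unmatched detections start new trajectories with new IDs.
    next_id = max(tracks, default=0) + 1
    for di in range(len(detections)):
        if di not in assigned:
            assigned[di] = next_id
            next_id += 1
    return assigned


if __name__ == "__main__":
    tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
    detections = [(21, 21, 31, 31), (1, 0, 11, 10), (100, 100, 110, 110)]
    print(associate(tracks, detections))
```

A globally optimal assignment (for example, the Hungarian algorithm) could replace the greedy loop when many targets overlap; the greedy version is used here only to keep the example short.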