Bootstrapping Referring Multi-Object Tracking

Yani Zhang, Dongming Wu, Wencheng Han, Xingping Dong

TL;DR

RMOT extends referring grounding to multi-object and temporal contexts by introducing Refer-KITTI and Refer-KITTI-V2 benchmarks, a semi-automatic labeling pipeline, and TempRMOT, a Transformer-based model with a Temporal Enhancement Module that refines object queries using long-range spatio-temporal interactions. The approach achieves state-of-the-art results on all benchmarks, with notable improvements in HOTA (e.g., ~4% on Refer-KITTI-V2) and substantial gains in KITTI when incorporating temporal cues. By modeling temporal dynamics directly in the query representations, TempRMOT improves both grounding and tracking under diverse linguistic expressions, including implicit and motion-based descriptions. The work also provides code and data access to advance research in language-conditioned video understanding and RMOT.

Abstract

Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrained by limited language expressiveness, lacking the capacity to model object dynamics in spatial numbers and temporal states. To address these limitations, we introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking, comprehensively accounting for variations in object quantity and temporal semantics. Along with RMOT, we introduce an RMOT benchmark named Refer-KITTI-V2, featuring scalable and diverse language expressions. To efficiently generate high-quality annotations covering object dynamics with minimal manual effort, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts. In addition, we propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT. At its core is a query-driven Temporal Enhancement Module that represents each object as a Transformer query, enabling long-term spatial-temporal interactions with other objects and past frames to efficiently refine these queries. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source code and dataset are available at https://github.com/zyn213/TempRMOT.
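The abstract describes the Temporal Enhancement Module only at a high level: each object is a query that interacts with past frames and with other objects. The sketch below illustrates that general idea with plain scaled dot-product attention in NumPy. It is not the authors' implementation: the function names `attend` and `refine_queries`, the array shapes, the order of the two steps, and the absence of learned projection weights are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    """Scaled dot-product attention; q: (..., Lq, D), k and v: (..., Lk, D)."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def refine_queries(current, history):
    """Hypothetical query refinement (illustrative only).

    current: (N, D) queries for N objects in the current frame.
    history: (T, N, D) queries of the same N objects in T past frames.
    """
    # Temporal step: each object attends over its own past queries
    # plus its current query.
    per_object = np.concatenate([history, current[None]], axis=0)  # (T+1, N, D)
    per_object = per_object.transpose(1, 0, 2)                     # (N, T+1, D)
    temporal = attend(current[:, None, :], per_object, per_object)[:, 0, :]
    x = current + temporal  # residual connection

    # Spatial step: objects in the current frame attend to one another.
    spatial = attend(x, x, x)
    return x + spatial


rng = np.random.default_rng(0)
current = rng.normal(size=(5, 8))     # 5 objects, 8-dim queries (assumed sizes)
history = rng.normal(size=(4, 5, 8))  # 4 past frames
refined = refine_queries(current, history)
print(refined.shape)
```

A real module would add learned projections, normalization, and feed-forward layers; the sketch only shows how per-object temporal attention and per-frame spatial attention can be composed on a set of queries.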

Paper Structure (20 sections, 14 equations, 11 figures, 5 tables)

Figures (11)

  • Figure 1: Representative examples from RMOT. The expression query can refer to multiple objects of interest (a), and captures the short-term status with accurate labels (b).
  • Figure 1: Comparison of Refer-KITTI with existing datasets. Refer-YV means Refer-Youtube-VOS. '-' means unavailable. Exp. means Expressions.
  • Figure 2: Comparison of RMOT datasets. GroOT$^*$ represents the MOT17 subset with tracklet captions. Exp. means Expressions. Distinct exp. refers to the number of unique expressions. Refer-KITTI-V2 has the most expressions, including implicit expressions.
  • Figure 3: The Language Prompt Annotation Pipeline consists of three steps: language item collection, prompt generation, and prompt expansion. First, we use an efficient labeling tool to associate instances in each video with language elements at low human cost. Then, we manually create 2,719 accurate language descriptions. Finally, leveraging the powerful language understanding capabilities of large language models, we expand the new annotations with language descriptions.
  • Figure 4: Word Cloud of all natural language expressions on our proposed Refer-KITTI-V2.
  • ...and 6 more figures