Bootstrapping Referring Multi-Object Tracking
Yani Zhang, Dongming Wu, Wencheng Han, Xingping Dong
TL;DR
This work extends referring grounding to multi-object and temporal contexts through the RMOT task. It introduces the Refer-KITTI and Refer-KITTI-V2 benchmarks, a semi-automatic labeling pipeline, and TempRMOT, a Transformer-based model whose Temporal Enhancement Module refines object queries using long-range spatio-temporal interactions. The approach achieves state-of-the-art results on both benchmarks, with notable improvements in HOTA (e.g., ~4% on Refer-KITTI-V2) and substantial gains on KITTI when temporal cues are incorporated. By modeling temporal dynamics directly in the query representations, TempRMOT improves both grounding and tracking under diverse linguistic expressions, including implicit and motion-based descriptions. Code and data are released to support research on language-conditioned video understanding and RMOT.
Abstract
Referring understanding is a fundamental task that bridges natural language and visual content by localizing objects described in free-form expressions. However, existing works are constrained by limited language expressiveness: they lack the capacity to model object dynamics, both in the number of objects referred to and in their temporal states. To address these limitations, we introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking, comprehensively accounting for variations in object quantity and temporal semantics. Along with RMOT, we introduce an RMOT benchmark named Refer-KITTI-V2, featuring scalable and diverse language expressions. To efficiently generate high-quality annotations covering object dynamics with minimal manual effort, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts. In addition, we propose TempRMOT, an elegant end-to-end Transformer-based framework for RMOT. At its core is a query-driven Temporal Enhancement Module that represents each object as a Transformer query, enabling long-term spatial-temporal interactions with other objects and past frames to efficiently refine these queries. TempRMOT achieves state-of-the-art performance on both Refer-KITTI and Refer-KITTI-V2, demonstrating the effectiveness of our approach. The source code and dataset are available at https://github.com/zyn213/TempRMOT.
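To make the idea of query-driven temporal enhancement more concrete, the following is a minimal sketch in plain Python, not the TempRMOT implementation. It assumes each tracked object carries one query vector per frame, and it refines the current-frame queries in two attention steps: first over the same object's queries in past frames, then over the other objects in the current frame. The function names (`attend`, `temporal_enhance`), the residual updates, and the absence of learned projections, normalization, and language conditioning are all simplifying assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def temporal_enhance(history):
    """Refine the queries of the most recent frame (hypothetical helper).

    history: list of frames, oldest first; each frame maps a track id
    to that object's query vector. Objects may be missing in some frames.
    """
    current = history[-1]

    # Temporal step: each object attends over its own queries across frames.
    temporal = {}
    for track_id, query in current.items():
        past = [frame[track_id] for frame in history if track_id in frame]
        context = attend(query, past, past)
        temporal[track_id] = [q + c for q, c in zip(query, context)]  # residual

    # Spatial step: each object attends over all objects in the current frame.
    bank = list(temporal.values())
    refined = {}
    for track_id, query in temporal.items():
        context = attend(query, bank, bank)
        refined[track_id] = [q + c for q, c in zip(query, context)]  # residual
    return refined


# Toy example: two objects with 2-D queries; object 2 is absent in frame 2.
history = [
    {1: [1.0, 0.0], 2: [0.0, 1.0]},
    {1: [0.9, 0.1]},
    {1: [0.8, 0.2], 2: [0.2, 0.8]},
]
refined = temporal_enhance(history)
```

In this toy setup the refined dictionary keeps one vector per object present in the latest frame. A real model would replace the raw dot products with learned query, key, and value projections and would operate on batched tensors.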
