MLS-Track: Multilevel Semantic Interaction in RMOT
Zeliang Ma, Song Yang, Zhe Cui, Zhicheng Zhao, Fei Su, Delong Liu, Jingyu Wang
TL;DR
This work tackles referring multi-object tracking (RMOT), in which objects are tracked according to a natural-language expression, by introducing Refer-UE-City, a high-quality synthetic dataset generated with Unreal Engine 5 to reduce manual annotation costs, and MLS-Track, a multi-level semantic-guided tracking framework. MLS-Track progressively injects semantic information from the text into the visual tracking pipeline via a Semantic Guidance Module (SGM) and strengthens cross-modal grounding with a Semantic Correlation Branch (SCB) that uses CLIP to align fused features with the text space. The approach achieves state-of-the-art performance on Refer-UE-City and Refer-KITTI, and ablations show substantial gains from SGM, from SCB, and from using CLIP rather than RoBERTa for semantic alignment. These results demonstrate the practicality of synthetic data for RMOT and the effectiveness of layered cross-modal reasoning for robust, text-driven object tracking in surveillance contexts.
Abstract
A recent trend in multi-object tracking is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a new benchmark dataset, Refer-UE-City, which primarily covers intersection surveillance scenes and describes the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, making it comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object tracking framework called MLS-Track, in which the interaction between the model and the text is enhanced layer by layer through a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of the proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.
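
To make the idea of matching tracked instances against a text expression concrete, the following is a minimal illustrative sketch, not the MLS-Track implementation. It assumes each track already has a fused feature vector and that the expression has been embedded into the same space (for example by a CLIP text encoder); the function names, the toy vectors, and the 0.5 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_referred_tracks(track_features, text_embedding, threshold=0.5):
    """Return the sorted IDs of tracks whose feature is close to the text embedding.

    track_features: dict mapping track ID -> feature vector (hypothetical fused features).
    text_embedding: vector for the expression (assumed to share the feature space).
    threshold: hypothetical similarity cutoff, not a value from the paper.
    """
    selected = []
    for track_id, feature in track_features.items():
        if cosine_similarity(feature, text_embedding) >= threshold:
            selected.append(track_id)
    return sorted(selected)


# Toy 2-D example: tracks 1 and 3 point roughly in the direction of the text vector.
tracks = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.7, 0.7]}
text = [1.0, 0.0]
print(select_referred_tracks(tracks, text))  # [1, 3]
```

In a real system the similarity would be learned jointly with the tracker rather than thresholded on fixed vectors; the sketch only shows the matching step in isolation.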
