MLS-Track: Multilevel Semantic Interaction in RMOT

Zeliang Ma, Song Yang, Zhe Cui, Zhicheng Zhao, Fei Su, Delong Liu, Jingyu Wang

TL;DR

This work addresses referring multi-object tracking (RMOT), in which a natural-language expression specifies the objects to track. It introduces Refer-UE-City, a high-quality synthetic dataset generated with Unreal Engine 5 to reduce manual annotation costs, and MLS-Track, a multi-level semantic-guided tracking framework. MLS-Track progressively injects semantic information from text into the visual tracking pipeline via a Semantic Guidance Module (SGM) and strengthens cross-modal grounding with a Semantic Correlation Branch (SCB) that aligns fused features with the text space using CLIP. The approach achieves state-of-the-art performance on Refer-UE-City and Refer-KITTI, with ablations showing substantial gains from SGM and SCB and from using CLIP rather than RoBERTa for semantic alignment. These results demonstrate the practicality of synthetic data for RMOT and the effectiveness of layered cross-modal reasoning for robust, text-driven object tracking in surveillance contexts.

Abstract

The new trend in multi-object tracking is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object tracking framework called MLS-Track, where the interaction between the model and text is enhanced layer by layer through the introduction of a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of the proposed framework, which achieves state-of-the-art performance. Code and datasets will be available.

Paper Structure (18 sections, 3 equations, 8 figures, 5 tables)


Figures (8)

  • Figure 1: Examples from the proposed Refer-UE-City dataset. It provides high-quality multi-object annotations for each scene based on different language prompts. Each box color represents a unique identity.
  • Figure 2: Pipeline for generating a language-prompt dataset, consisting of three steps: generating a MOT dataset, identifying language elements, and combining them. First, we use UE5 to generate a virtual world and record labels for each trajectory within it. Second, the 3D models automatically output appearance and category information, while we manually annotate their motion information. Finally, textual descriptions are created by combining the language elements.
  • Figure 3: Statistics of Refer-UE-City: (a) word cloud; (b) distribution of the number of instances per expression; (c) distribution of the ratio of referent frames to the whole video; (d) distribution of the duration of each referenced instance.
  • Figure 4: The overall architecture of MLS-Track, an online cross-modal tracker that interacts with semantics layer by layer. First, Fusion Encode performs early fusion of visual and textual features. Before the decoder layers, the Semantic Guidance Module embeds language features into the queries. The semantic queries then decode the cross-modal embeddings. Finally, the queries are projected into the textual space to measure their similarity with the encoded textual features of clips, which supports the prediction of referenced objects.
  • Figure 5: Three designs of Semantic Guidance Module. The dashed line represents the residual structure.
  • ...and 3 more figures
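
The last step in the Figure 4 caption (projecting decoded queries into the textual space and measuring their similarity with the text features) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear projection `weight`, the use of cosine similarity, the `threshold` value, and the function names are all assumptions made for illustration, and the toy vectors stand in for real decoder outputs and text embeddings.

```python
import math


def project(query, weight):
    """Map a query vector into the text space with a linear layer (assumed form)."""
    return [sum(w * q for w, q in zip(row, query)) for row in weight]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_referents(queries, weight, text_embedding, threshold=0.5):
    """Score every query against the text embedding and keep those above threshold.

    The threshold is a hypothetical hyperparameter, not a value from the paper.
    """
    scores = [cosine(project(q, weight), text_embedding) for q in queries]
    kept = [i for i, s in enumerate(scores) if s >= threshold]
    return kept, scores


# Toy example: three 3-d queries, a 2x3 projection, and a 2-d text embedding.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
queries = [[1.0, 0.0, 5.0],
           [0.0, 1.0, -2.0],
           [1.0, 1.0, 0.0]]
text_embedding = [1.0, 0.0]

kept, scores = select_referents(queries, weight, text_embedding)
print(kept)    # indices of queries treated as referenced objects
print(scores)  # per-query similarity to the text embedding
```

In this toy setup, queries whose projected direction is close to the text embedding are kept as referenced objects and the rest are treated as non-referents. A real system would obtain the text embedding from a text encoder such as CLIP and learn the projection jointly with the tracker.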