Multi-Granularity Language-Guided Training for Multi-Object Tracking

Yuhao Li, Jiale Cao, Muzammal Naseer, Yu Zhu, Jinqiu Sun, Yanning Zhang, Fahad Shahbaz Khan

TL;DR

LG-MOT introduces language-guided training for multi-object tracking: instance- and scene-level language descriptions are encoded by a frozen CLIP text encoder and used during training to guide the visual features. The framework extends MOT datasets with language annotations and integrates two distillation losses, ISG and SPG, into a SUSHI-based tracking architecture, while inference remains visual-only. Results on MOT17, DanceTrack, and SportsMOT reach state-of-the-art performance and show strong cross-domain generalization, including IDF1 gains when training on the predominantly outdoor MOT17 and testing on the indoor DanceTrack. These results suggest that language priors can make data association more robust to occlusion, blur, and domain shift.

Abstract

Most existing multi-object tracking methods learn visual tracking features by maximizing the dissimilarity between different instances and maximizing the similarity of the same instance. While such a feature learning scheme achieves promising performance, learning discriminative features solely from visual information is challenging, especially under environmental interference such as occlusion, blur, and domain variance. In this work, we argue that multi-modal language-driven features provide complementary information to classical visual features, thereby improving robustness to such environmental interference. To this end, we propose a new multi-object tracking framework, named LG-MOT, that explicitly leverages language information at different levels of granularity (scene- and instance-level) and combines it with standard visual features to obtain discriminative representations. To develop LG-MOT, we annotate existing MOT datasets with scene- and instance-level language descriptions. We then encode both instance- and scene-level language information into high-dimensional embeddings, which are used to guide the visual features during training. At inference, LG-MOT uses the standard visual features without relying on annotated language descriptions. Extensive experiments on three benchmarks, MOT17, DanceTrack, and SportsMOT, reveal the merits of the proposed contributions, leading to state-of-the-art performance. On the DanceTrack test set, LG-MOT achieves an absolute gain of 2.2% in target object association (IDF1 score) compared to the baseline using only visual features. Further, LG-MOT exhibits strong cross-domain generalizability. The dataset and code will be available at https://github.com/WesLee88524/LG-MOT.
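
The IDF1 score mentioned in the abstract is commonly defined as 2·IDTP / (2·IDTP + IDFP + IDFN), where IDTP, IDFP, and IDFN count identity true positives, false positives, and false negatives. The sketch below illustrates that arithmetic and what an "absolute gain" in percentage points means. The function name and the counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1 score computed from identity-level match counts."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        # No detections and no ground truth: define the score as 0.
        return 0.0
    return 2 * idtp / denom


# Hypothetical counts, not taken from the paper.
baseline = idf1(idtp=600, idfp=200, idfn=200)
guided = idf1(idtp=620, idfp=190, idfn=190)

# An absolute gain is the difference of the two scores in percentage points.
gain_points = 100 * (guided - baseline)
print(f"baseline={baseline:.4f} guided={guided:.4f} gain={gain_points:.2f} points")
```

An absolute gain is a difference of percentage points, not a relative improvement over the baseline score.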

Paper Structure (15 sections, 2 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: (a) MOT datasets are extended with instance- and scene-level language descriptions to design a language-guided MOT method; example instance- and scene-level descriptions are shown for different frames of a video. (b) Intra-domain performance comparison between LG-MOT and the baseline when training on the MOT17 train set and testing on the MOT17 test set. (c) Cross-domain performance comparison when training on the MOT17 train set, which comprises predominantly outdoor scenes, and testing on the DanceTrack test set, which comprises indoor scenes. Higher is better for IDF1, HOTA, and MOTA, whereas lower is better for IDSW. LG-MOT outperforms the baseline that uses only visual information.
  • Figure 2: Overview of the annotation pipeline and the LG-MOT framework. We first feed each instance crop to three frozen visual-language models to obtain textual descriptions of the instance's tag, attributes, and caption. A Large Language Model, prompted with designed questions, then produces the instance-level language descriptions. Because the scenes are few and easily distinguishable, we label scene-level descriptions manually. During training, the ISG module aligns each node embedding $\phi(b_i^k)$ with the instance-level description embeddings $\varphi_i$, while the SPG module aligns the edge embeddings $\hat{E}_{(u,v)}$ with the scene-level description embeddings $\varphi_s$ to guide correlation estimation after message passing. Our approach does not require language descriptions during inference.
  • Figure 3: Examples of the multi-granularity language annotations. Each scene has only one scene-level language description, and each object in the sequence has only one instance-level language description.
  • Figure 4: Visualization of object identity preservation. (a) Intra-domain evaluation: the model is trained and tested on the MOT17 dataset. (b) Cross-domain evaluation: the model is trained on the MOT17 dataset and tested on the DanceTrack test set. In both settings, the model trained with scene- and instance-level language descriptions keeps tracking the same object without an ID switch. Language guidance therefore improves tracking performance on data from the training distribution and also improves generalization. Additional results are presented in the supplementary material.
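
The training-time guidance described in the Figure 2 caption (ISG aligning node embeddings with instance-level description embeddings, SPG aligning edge embeddings with the scene-level description embedding) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the "1 - cosine similarity" loss form, the unweighted sum of the two terms, and all function names are hypothetical, and the visual and text embeddings are assumed to have already been projected to the same dimensionality.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_loss(visual: Sequence[Vector], text: Sequence[Vector]) -> float:
    """Mean (1 - cosine) over paired visual and text embeddings (assumed loss form)."""
    if len(visual) != len(text):
        raise ValueError("visual and text embeddings must be paired one-to-one")
    if not visual:
        return 0.0
    return sum(1.0 - cosine(v, t) for v, t in zip(visual, text)) / len(visual)


def language_guidance_loss(
    node_embs: Sequence[Vector],           # stands in for phi(b_i^k)
    instance_text_embs: Sequence[Vector],  # stands in for varphi_i, one per node
    edge_embs: Sequence[Vector],           # stands in for E_hat_(u,v)
    scene_text_emb: Vector,                # stands in for varphi_s, shared by all edges
    training: bool = True,
) -> float:
    """Hypothetical combined ISG + SPG term; skipped entirely at inference."""
    if not training:
        # Inference is visual-only, so no language term is computed.
        return 0.0
    isg = alignment_loss(node_embs, instance_text_embs)
    spg = alignment_loss(edge_embs, [scene_text_emb] * len(edge_embs))
    return isg + spg
```

In this sketch the text embeddings act as fixed targets, mirroring the frozen text encoder, and the language term is added to the usual tracking loss only during training, so the inference path needs no descriptions.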