Vision-Motion-Reference Alignment for Referring Multi-Object Tracking via Multi-Modal Large Language Models

Weiyi Lv, Ning Zhang, Hanyang Sun, Haoran Jiang, Kai Zhao, Jing Xiao, Dan Zeng

TL;DR

This work tackles Referring Multi-Object Tracking (RMOT) by addressing the misalignment between static language references and dynamic object motion. It introduces VMRMOT, a vision–motion–reference framework that derives a motion modality from object trajectories using multi-modal large language models (MLLMs) and fuses it with vision and reference cues through a hierarchical Vision–Motion–Reference Alignment (VMRA) module and a Motion-Guided Prediction Head (MGPH). The approach achieves state-of-the-art results on Refer-KITTI and Refer-KITTI-V2, with gains in HOTA, DetA, and IDF1, and highlights the value of motion-aware descriptions and LoRA fine-tuning of the MLLM. Overall, VMRMOT offers a cross-modal RMOT solution suited to tracking tasks that require temporal understanding of motion together with natural-language references.

Abstract

Referring Multi-Object Tracking (RMOT) extends conventional multi-object tracking (MOT) by introducing natural language references for multi-modal fusion tracking. Existing RMOT benchmarks describe only an object's appearance, relative position, and initial motion state. Such static references fail to capture dynamic changes in object motion, including changes in velocity and shifts in motion direction. This limitation not only causes a temporal discrepancy between the static references and the dynamic vision modality but also constrains multi-modal tracking performance. To address it, we propose a novel Vision-Motion-Reference aligned RMOT framework, named VMRMOT. It integrates a motion modality extracted from object dynamics to enhance the alignment between the vision modality and language references through multi-modal large language models (MLLMs). Specifically, we introduce motion-aware descriptions derived from objects' dynamic behaviors and, leveraging the temporal-reasoning capabilities of MLLMs, extract motion features as the motion modality. We further design a Vision-Motion-Reference Alignment (VMRA) module to hierarchically align visual queries with motion and reference cues, enhancing their cross-modal consistency. In addition, a Motion-Guided Prediction Head (MGPH) is developed to exploit the motion modality and improve the performance of the prediction head. To the best of our knowledge, VMRMOT is the first approach to employ MLLMs in the RMOT task for vision-reference alignment. Extensive experiments on multiple RMOT benchmarks demonstrate that VMRMOT outperforms existing state-of-the-art methods.

Paper Structure

This paper contains 28 sections, 25 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: (a) illustrates the temporal discrepancy between the static reference “left cars which are parking” and the dynamic vision modality of objects 3 and 4, which leads to false positives (FP) and false negatives (FN). (b) compares previous RMOT frameworks, which integrate only the vision and reference modalities, with our proposed VMRMOT, which incorporates vision, motion, and reference modalities. (c) shows the HOTA-DetA-AssA comparison of different RMOT trackers on the Refer-KITTI dataset. Our VMRMOT achieves $53.00\%$ HOTA, $41.13\%$ DetA, and $68.41\%$ AssA.
  • Figure 2: The overall architecture of VMRMOT. VMRMOT consists of four parts: a frozen transformer, a motion feature extraction module, a vision–motion–reference alignment (VMRA) module, and a motion-guided prediction head (MGPH).
  • Figure 3: Illustration of the motion feature extraction pipeline. It consists of two stages: first, historical trajectories are converted into compact motion-aware descriptions; then, MLLMs are employed to extract motion features from these descriptions.
  • Figure 4: Illustration of MGPH, where motion and reference embeddings are fused to produce motion- and reference-aware features for prediction. MGPH consists of three branches: a class branch, a box branch, and a referring branch.
  • Figure 5: Qualitative comparison between TempRMOT and VMRMOT on the Refer-KITTI dataset. Red arrows indicate noteworthy objects. Boxes of the same color represent the same ID. Best viewed in color and zoomed in.
  • ...and 3 more figures
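
The first stage named in the Figure 3 caption, converting historical trajectories into compact motion-aware descriptions, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `describe_motion`, the thresholds, and the wording of the output phrases are all hypothetical, and the rule-based comparison of early versus late displacement is only one plausible way to surface the velocity changes and direction shifts the abstract mentions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) box center in image coordinates


def describe_motion(
    centers: List[Point],
    still_eps: float = 1.0,    # hypothetical: pixels/frame below which an object counts as still
    speed_ratio: float = 1.5,  # hypothetical: speed ratio that counts as a velocity change
    turn_deg: float = 30.0,    # hypothetical: heading change that counts as a direction shift
) -> str:
    """Summarize a trajectory as a short motion-aware phrase (illustrative rules only)."""
    if len(centers) < 3:
        return "insufficient history"

    # Per-frame displacement vectors between consecutive box centers.
    steps = [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(centers, centers[1:])]

    # Compare the mean displacement of the earlier half with that of the later half.
    half = len(steps) // 2
    early, late = steps[:half], steps[half:]

    def mean_vec(vs: List[Point]) -> Point:
        return (sum(v[0] for v in vs) / len(vs), sum(v[1] for v in vs) / len(vs))

    ex, ey = mean_vec(early)
    lx, ly = mean_vec(late)
    early_speed = math.hypot(ex, ey)
    late_speed = math.hypot(lx, ly)

    # Handle the cases where the object is still in one or both halves.
    if early_speed < still_eps and late_speed < still_eps:
        return "stationary"
    if early_speed < still_eps:
        return "starts moving"
    if late_speed < still_eps:
        return "comes to a stop"

    # Velocity change: ratio of late to early speed.
    if late_speed > speed_ratio * early_speed:
        parts = ["accelerating"]
    elif early_speed > speed_ratio * late_speed:
        parts = ["decelerating"]
    else:
        parts = ["moving at steady speed"]

    # Direction shift: angle between the early and late mean displacement.
    angle = math.degrees(math.atan2(ex * ly - ey * lx, ex * lx + ey * ly))
    if abs(angle) > turn_deg:
        parts.append("changing direction")

    return ", ".join(parts)
```

In a pipeline like the one the caption describes, a phrase of this kind would then be passed to an MLLM to obtain motion features; that second stage depends on the specific model and is not sketched here.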