ReferGPT: Towards Zero-Shot Referring Multi-Object Tracking

Tzoulio Chamiti, Leandro Di Bella, Adrian Munteanu, Nikos Deligiannis

TL;DR

ReferGPT introduces a zero-shot Referring Multi-Object Tracking framework that fuses 3D spatial information with a Multi-Modal Large Language Model to generate spatially grounded object captions. A hybrid matching module combining CLIP-based semantic embeddings and fuzzy substring matching aligns these captions with user queries, enabling open-set referring without task-specific training. The method, built on a tracking-by-detection backbone with 3D Kalman filtering, achieves competitive HOTA scores on Refer-KITTI datasets and demonstrates strong association accuracy, while ablations validate the contribution of each component. Although computationally intensive due to MLLM usage, ReferGPT establishes a flexible, training-free approach to RMOT with potential for efficiency improvements through model distillation and faster inference.

Abstract

Tracking multiple objects based on textual queries is a challenging task that requires linking language understanding with object association across frames. Previous works typically either train the whole process end-to-end or integrate an additional referring text module into a multi-object tracker, but both approaches require supervised training and may struggle to generalize to open-set queries. In this work, we introduce ReferGPT, a novel zero-shot referring multi-object tracking framework. We provide a multi-modal large language model (MLLM) with spatial knowledge, enabling it to generate 3D-aware captions. This enhances its descriptive capabilities and supports a more flexible referring vocabulary without training. We also propose a robust query-matching strategy that leverages CLIP-based semantic encoding and fuzzy matching to associate MLLM-generated captions with user queries. Extensive experiments on Refer-KITTI, Refer-KITTIv2 and Refer-KITTI+ demonstrate that ReferGPT achieves competitive performance against trained methods, showcasing its robustness and zero-shot capabilities in autonomous driving. The code is available at https://github.com/Tzoulio/ReferGPT
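To make the query-matching idea concrete, here is a minimal sketch of a hybrid score that combines a semantic similarity with a fuzzy substring match. It is an illustration under stated assumptions, not the paper's implementation: the `embed` function is a bag-of-words stand-in for a CLIP text encoder, the fuzzy score uses Python's standard `difflib.SequenceMatcher` over sliding word windows, and the equal-weight sum controlled by `alpha` is a hypothetical way of combining the two terms. None of these names or weights come from the source.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def embed(text):
    """Stand-in for a CLIP text encoder (assumption): a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fuzzy_substring(query, caption):
    """Best similarity ratio between the query and any caption window
    spanning the same number of words as the query."""
    q = query.lower().split()
    c = caption.lower().split()
    if not q or not c:
        return 0.0
    n = min(len(q), len(c))
    q_text = " ".join(q)
    return max(
        SequenceMatcher(None, q_text, " ".join(c[i:i + n])).ratio()
        for i in range(len(c) - n + 1)
    )


def total_score(query, caption, alpha=0.5):
    """Hypothetical combined score: weighted sum of semantic and fuzzy terms."""
    semantic = cosine(embed(query), embed(caption))
    fuzzy = fuzzy_substring(query, caption)
    return alpha * semantic + (1 - alpha) * fuzzy


if __name__ == "__main__":
    query = "black car"
    captions = [
        "a black car moving left of the ego vehicle",
        "a pedestrian walking on the right",
    ]
    for caption in captions:
        print(round(total_score(query, caption), 3), caption)
```

In a real pipeline the stand-in `embed` would be replaced by actual CLIP text embeddings, and each track's score would be compared against a threshold to decide whether the track is referred to by the query.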

Paper Structure

This paper contains 14 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: Comparison between ReferGPT and previous RMOT methods. (a) End-to-end models jointly learn tracking and referring. (b) Tracking-by-detection frameworks follow a modular approach and train the referring text module. (c) Our method builds on (b) while eliminating the need for training, enabling zero-shot referring MOT.
  • Figure 2: Overview of the proposed ReferGPT framework. Given LiDAR $I_{t,3D}$ and image $I_{t,2D}$ inputs, a 3D object detector extracts object candidates $R_{t,3D}$, which are then tracked using a tracking-by-detection approach with a Kalman filter for trajectory prediction. A Multi-Modal Large Language Model generates descriptive captions $\mathbf{D}_t^i$ for each object by leveraging object coordinates $C^i$ and appearance features $I_c^i$. These captions are then matched against the referring query $Q$ using a matching module. The final matched trajectories $T^i$ are filtered and associated with the query to produce the final output.
  • Figure 3: Our Matching Module. Given an object description $\mathbf{D}_t^i$ and a referring query $Q$, we calculate the total matching score $S_T$ between them.
  • Figure : Query: Car in black
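
The Figure 2 caption mentions a Kalman filter for trajectory prediction inside the tracking-by-detection stage. The sketch below shows what a predict/update cycle of that kind can look like for a 3D object center. It rests on several assumptions not stated in the source: a constant-velocity motion model, independent filtering per axis, position-only measurements, and illustrative noise values `q` and `r`. The class names `AxisKalman` and `Track3D` are hypothetical.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one coordinate axis (assumed model).
    State is (position, velocity); only position is measured."""

    def __init__(self, pos, q=0.01, r=0.1):
        self.p, self.v = pos, 0.0
        self.P = [[1.0, 0.0], [0.0, 10.0]]  # initial state covariance
        self.q, self.r = q, r               # process / measurement noise

    def predict(self, dt=1.0):
        """Propagate state and covariance one step: P <- F P F^T + q I."""
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the state with a position measurement z."""
        (a, b), (c, d) = self.P
        s = a + self.r                # innovation covariance
        k0, k1 = a / s, c / s         # Kalman gain
        y = z - self.p                # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b], [c - k1 * a, d - k1 * b]]


class Track3D:
    """One tracked object: an independent filter per x, y, z axis."""

    def __init__(self, center):
        self.axes = [AxisKalman(c) for c in center]

    def predict(self, dt=1.0):
        return tuple(ax.predict(dt) for ax in self.axes)

    def update(self, center):
        for ax, z in zip(self.axes, center):
            ax.update(z)


if __name__ == "__main__":
    track = Track3D((0.0, 5.0, 0.0))
    for t in range(1, 21):
        track.predict()                           # predict to the current frame
        track.update((1.0 * t, 5.0, 0.5 * t))     # associate with a detection
    print(track.predict())                        # predicted center for the next frame
```

In a full tracker the predicted center would be used to associate each existing track with the detections of the next frame before calling `update`; that association step is omitted here.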