Open3DTrack: Towards Open-Vocabulary 3D Multi-Object Tracking

Ayesha Ishaq, Mohamed El Amine Boudjoghra, Jean Lahoud, Fahad Shahbaz Khan, Salman Khan, Hisham Cholakkal, Rao Muhammad Anwer

TL;DR

Open3DTrack tackles open-vocabulary 3D multi-object tracking by reformulating 3D MOT to handle unseen object categories. It adapts a state-of-the-art 3D tracker into a class-agnostic framework and combines 3D proposals with 2D open-vocabulary detections via a vision-language model to label tracks, while introducing Track Consistency Scoring and a confidence-prediction head to stabilize predictions. The method is evaluated on new open-vocabulary tracking splits derived from nuScenes, which test generalization to novel classes, and it improves over baselines across multiple detectors. The work advances autonomous driving perception by enabling tracking of unseen objects in real-world 3D scenes, with code, models, and dataset splits publicly available.

Abstract

3D multi-object tracking plays a critical role in autonomous driving by enabling the real-time monitoring and prediction of multiple objects' movements. Traditional 3D tracking systems are typically constrained by predefined object categories, limiting their adaptability to novel, unseen objects in dynamic environments. To address this limitation, we introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories. We formulate the problem of open-vocabulary 3D tracking and introduce dataset splits designed to represent various open-vocabulary scenarios. We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes. Our method effectively reduces the performance gap between tracking known and novel objects through strategic adaptation. Experimental results demonstrate the robustness and adaptability of our method in diverse outdoor driving scenarios. To the best of our knowledge, this work is the first to address open-vocabulary 3D tracking, presenting a significant advancement for autonomous systems in real-world settings. Code, trained models, and dataset splits are available publicly.

Paper Structure (14 sections, 2 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Open3DTrack. We train the tracking system on classes from $\mathcal{C}^{base}$ and prompt it with unseen categories $\textit{bicycle, bus} \in \mathcal{C}^{novel}$ at test time. Here, we show the tracked 3D bounding boxes projected on multiple frames and views. Our method can track and label both known and unknown object classes across consecutive 3D frames.
  • Figure 2: System overview. Open3DTrack leverages 3D proposals from base classes $\mathcal{C}^{base}$ to train the 3D tracker in a class-agnostic manner, enabling it to construct tracks and predict confidence scores. During inference, the system classifies 3D proposals using open-vocabulary categories from both base and novel classes $\mathcal{C}^{base} \cup \mathcal{C}^{novel}$, utilizing 2D image cues and a pretrained vision-language model. This process labels the tracks output by the tracker, effectively generating open-vocabulary 3D tracks.
  • Figure 3: Open-vocabulary dataset splits. Our proposed dataset splits for known and unknown categories are based on statistics from the nuScenes dataset, which are shown here. The first split considers the least occurring objects as novel. In the second split, we analyze the motion patterns of objects in urban settings, motivating our urban vs. highway split. Novel classes in this split exhibited the highest percentage of movement. In the final split, we use the average points per box to ensure a similarly diverse object distribution in the novel split in terms of object volume.
  • Figure 4: Qualitative comparison. We show the output of our method and the baseline on challenging examples from nuScenes. The top two rows show an occluded novel-class object, a bus, which our method tracks successfully while its ID changes in the baseline output. The bottom two rows show a blurred novel-class object, a pedestrian, under poor lighting; our method tracks it, whereas the baseline produces a fragmented track. Additional results are presented in the accompanying video.
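
The first split described in Figure 3, where the least occurring objects are treated as novel, can be illustrated with a minimal sketch. This is not the authors' code: the function name `rare_class_split`, the `num_novel` parameter, and the toy label list are hypothetical and chosen only for illustration, and the counts are not nuScenes statistics.

```python
from collections import Counter


def rare_class_split(labels, num_novel):
    """Return (base, novel) class sets; the num_novel rarest classes are novel.

    Hypothetical helper for illustration only, not the paper's implementation.
    """
    counts = Counter(labels)
    # Rank classes by ascending frequency; ties are broken by name so the
    # resulting split is deterministic.
    ranked = sorted(counts, key=lambda name: (counts[name], name))
    novel = set(ranked[:num_novel])
    base = set(ranked[num_novel:])
    return base, novel


# Toy annotations (assumed values, not real dataset statistics).
toy_labels = ["car"] * 5 + ["pedestrian"] * 3 + ["bus"] * 2 + ["bicycle"]
base, novel = rare_class_split(toy_labels, num_novel=2)
print("base:", sorted(base))
print("novel:", sorted(novel))
```

With a split like this, a tracker would be trained only on boxes whose class is in `base` and evaluated on both sets. The other two splits in Figure 3 would rank classes by a different statistic (percentage of moving objects, or average points per box) rather than by frequency.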