NOVA: Next-step Open-Vocabulary Autoregression for 3D Multi-Object Tracking in Autonomous Driving

Kai Luo, Xu Wang, Rui Fan, Kailun Yang

TL;DR

Next-step Open-Vocabulary Autoregression (NOVA) is a paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling by reformulating 3D trajectories as structured spatio-temporal semantic sequences.

Abstract

Generalizing across unknown targets is critical for open-world perception, yet existing 3D Multi-Object Tracking (3D MOT) pipelines remain limited by closed-set assumptions and "semantic-blind" heuristics. To address this, we propose Next-step Open-Vocabulary Autoregression (NOVA), an innovative paradigm that shifts 3D tracking from traditional fragmented distance-based matching toward generative spatio-temporal semantic modeling. NOVA reformulates 3D trajectories as structured spatio-temporal semantic sequences, enabling the simultaneous encoding of physical motion continuity and deep linguistic priors. By leveraging the autoregressive capabilities of Large Language Models (LLMs), we transform the tracking task into a principled process of next-step sequence completion. This mechanism allows the model to explicitly utilize the hierarchical structure of language space to resolve fine-grained semantic ambiguities and maintain identity consistency across complex long-range sequences through high-level commonsense reasoning. Extensive experiments on nuScenes, V2X-Seq-SPD, and KITTI demonstrate the superior performance of NOVA. Notably, on the nuScenes dataset, NOVA achieves an AMOTA of 22.41% for Novel categories, yielding a significant 20.21% absolute improvement over the baseline. These gains are realized through a compact 0.5B autoregressive model. Code will be available at https://github.com/xifen523/NOVA.

Paper Structure (12 sections, 4 equations, 3 figures, 7 tables)

This paper contains 12 sections, 4 equations, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Comparison of 3D MOT paradigms under open-world category shifts. (a) Closed-set: a category-specific 3D detector yields class-limited proposals; unseen objects are suppressed and poorly tracked. (b) Semi-open-vocabulary: 2D open-vocabulary predictions are projected onto closed-set 3D proposals; tracking still relies on a downstream association tracker. (c) Open-vocabulary (ours): an open-vocabulary 3D detector outputs labeled 3D detections, and NOVA robustly associates both Base and Novel objects online.
  • Figure 2: The pipeline of the proposed NOVA framework. (1) Open-Vocabulary 3D Detection: Multi-modal inputs are processed to generate detections $D_t$ with open-vocabulary labels. (2) Serialization & Hybrid Prompting: A Geometry Encoder projects raw box features $f_{\text{raw}}$ into embeddings $E_{\text{geo}}$. These are interleaved with text using a Hybrid Prompting strategy that explicitly masks novel class labels (e.g., Unknown) to enforce geometric learning. (3) Autoregressive Association: The LLM predicts the probability of "Yes" for each candidate pair; these probabilities form a cost matrix for Hungarian matching, which drives the online Lifecycle Management (Birth/Death) of trajectories.
  • Figure 3: Qualitative comparison of OV-3D-MOT across datasets. Four representative scenes from different autonomous-driving benchmarks, each visualized over four consecutive frames ($t{=}1{\sim}4$). For each scene, we compare Open3DTrack (Ishaq et al., 2025) (top) with NOVA (bottom). Callouts and arrows highlight typical open-vocabulary tracking failure cases, including Class switch, ID switch, and ID/Class switch.
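
The association step in the Figure 2 caption (Yes probabilities, cost matrix, Hungarian matching, Birth/Death) can be illustrated with a minimal sketch. It assumes the LLM's Yes probabilities are already available as a tracks-by-detections matrix. The function names `hungarian` and `associate`, the cost definition `1 - p`, and the acceptance threshold of 0.5 are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def hungarian(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Minimum-cost assignment for a square cost matrix (Kuhn-Munkres)."""
    n = len(cost)
    inf = float("inf")
    u = [0.0] * (n + 1)          # row potentials (1-based)
    v = [0.0] * (n + 1)          # column potentials (1-based)
    p = [0] * (n + 1)            # p[j]: row assigned to column j, 0 if free
    way = [0] * (n + 1)          # back-pointers for the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                # flip assignments along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return [(p[j] - 1, j - 1) for j in range(1, n + 1)]


def associate(yes_prob: List[List[float]], num_dets: int, threshold: float = 0.5):
    """Match tracks to detections from Yes probabilities.

    yes_prob[i][j] is the (assumed) probability that detection j continues
    track i. Returns (matches, unmatched_tracks, unmatched_detections).
    """
    num_tracks = len(yes_prob)
    if num_tracks == 0 or num_dets == 0:
        return [], list(range(num_tracks)), list(range(num_dets))
    size = max(num_tracks, num_dets)
    # Pad to a square matrix; padded cells get the worst cost (probability 0).
    cost = [[1.0 - yes_prob[i][j] if i < num_tracks and j < num_dets else 1.0
             for j in range(size)] for i in range(size)]
    matches = sorted(
        (i, j) for i, j in hungarian(cost)
        if i < num_tracks and j < num_dets and yes_prob[i][j] >= threshold
    )
    matched_tracks = {i for i, _ in matches}
    matched_dets = {j for _, j in matches}
    unmatched_tracks = [i for i in range(num_tracks) if i not in matched_tracks]
    unmatched_dets = [j for j in range(num_dets) if j not in matched_dets]
    return matches, unmatched_tracks, unmatched_dets


# Two existing tracks, three detections in the current frame.
probs = [[0.9, 0.2, 0.1],
         [0.3, 0.8, 0.05]]
matches, lost, new = associate(probs, num_dets=3)
```

In this sketch, unmatched detections (`new`) would be candidates for trajectory Birth, and tracks that stay unmatched (`lost`) for several frames would be candidates for Death. The exact lifecycle rules are not given in the text above, so they are left out here.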