Delving into Dynamic Scene Cue-Consistency for Robust 3D Multi-Object Tracking

Haonan Zhang, Xinyao Wang, Boxi Wu, Tu Zheng, Wang Yunhua, Zheng Yang

TL;DR

DSC-Track addresses robust 3D multi-object tracking in dynamic scenes by shifting focus from individual object motion to cue-consistency across spatial neighborhoods. It introduces a unified spatiotemporal encoder using rotation-invariant Point Pair Features, a Geometric Inject Attention mechanism, a temporal track token, and a cue-consistent cross-attention module with a memory-update scheme. The approach achieves state-of-the-art AMOTA on nuScenes (val and test) and shows strong generalization to Waymo, while running at real-time speeds. This work demonstrates the effectiveness of higher-order spatial relational cues for stable data association in crowded driving scenarios.

Abstract

3D multi-object tracking is a critical and challenging task in the field of autonomous driving. A common paradigm relies on modeling individual object motion, e.g., Kalman filters, to predict trajectories. While effective in simple scenarios, this approach often struggles in crowded environments or with inaccurate detections, as it overlooks the rich geometric relationships between objects. This highlights the need to leverage spatial cues. However, existing geometry-aware methods can be susceptible to interference from irrelevant objects, leading to ambiguous features and incorrect associations. To address this, we propose focusing on cue-consistency: identifying and matching stable spatial patterns over time. We introduce the Dynamic Scene Cue-Consistency Tracker (DSC-Track) to implement this principle. Firstly, we design a unified spatiotemporal encoder using Point Pair Features (PPF) to learn discriminative trajectory embeddings while suppressing interference. Secondly, our cue-consistency transformer module explicitly aligns consistent feature representations between historical tracks and current detections. Finally, a dynamic update mechanism preserves salient spatiotemporal information for stable online tracking. Extensive experiments on the nuScenes and Waymo Open Datasets validate the effectiveness and robustness of our approach. On the nuScenes benchmark, for instance, our method achieves state-of-the-art performance, reaching 73.2% and 70.3% AMOTA on the validation and test sets, respectively.
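To make the rotation-invariance idea concrete, here is a minimal sketch of a classic point pair feature computed for two oriented objects. It is an illustration under stated assumptions, not the paper's formulation: the function names `ppf` and `rigid_yaw` are hypothetical, object heading vectors stand in for the surface normals of the classic feature, and the exact feature used by DSC-Track may differ.

```python
import math


def _angle(u, v):
    """Angle in radians between two non-zero 3D vectors."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    cos = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding


def ppf(c1, h1, c2, h2):
    """Classic four-component point pair feature (hypothetical helper).

    c1, c2: distinct 3D centers; h1, h2: heading vectors (an assumption,
    used here in place of surface normals).
    Returns (distance, angle(h1, d), angle(h2, d), angle(h1, h2)).
    """
    d = [b - a for a, b in zip(c1, c2)]
    dist = math.sqrt(sum(x * x for x in d))
    return (dist, _angle(h1, d), _angle(h2, d), _angle(h1, h2))


def rigid_yaw(p, yaw, t=(0.0, 0.0, 0.0)):
    """Rotate point p about the z-axis by yaw, then translate by t."""
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = p
    return (c * x - s * y + t[0], s * x + c * y + t[1], z + t[2])


# Two objects in the original frame.
c1, h1 = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)
c2, h2 = (3.0, 4.0, 0.0), (0.0, 1.0, 0.0)
before = ppf(c1, h1, c2, h2)

# The same scene seen from a rotated, shifted frame (e.g. after ego motion).
yaw, shift = 0.7, (10.0, -2.0, 0.5)
after = ppf(
    rigid_yaw(c1, yaw, shift), rigid_yaw(h1, yaw),  # headings rotate only
    rigid_yaw(c2, yaw, shift), rigid_yaw(h2, yaw),
)
```

Because the feature is built only from a distance and relative angles, `before` and `after` should agree up to floating-point error, which is the property that makes such pairwise cues stable under viewpoint changes.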

Paper Structure

This paper contains 40 sections, 12 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison of a conventional tracking method and our proposed DSC-Track. Left: Previous methods, such as those relying on individual object motion, are prone to ID Switches (IDS) when faced with ambiguous associations, like objects in close proximity. Right: Our method explicitly models the spatial cue-consistency among all targets using a Transformer, enabling robust and correct associations in challenging scenarios.
  • Figure 2: The overall architecture of our DSC-Track framework. Our model takes historical track information and new 3D detections as input. (1) The Unified Spatio-Temporal Aggregation module first generates a discriminative feature representation, $\{\hat{\mathbf{Z}}_m\}$, for each track by leveraging its historical and spatial context. (2) Then, the Cue-Consistency Transformer lets these track features interact with the detection features ($\mathbf{B}^t$) to mine consistent cues, yielding enhanced representations for both. (3) Finally, in the Matching and Update stage, an affinity matrix is computed from these enhanced features for data association, and the memory buffer is updated for the next frame.
  • Figure 3: Left: Illustration of the rotation-invariance of our Point Pair Feature (PPF). Right: Details of the Geometric Inject Attention (GIA) module used for aggregation.
  • Figure 4: Qualitative results of DSC-Track on the nuScenes validation set. Our tracker successfully handles a long-term occlusion during a turn by leveraging stable geometric cues from the environment (e.g., the roadside), correctly re-identifying the target (ID 6) upon reappearance.
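
The Matching and Update stage described in Figure 2 can be illustrated with a small sketch. Everything here is an assumption for illustration: the captions do not say how the affinity matrix is computed or how the assignment is solved, so this example uses cosine similarity between embeddings and a greedy one-to-one matcher with a score threshold. The names `cosine_affinity` and `greedy_match` are hypothetical, and a Hungarian-style optimal assignment is a common alternative to the greedy step.

```python
import math


def cosine_affinity(track_feats, det_feats):
    """Affinity matrix A[i][j] = cosine similarity of track i and detection j."""
    def unit(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0  # guard zero vectors
        return [x / n for x in v]

    tracks = [unit(t) for t in track_feats]
    dets = [unit(d) for d in det_feats]
    return [[sum(a * b for a, b in zip(t, d)) for d in dets] for t in tracks]


def greedy_match(affinity, threshold=0.5):
    """Greedy one-to-one association (assumed scheme, not the paper's).

    Visits (track, detection) pairs from highest to lowest affinity and
    keeps a pair if both sides are still free and the score >= threshold.
    Returns a sorted list of (track_index, detection_index) tuples.
    """
    candidates = sorted(
        ((score, i, j)
         for i, row in enumerate(affinity)
         for j, score in enumerate(row)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # all remaining pairs score lower
        if i in used_tracks or j in used_dets:
            continue
        used_tracks.add(i)
        used_dets.add(j)
        matches.append((i, j))
    return sorted(matches)


# Toy embeddings: two tracks, three detections (the last matches neither).
track_feats = [[1.0, 0.0], [0.0, 1.0]]
det_feats = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]

affinity = cosine_affinity(track_feats, det_feats)
matches = greedy_match(affinity)
unmatched_dets = sorted(set(range(len(det_feats))) - {j for _, j in matches})
```

In this toy case track 0 pairs with detection 1 and track 1 with detection 0, while detection 2 stays unmatched. In a typical tracker, unmatched detections would seed new tracks and the matched features would be written back to the memory buffer for the next frame.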