ADA-Track++: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association

Shuxiao Ding, Lukas Schneider, Marius Cordts, Juergen Gall

TL;DR

ADA-Track++ is a novel end-to-end framework for 3D MOT from multi-view cameras. It introduces a learnable data association module based on edge-augmented cross-attention that leverages appearance and geometric features, and it adds an auxiliary token to this attention-based association module.

Abstract

Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm, utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention, however, entangles detection and tracking queries in one embedding for both the detection and tracking task, which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm and detect objects using decoupled track and detection queries followed by a subsequent association. These methods, however, do not leverage synergies between the detection and association task. Combining the strengths of both paradigms, we introduce ADA-Track++, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention, leveraging appearance and geometric features. We also propose an auxiliary token in this attention-based association module, which helps mitigate disproportionately high attention to incorrect association targets caused by attention normalization. Furthermore, we integrate this association module into the decoder layer of a DETR-based 3D detector, enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers, queries are refined for the detection and association task alternately, effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms.
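To make the alternating refinement more concrete, the following is a minimal single-head NumPy sketch of a query-to-query, edge-augmented cross-attention step between detection queries and track queries. It is an illustration under stated assumptions, not the authors' implementation: the function name, the weight matrices, the additive edge bias, and the way the edge features are updated are all hypothetical simplifications, and the query-to-image cross-attention that each decoder layer also performs is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edge_augmented_cross_attention(det_q, trk_q, edge, w_q, w_k, w_v, w_e):
    """Hypothetical single-head step (a simplification, not the paper's layer).

    det_q: (N_d, C) detection queries
    trk_q: (N_t, C) track queries
    edge:  (N_d, N_t, C_e) edge features, one per detection-track pair
    """
    q = det_q @ w_q                              # (N_d, C)
    k = trk_q @ w_k                              # (N_t, C)
    v = trk_q @ w_v                              # (N_t, C)
    logits = q @ k.T / np.sqrt(q.shape[-1])      # query similarity, (N_d, N_t)
    logits = logits + edge @ w_e                 # assumed additive bias from edge features
    attn = softmax(logits, axis=-1)              # normalize over tracks per detection
    det_out = det_q + attn @ v                   # residual update of detection queries
    edge_out = edge + logits[..., None]          # assumed edge update from the logits
    return det_out, edge_out, attn

rng = np.random.default_rng(0)
n_det, n_trk, c, c_e = 4, 3, 8, 2
det_q = rng.normal(size=(n_det, c))
trk_q = rng.normal(size=(n_trk, c))
edge = np.zeros((n_det, n_trk, c_e))             # zero-initialized edge features
w_q, w_k, w_v = (rng.normal(size=(c, c)) * 0.1 for _ in range(3))
w_e = rng.normal(size=(c_e,)) * 0.1

# Stacking layers: queries and edges are refined repeatedly. In the full
# model, a detection step (self-attention plus query-to-image
# cross-attention) would run before each association step.
for _ in range(2):
    det_q, edge, attn = edge_augmented_cross_attention(
        det_q, trk_q, edge, w_q, w_k, w_v, w_e
    )
```

Sharing the weights across the two layers keeps the sketch short; a real decoder would normally use separate parameters per layer.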

Paper Structure (43 sections, 4 equations, 4 figures, 18 tables)


Figures (4)

  • Figure 1: Different paradigms of query-based MOT. Our proposed paradigm leverages the advantages of the coupled architecture of tracking-by-attention and the decoupled task-specific queries of tracking-by-detection.
  • Figure 2: Overview of our ADA-Track framework. The transformer decoder takes decoupled track and detection queries, zero-initialized edge features, and multi-view image features as input. Each decoder layer first refines query features using a self-attention and a query-to-image cross-attention for object detection. Then a query-to-query edge-augmented cross-attention is applied to refine detection query and edge features for data association. By stacking this decoder layer, query features are updated for both tasks alternately and iteratively. A track update module associates both query sets and produces track queries for the next frame.
  • Figure 3: Target assignments: Tracking-by-attention (a) applies identity-guided matching for track queries and then matches detection queries to the remaining ground truths using the Hungarian algorithm. Our method (b) employs the same matching rules for both query types, but detection queries are matched to all ground truths.
  • Figure 4: The difference in attention before and after softmax normalization. Ground-truth track IDs are represented by unique colors. Without the auxiliary token: although the yellow detection query does not have a corresponding track and its pre-softmax attentions are low everywhere, it still receives relatively high attention to the blue track after softmax normalization, which might disturb the correct association (blue track and blue detection). With the auxiliary token: the normalized attention becomes much more reasonable.
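
The normalization effect described in the Figure 4 caption can be reproduced with a small numeric sketch. The logit values below are invented for illustration, and modelling the auxiliary token as one extra column with a fixed logit of 0 is an assumption; how the paper parameterizes the token is not stated in this summary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Rows are detection queries, columns are track queries (made-up values).
logits = np.array([
    [4.0, -3.0],   # "blue" detection: strongly matches track 0 (the blue track)
    [-2.0, -3.0],  # "yellow" detection: no matching track, all logits are low
])

# Plain softmax over tracks: every row must sum to 1, so the yellow
# detection is forced to place most of its attention on some track.
plain = softmax(logits, axis=1)

# Append an auxiliary column (assumed fixed logit of 0) that can absorb
# attention when no track is a good match.
aux = np.zeros((logits.shape[0], 1))
with_aux = softmax(np.concatenate([logits, aux], axis=1), axis=1)

print(np.round(plain, 3))
print(np.round(with_aux, 3))
```

In this toy case the yellow detection's attention to the blue track drops from roughly 0.73 to roughly 0.11 once the auxiliary column is present, while the blue detection still attends almost entirely to its own track.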