TQD-Track: Temporal Query Denoising for 3D Multi-Object Tracking

Shuxiao Ding, Yutong Yang, Julian Wiederer, Markus Braun, Peizheng Li, Juergen Gall, Bin Yang

TL;DR

This work tackles the limitation of static query denoising in 3D MOT by introducing Temporal Query Denoising (TQD) through a Temporal Denoising Query Generator (TDQ-Gen) that creates denoising queries from previous-frame ground-truths and propagates them to the current frame. The method injects temporal cues and instance-specific features into the MOT training process, employing self-attention and association masks to preserve realistic query interactions, and exploring dedicated, general, and hybrid denoising groups. Across Tracking-by-Attention (TBA), Tracking-by-Detection (TBD), and Alternating Detection and Association (ADA) paradigms, temporal denoising (especially with an explicit association module) yields higher AMOTA/MOTA and lower IDS, with ADA-Track + TQD-Track achieving state-of-the-art results on nuScenes. The approach augments training diversity without changing inference, improving robustness to temporal uncertainty and rare behaviors, and demonstrating practical impact for multi-view 3D MOT systems. The key technical contributions include TDQ-Gen, denoising group strategies (general, dedicated, hybrid), an association mask for learned data association, and extensive ablations validating the benefits of temporal denoising in MOT.

Abstract

Query denoising has become a standard training strategy for DETR-based detectors because it addresses their slow convergence. Query denoising can also increase the diversity of training samples for modeling complex scenarios, which is critical for Multi-Object Tracking (MOT) and suggests its potential for MOT applications. Existing approaches integrate query denoising within the tracking-by-attention paradigm. However, because the denoising process happens only within a single frame, it cannot help the tracker learn temporal information. In addition, the attention mask in query denoising prevents information exchange between denoising and object queries, limiting its potential to improve association through self-attention. To address these issues, we propose TQD-Track, which introduces Temporal Query Denoising (TQD) tailored for MOT, enabling denoising queries to carry temporal information and instance-specific feature representations. We add diverse noise types to the denoising queries to simulate real-world challenges in MOT. We analyze the proposed TQD for different tracking paradigms and find that paradigms with an explicit learned data association module, e.g. tracking-by-detection or alternating detection and association, benefit from TQD by a larger margin. For these paradigms, we further design an association mask in the association module to ensure that track and detection queries interact during training as they do during inference. Extensive experiments on the nuScenes dataset demonstrate that our approach consistently improves different tracking methods by changing only the training process, especially for paradigms with an explicit association module.

Paper Structure

This paper contains 34 sections, 12 equations, 3 figures, 9 tables.

Figures (3)

  • Figure 1: Comparison between static and temporal query denoising in MOT. Static denoising generates denoising (DN) queries from the ground truths (GTs) of frame $t$ by adding only geometric noise, tasking these DN queries with reconstructing the GTs at the same frame $t$. In contrast, our proposed temporal denoising generates DN queries from the GTs of the previous frame $t-1$ using our novel temporal denoising query generator (TDQ-Gen), which considers various temporal noise types. These DN queries are then propagated to the current frame $t$, where they aim to reconstruct their corresponding GTs at $t$ instead of $t-1$.
  • Figure 2: Overview of TQD-Track applied to a DETR-based tracker. For a frame $t-1$, the Temporal Denoising Query Generator (TDQ-Gen) generates several groups of denoising queries from the ground truth and adds various types of noise to them. The denoising queries are propagated to frame $t$, participate in the model as augmented input queries, and reconstruct their corresponding ground-truth instances at the current frame $t$. Depending on the tracking paradigm, we use a self-attention mask (b) and/or an association mask (c) to align the forward pass during training with inference. Gray grids in (b) and (c) denote blocked attention.
  • Figure A: AMOTA vs. portion of nuScenes training data used, obtained by scaling the size of the nuScenes training set.
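
The temporal pairing described in the Figure 1 caption (denoising queries built from the GTs at $t-1$, each supervised by the same instance's GT at $t$) can be illustrated with a minimal sketch. This is not the authors' TDQ-Gen: the names `Box` and `make_temporal_dn_queries`, the Gaussian center noise, the `num_groups` and `noise_std` parameters, and the choice to skip instances that are absent at frame $t$ are all assumptions made for illustration. The paper's generator also uses temporal noise types that this sketch does not model.

```python
import random
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical ground-truth record: a track identity and a 3D center."""
    instance_id: int
    center: tuple  # (x, y, z)


def make_temporal_dn_queries(prev_gts, curr_gts, num_groups=2,
                             noise_std=0.5, seed=0):
    """Build denoising queries from frame t-1 GTs with targets at frame t.

    Assumption: noise is plain Gaussian jitter on the box center, and
    instances missing at frame t are skipped.
    """
    rng = random.Random(seed)
    curr_by_id = {box.instance_id: box for box in curr_gts}
    queries = []
    for group in range(num_groups):
        for prev in prev_gts:
            target = curr_by_id.get(prev.instance_id)
            if target is None:
                continue  # no GT at frame t to reconstruct (sketch choice)
            noisy_center = tuple(
                c + rng.gauss(0.0, noise_std) for c in prev.center
            )
            queries.append({
                "group": group,
                "instance_id": prev.instance_id,
                "query_center": noisy_center,     # noised state from t-1
                "target_center": target.center,   # supervision from t
            })
    return queries
```

The key difference from static denoising is visible in the last two dictionary fields: the query is derived from the previous frame, while the reconstruction target comes from the current frame, so the model has to account for object motion between the two.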