SynCL: A Synergistic Training Strategy with Instance-Aware Contrastive Learning for End-to-End Multi-Camera 3D Tracking

Shubo Lin, Yutong Kou, Zirui Wu, Shaoru Wang, Bing Li, Weiming Hu, Jin Gao

TL;DR

The paper addresses optimization conflicts in end-to-end multi-camera 3D MOT that self-attention causes when detection and tracking share parameters. It introduces SynCL, a plug-and-play training strategy that adds a weight-shared parallel decoder without self-attention, combined with Task-specific Hybrid Matching, Dynamic Query Filtering, and Instance-aware Contrastive Learning, to train detection and tracking synergistically. Across multiple detectors on nuScenes, SynCL delivers consistent improvements and reaches a state-of-the-art $58.9\%$ AMOTA with no additional inference cost. The strategy generalizes across end-to-end camera-based 3D MOT systems and offers a reusable training paradigm for multi-task optimization.

Abstract

While existing query-based 3D end-to-end visual trackers integrate detection and tracking via the tracking-by-attention paradigm, these two chicken-and-egg tasks encounter optimization difficulties when sharing the same parameters. Our findings reveal that these difficulties arise due to two inherent constraints on the self-attention mechanism, i.e., over-deduplication for object queries and self-centric attention for track queries. In contrast, removing the self-attention mechanism not only minimally impacts regression predictions of the tracker, but also tends to generate more latent candidate boxes. Based on these analyses, we present SynCL, a novel plug-and-play synergistic training strategy designed to co-facilitate multi-task learning for detection and tracking. Specifically, we propose a Task-specific Hybrid Matching module for a weight-shared cross-attention-based decoder that matches the targets of track queries with multiple object queries to exploit promising candidates overlooked by the self-attention mechanism. To flexibly select optimal candidates for the one-to-many matching, we also design a Dynamic Query Filtering module controlled by model training status. Moreover, we introduce Instance-aware Contrastive Learning to break through the barrier of self-centric attention for track queries, effectively bridging the gap between detection and tracking. Without additional inference costs, SynCL consistently delivers improvements in various benchmarks and achieves state-of-the-art performance with $58.9\%$ AMOTA on the nuScenes dataset. Code and raw results will be publicly available.
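To make the idea of Instance-aware Contrastive Learning more concrete, the following is a minimal sketch, not the authors' implementation. It assumes an InfoNCE-style loss in which each track query is an anchor, object queries matched to the same ground-truth instance are positives, and all other object queries are negatives. The function names, the cosine similarity, and the `temperature` value are illustrative assumptions; the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def instance_contrastive_loss(track_emb, track_ids, object_emb, object_ids,
                              temperature=0.1):
    """InfoNCE-style loss between track and object query embeddings.

    Hypothetical formulation: for each track query, object queries that
    share its ground-truth id are positives; the rest are negatives.
    Track queries without any positive are skipped.
    """
    losses = []
    for t, tid in zip(track_emb, track_ids):
        logits = [cosine(t, o) / temperature for o in object_emb]
        positives = [i for i, oid in enumerate(object_ids) if oid == tid]
        if not positives:
            continue
        # Numerically stable log-sum-exp over all object queries.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Average negative log-softmax over the positives.
        loss = -sum(logits[i] - log_denom for i in positives) / len(positives)
        losses.append(loss)
    return sum(losses) / len(losses) if losses else 0.0


track_emb = [[1.0, 0.0], [0.0, 1.0]]
object_emb = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
aligned = instance_contrastive_loss(track_emb, [0, 1], object_emb, [0, 1, 0])
misaligned = instance_contrastive_loss(track_emb, [0, 1], object_emb, [1, 0, 1])
```

In this toy setup, `aligned` is smaller than `misaligned`, because the loss pulls each track query toward the object queries that describe the same instance.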

Paper Structure

This paper contains 13 sections, 13 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Comparison between the tracking-by-attention paradigm and our proposed plug-and-play training strategy, SynCL. SynCL consists of Task-specific Hybrid Matching and Instance-aware Contrastive Learning, powered by a Dynamic Query Filtering module.
  • Figure 2: We compare inference-stage results from the standard decoder in Fig. \ref{fig:vis_analysis}b with those from the decoder without self-attention in Fig. \ref{fig:vis_analysis}c. The self-attention mechanism exhibits over-deduplication for object queries and self-centric attention for track queries. Results are from the model trained without SynCL.
  • Figure 3: Analysis of the self-attention heatmap in the standard decoder. The annotation numbers in the heatmap are aligned with the ID numbers in Fig. \ref{fig:vis_analysis}.
  • Figure 4: Overview of SynCL. SynCL builds on trackers that follow the tracking-by-attention paradigm and uses two weight-shared parallel decoders: an S-decoder (standard decoder) and a C-decoder (devoid of self-attention layers). In the C-decoder, hybrid matching applies one-to-many assignment to object queries and one-to-one assignment to track queries. In addition, a dynamic filtering module flexibly selects reliable object queries for the one-to-many assignment. With identical ground-truth matching, contrastive learning unifies the representations of object and track queries, co-facilitating multi-task learning for detection and tracking.
  • Figure 5: Analysis of training time (h) and GPU memory (G). The inference speed remains unchanged.
  • ...and 1 more figure
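The hybrid matching described in the Figure 4 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm. Track queries are assumed to keep the identity-consistent one-to-one assignment typical of tracking-by-attention trackers. Object queries are assigned one-to-many using a hypothetical quality score matrix `scores[q][g]` (higher is better), a `threshold` standing in for the dynamic filter, and a cap `max_per_gt`. How the filter is actually driven by training status is not stated in this summary, so the threshold is simply passed in.

```python
def one_to_many_assign(scores, threshold, max_per_gt=4):
    """Assign object queries to ground-truth targets, several per target.

    scores[q][g] is a hypothetical matching quality between object query q
    and target g. Each query votes for its best target and is kept only if
    the score passes the threshold (a stand-in for dynamic filtering).
    """
    per_gt = {}
    for q, row in enumerate(scores):
        g = max(range(len(row)), key=row.__getitem__)
        if row[g] >= threshold:
            per_gt.setdefault(g, []).append((row[g], q))
    # Keep the highest-scoring queries for each target.
    return {g: [q for _, q in sorted(cands, reverse=True)[:max_per_gt]]
            for g, cands in per_gt.items()}


def hybrid_match(track_gt_ids, scores, threshold, max_per_gt=4):
    """Combine one-to-one (track) and one-to-many (object) assignments."""
    track_assign = list(enumerate(track_gt_ids))  # (track query, target) pairs
    object_assign = one_to_many_assign(scores, threshold, max_per_gt)
    return {"track": track_assign, "object": object_assign}


scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.1]]
result = hybrid_match(track_gt_ids=[0, 1], scores=scores, threshold=0.5)
# result["object"] == {0: [0, 1], 1: [2]}; query 3 is filtered out.
```

Raising `threshold` or lowering `max_per_gt` shrinks the set of object queries that share a target, which is the knob a training-status-dependent filter would presumably adjust.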