SynCL: A Synergistic Training Strategy with Instance-Aware Contrastive Learning for End-to-End Multi-Camera 3D Tracking

Shubo Lin; Yutong Kou; Zirui Wu; Shaoru Wang; Bing Li; Weiming Hu; Jin Gao

TL;DR

The paper addresses optimization conflicts in end-to-end multi-camera 3D multi-object tracking (MOT) that arise from self-attention when detection and tracking share parameters. It introduces SynCL, a plug-and-play training strategy that adds a weight-shared parallel decoder without self-attention and trains detection and tracking jointly through three components: Task-specific Hybrid Matching, Dynamic Query Filtering, and Instance-aware Contrastive Learning. Across multiple detectors on nuScenes, SynCL delivers consistent improvements and reaches a state-of-the-art $58.9\%$ AMOTA with no additional inference cost. As a training-only strategy, it offers a general paradigm for multi-task optimization in end-to-end camera-based 3D MOT systems.

Abstract

While existing query-based 3D end-to-end visual trackers integrate detection and tracking via the tracking-by-attention paradigm, these two chicken-and-egg tasks encounter optimization difficulties when sharing the same parameters. Our findings reveal that these difficulties arise due to two inherent constraints on the self-attention mechanism, i.e., over-deduplication for object queries and self-centric attention for track queries. In contrast, removing the self-attention mechanism not only minimally impacts regression predictions of the tracker, but also tends to generate more latent candidate boxes. Based on these analyses, we present SynCL, a novel plug-and-play synergistic training strategy designed to co-facilitate multi-task learning for detection and tracking. Specifically, we propose a Task-specific Hybrid Matching module for a weight-shared cross-attention-based decoder that matches the targets of track queries with multiple object queries to exploit promising candidates overlooked by the self-attention mechanism. To flexibly select optimal candidates for the one-to-many matching, we also design a Dynamic Query Filtering module controlled by model training status. Moreover, we introduce Instance-aware Contrastive Learning to break through the barrier of self-centric attention for track queries, effectively bridging the gap between detection and tracking. Without additional inference costs, SynCL consistently delivers improvements in various benchmarks and achieves state-of-the-art performance with $58.9\%$ AMOTA on the nuScenes dataset. Code and raw results will be publicly available.
