CC-3DT: Panoramic 3D Object Tracking via Cross-Camera Fusion
Tobias Fischer, Yung-Hsu Yang, Suryansh Kumar, Min Sun, Fisher Yu
TL;DR
CC-3DT tackles panoramic 3D object tracking by fusing 3D detections from all vehicle-mounted cameras before data association, so that trajectories are modeled both across cameras and over time. The approach introduces cross-camera association and cross-camera motion modeling via dual LSTM networks, leading to longer, more accurate trajectories and fewer identity switches. Experiments on NuScenes and Waymo Open show state-of-the-art performance for camera-based 3D multi-object tracking (MOT), with improvements in AMOTA, AMOTP, and identity switches (IDS) over strong baselines and MUTR3D. The method is detector-agnostic and improves results on both datasets, underscoring the value of multi-view fusion for reliable 3D perception in autonomous driving.
Abstract
To track the 3D locations and trajectories of other traffic participants at any given time, modern autonomous vehicles are equipped with multiple cameras that cover the vehicle's full surroundings. Yet existing camera-based 3D object tracking methods prioritize optimizing the single-camera setup and resort to post-hoc fusion in a multi-camera setup. In this paper, we propose a method for panoramic 3D object tracking, called CC-3DT, that associates and models object trajectories both temporally and across views, and improves the overall tracking consistency. In particular, our method fuses 3D detections from multiple cameras before association, reducing identity switches significantly and improving motion modeling. Our experiments on large-scale driving datasets show that fusion before association leads to a large margin of improvement over post-hoc fusion. We set a new state-of-the-art with a 12.6% improvement in average multi-object tracking accuracy (AMOTA) among all camera-based methods on the competitive NuScenes 3D tracking benchmark, outperforming previously published methods by 6.5% in AMOTA with the same 3D detector.
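The idea of fusing detections before association can be illustrated with a minimal Python sketch. This is not the CC-3DT implementation. The names `Detection`, `fuse_detections`, and `associate`, and the thresholds `dup_radius` and `max_dist`, are hypothetical and chosen for illustration only. The sketch assumes that all detections are already expressed in one shared vehicle or world frame, and it substitutes simple center-distance duplicate removal and greedy nearest-center matching for the learned association and LSTM motion modeling described above.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    cam: str        # camera that produced the detection
    center: tuple   # (x, y, z) in a shared frame, in metres (assumed)
    score: float    # detector confidence


def fuse_detections(per_camera, dup_radius=1.0):
    """Pool detections from all cameras and drop cross-camera duplicates.

    Detections are visited in order of decreasing score. A detection is
    dropped if a higher-scoring detection from a *different* camera lies
    within dup_radius of it (hypothetical threshold).
    """
    pooled = sorted(
        (d for dets in per_camera.values() for d in dets),
        key=lambda d: d.score,
        reverse=True,
    )
    kept = []
    for det in pooled:
        is_dup = any(
            k.cam != det.cam and math.dist(k.center, det.center) < dup_radius
            for k in kept
        )
        if not is_dup:
            kept.append(det)
    return kept


def associate(tracks, detections, max_dist=2.0):
    """Greedy nearest-center matching of track ids to fused detections.

    tracks maps a track id to its predicted center in the shared frame.
    Returns a dict {track_id: detection_index}.
    """
    pairs = sorted(
        (math.dist(pos, det.center), tid, j)
        for tid, pos in tracks.items()
        for j, det in enumerate(detections)
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for dist, tid, j in pairs:
        if dist > max_dist:
            break  # remaining pairs are even farther apart
        if tid in used_tracks or j in used_dets:
            continue
        matches[tid] = j
        used_tracks.add(tid)
        used_dets.add(j)
    return matches


# One object sits in the overlap of two cameras and is detected twice.
per_camera = {
    "front": [Detection("front", (10.0, 2.0, 0.0), 0.9)],
    "front_right": [
        Detection("front_right", (10.3, 2.1, 0.0), 0.6),
        Detection("front_right", (5.0, -8.0, 0.0), 0.8),
    ],
}
fused = fuse_detections(per_camera)   # duplicates removed first
tracks = {7: (9.5, 2.0, 0.0)}         # one existing track
matches = associate(tracks, fused)    # then a single association step
```

Because duplicates are removed before matching, the object in the camera overlap yields one fused detection and one track. Associating per camera and merging afterwards could instead create two competing tracks for the same object, which is the kind of identity switch that fusion before association is meant to avoid.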
