CC-3DT: Panoramic 3D Object Tracking via Cross-Camera Fusion

Tobias Fischer, Yung-Hsu Yang, Suryansh Kumar, Min Sun, Fisher Yu

TL;DR

CC-3DT tackles panoramic 3D object tracking by fusing detections from all vehicle-mounted cameras before data association, enabling cross-camera and temporal trajectory modeling. The approach introduces cross-camera association and cross-camera motion modeling via dual LSTM networks, leading to longer, more accurate trajectories and fewer identity switches. Empirical results on NuScenes and Waymo Open show state-of-the-art performance for camera-based 3D MOT, with notable gains in AMOTA, AMOTP, and IDS over strong baselines and MUTR3D. The method is detector-agnostic and demonstrates robust improvements across datasets, underscoring the value of multi-view fusion for reliable 3D perception in autonomous driving.

Abstract

To track the 3D locations and trajectories of the other traffic participants at any given time, modern autonomous vehicles are equipped with multiple cameras that cover the vehicle's full surroundings. Yet, camera-based 3D object tracking methods prioritize optimizing the single-camera setup and resort to post-hoc fusion in a multi-camera setup. In this paper, we propose a method for panoramic 3D object tracking, called CC-3DT, that associates and models object trajectories both temporally and across views, and improves the overall tracking consistency. In particular, our method fuses 3D detections from multiple cameras before association, reducing identity switches significantly and improving motion modeling. Our experiments on large-scale driving datasets show that fusion before association leads to a large margin of improvement over post-hoc fusion. We set a new state-of-the-art with 12.6% improvement in average multi-object tracking accuracy (AMOTA) among all camera-based methods on the competitive NuScenes 3D tracking benchmark, outperforming previously published methods by 6.5% in AMOTA with the same 3D detector.

Paper Structure

This paper contains 38 sections, 10 equations, 5 figures, and 8 tables.

Figures (5)

  • Figure 1: Method overview. We first perform 3D object detection and appearance feature extraction for each camera view independently. Then, we lift the camera-space detections into world-space and merge them using non-maximum suppression in 3D. Next, we associate the detections with existing tracks across all cameras. Finally, we refine the detections given the trajectories of the associated tracks across cameras.
  • Figure 2: Qualitative comparison. Different types of cross-camera association on the NuScenes validation split. We plot the 3D car states in camera and 3D views and depict car identity as color. CC-3DT provides a smooth and precise trajectory across cameras (green). Detect $\rightarrow$ Track $\rightarrow$ Merge shows inaccurate car localization (purple), while QD-3DT does not maintain car identity (violet).
  • Figure 3: Qualitative results of our tracker on the Waymo Open validation split. Note the consistent identity of objects moving along camera borders.
  • Figure 4: Qualitative results of our tracker on the front camera of the NuScenes validation split. Note the consistent identity of objects moving along camera borders.
  • Figure 5: Qualitative results of our tracker on the back cameras of the NuScenes validation split. Note the consistent identity of objects moving along camera borders.
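
The lift-and-merge step described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `lift_to_world` and `merge_detections`, the 4x4 camera-to-world pose matrices, and the 1 m threshold are all hypothetical. The suppression criterion is also an assumption. This section does not state how overlap is measured, so the sketch uses the distance between box centers as a simple stand-in for 3D non-maximum suppression.

```python
import numpy as np


def lift_to_world(centers_cam, cam_to_world):
    """Map N x 3 camera-frame box centers into the world frame.

    cam_to_world is an assumed 4 x 4 homogeneous camera-to-world pose.
    """
    ones = np.ones((len(centers_cam), 1))
    homogeneous = np.hstack([centers_cam, ones])      # N x 4
    return (cam_to_world @ homogeneous.T).T[:, :3]    # N x 3


def merge_detections(centers_world, scores, dist_thresh=1.0):
    """Greedy suppression over detections pooled from all cameras.

    Visits detections from highest to lowest score and keeps one only if
    its center is farther than dist_thresh (meters, assumed) from every
    detection already kept. Center distance stands in for 3D box overlap.
    """
    keep = []
    for i in np.argsort(-scores):
        far_from_kept = all(
            np.linalg.norm(centers_world[i] - centers_world[j]) > dist_thresh
            for j in keep
        )
        if far_from_kept:
            keep.append(int(i))
    return keep


# Two hypothetical cameras: A sits at the world origin, B is shifted 10 m along x.
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[:3, 3] = [10.0, 0.0, 0.0]

dets_a = np.array([[5.0, 0.0, 0.0], [20.0, 3.0, 0.0]])
dets_b = np.array([[-4.8, 0.1, 0.0], [2.0, 5.0, 0.0]])  # first one re-observes A's first car

# Pool all detections in the world frame before any association happens.
world = np.vstack([lift_to_world(dets_a, pose_a), lift_to_world(dets_b, pose_b)])
scores = np.array([0.9, 0.7, 0.6, 0.8])

print(merge_detections(world, scores))  # [0, 3, 1]
```

In this toy scene the first detection from camera B lands about 0.22 m from the first detection of camera A once both are in the world frame, so the lower-scoring copy is suppressed and three distinct objects remain. Merging before association means the tracker sees each object once per frame, regardless of how many cameras observed it.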