CrossTracker: Robust Multi-modal 3D Multi-Object Tracking via Cross Correction
Lipeng Gu, Xuefeng Yan, Weiming Wang, Honghua Chen, Dingkun Zhu, Liangliang Nan, Mingqiang Wei
TL;DR
CrossTracker addresses online 3D MOT that stays robust when either the camera or the LiDAR detector fails, using a two-stage, coarse-to-fine framework. A multi-modal modeling ($M^3$) module fuses image, plane-geometry, and point-cloud features to estimate cross-modal consistency. This estimate guides a coarse trajectory generation (C-TG) stage, which is followed by a trajectory refinement (TR) stage that performs cross correction between the camera and LiDAR streams. The approach achieves state-of-the-art results on KITTI, improving both HOTA and IDSW across detector configurations, and shows the practical value of an explicit trajectory refinement stage in multi-modal MOT. The work highlights the importance of bidirectional sensor collaboration and trajectory-centric refinement for robust 3D tracking in autonomous systems.
Abstract
The fusion of camera- and LiDAR-based detections offers a promising way to mitigate tracking failures in 3D multi-object tracking (MOT). However, existing methods predominantly use camera detections to correct tracking failures caused by LiDAR detection problems, neglecting the reciprocal benefit of refining camera detections with LiDAR data. This limitation is rooted in their single-stage architecture, which, like single-stage object detectors, lacks a dedicated trajectory refinement module to fully exploit the complementary multi-modal information. To this end, we introduce CrossTracker, a novel two-stage paradigm for online multi-modal 3D MOT. CrossTracker operates in a coarse-to-fine manner, first generating coarse trajectories and then refining them through an independent refinement process. Specifically, CrossTracker incorporates three essential modules: i) a multi-modal modeling ($M^3$) module that fuses multi-modal information (images, point clouds, and plane geometry extracted from images) to provide a robust metric for subsequent trajectory generation; ii) a coarse trajectory generation (C-TG) module that generates initial coarse dual-stream trajectories; and iii) a trajectory refinement (TR) module that refines the coarse trajectories through cross correction between the camera and LiDAR streams. Comprehensive experiments demonstrate the superior performance of CrossTracker over eighteen competitors, underscoring its effectiveness in harnessing the synergistic benefits of camera and LiDAR sensors for robust multi-modal 3D MOT.
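
To make the coarse-to-fine idea concrete, the following is a minimal Python sketch of how the three stages could fit together. It is not the authors' implementation. Every name here (`fused_affinity`, `greedy_associate`, `cross_correct`), the fixed cue weights, the greedy matching rule, and the gap-filling refinement are assumptions chosen for illustration. In particular, the paper's $M^3$ module is described as a fusion of multi-modal features, which this sketch replaces with a plain weighted mean of three precomputed similarity scores.

```python
def fused_affinity(image_sim, plane_iou, cloud_sim, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Hypothetical stand-in for a cross-modal consistency score.

    Combines three cues, each assumed to lie in [0, 1], with a weighted mean.
    """
    w_img, w_plane, w_cloud = weights
    return w_img * image_sim + w_plane * plane_iou + w_cloud * cloud_sim


def greedy_associate(affinity, threshold=0.5):
    """Coarse stage (assumed): greedy one-to-one track/detection matching.

    `affinity` maps (track_id, det_id) pairs to scores. Pairs are visited
    from highest to lowest score; pairs below `threshold` are rejected.
    """
    matches = {}
    used_dets = set()
    ranked = sorted(affinity.items(), key=lambda item: item[1], reverse=True)
    for (track_id, det_id), score in ranked:
        if score < threshold:
            break  # all remaining pairs score lower still
        if track_id in matches or det_id in used_dets:
            continue
        matches[track_id] = det_id
        used_dets.add(det_id)
    return matches


def cross_correct(camera_frames, lidar_frames):
    """Refinement stage (assumed): bidirectional gap filling.

    Each argument maps frame index -> observation for one trajectory.
    Frames missing in one stream are filled from the other stream, in both
    directions; existing observations are never overwritten.
    """
    camera_refined = dict(camera_frames)
    lidar_refined = dict(lidar_frames)
    for frame, obs in lidar_frames.items():
        camera_refined.setdefault(frame, obs)  # LiDAR corrects camera
    for frame, obs in camera_frames.items():
        lidar_refined.setdefault(frame, obs)  # camera corrects LiDAR
    return camera_refined, lidar_refined


if __name__ == "__main__":
    scores = {
        ("t1", "d1"): fused_affinity(0.9, 0.9, 0.9),
        ("t1", "d2"): fused_affinity(0.6, 0.6, 0.6),
        ("t2", "d1"): fused_affinity(0.8, 0.8, 0.8),
        ("t2", "d2"): fused_affinity(0.4, 0.4, 0.4),
    }
    print(greedy_associate(scores))
    print(cross_correct({0: "cam0", 1: "cam1"}, {1: "lid1", 2: "lid2"}))
```

The point of the sketch is the symmetry in `cross_correct`: the camera stream is repaired from LiDAR just as the LiDAR stream is repaired from the camera, which is the bidirectional behaviour the abstract contrasts with camera-to-LiDAR-only correction.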
