CrossTracker: Robust Multi-modal 3D Multi-Object Tracking via Cross Correction

Lipeng Gu, Xuefeng Yan, Weiming Wang, Honghua Chen, Dingkun Zhu, Liangliang Nan, Mingqiang Wei

TL;DR

CrossTracker tackles robust online 3D MOT under dual-sensor detection failures by introducing a two-stage, coarse-to-fine framework. A central M$^3$ module fuses image, plane geometry, and point-cloud features to estimate cross-modal consistency, guiding a coarse trajectory generation stage (C-TG) followed by a trajectory refinement stage (TR) that performs cross-correction between camera and LiDAR streams. The approach achieves state-of-the-art results on KITTI, with substantial gains in HOTA and IDSW across detector configurations, and demonstrates the practical value of explicit trajectory refinement in multi-modal MOT. This work highlights the importance of bidirectional sensor collaboration and trajectory-centric refinement for robust 3D tracking in autonomous systems.

Abstract

The fusion of camera- and LiDAR-based detections offers a promising solution to mitigate tracking failures in 3D multi-object tracking (MOT). However, existing methods predominantly exploit camera detections to correct tracking failures caused by potential LiDAR detection problems, neglecting the reciprocal benefit of refining camera detections using LiDAR data. This limitation is rooted in their single-stage architecture, akin to single-stage object detectors, which lacks a dedicated trajectory refinement module to fully exploit the complementary multi-modal information. To this end, we introduce CrossTracker, a novel two-stage paradigm for online multi-modal 3D MOT. CrossTracker operates in a coarse-to-fine manner, initially generating coarse trajectories and subsequently refining them through an independent refinement process. Specifically, CrossTracker incorporates three essential modules: i) a multi-modal modeling (M$^3$) module that fuses multi-modal information (images, point clouds, and even plane geometry extracted from images) to provide a robust metric for subsequent trajectory generation; ii) a coarse trajectory generation (C-TG) module that generates initial coarse dual-stream trajectories; and iii) a trajectory refinement (TR) module that refines coarse trajectories through cross correction between camera and LiDAR streams. Comprehensive experiments demonstrate the superior performance of our CrossTracker over its eighteen competitors, underscoring its effectiveness in harnessing the synergistic benefits of camera and LiDAR sensors for robust multi-modal 3D MOT.
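
To make the coarse stage more concrete, here is a minimal sketch of how one stream's detections might be associated with its existing trajectories, yielding matched pairs plus the leftover detections and trajectories that a later refinement stage would examine. The function name `coarse_associate`, the greedy matching rule, and the `threshold` value are assumptions for illustration only; the abstract does not specify the assignment algorithm, and the similarity scores here simply stand in for the metric that M$^3$ provides.

```python
from typing import List, Sequence, Tuple


def coarse_associate(
    sim: Sequence[Sequence[float]],
    num_detections: int,
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> Tuple[List[Tuple[int, int]], List[int], List[int]]:
    """Greedily match trajectories (rows) to detections (columns).

    sim[i][j] is an assumed similarity score between trajectory i and
    detection j. Returns (matches, unmatched_detections,
    unmatched_trajectories), all expressed as indices.
    """
    num_trajectories = len(sim)

    # Collect every pair whose score passes the threshold, best first.
    candidates = sorted(
        (
            (sim[i][j], i, j)
            for i in range(num_trajectories)
            for j in range(num_detections)
            if sim[i][j] >= threshold
        ),
        reverse=True,
    )

    used_traj, used_det = set(), set()
    matches: List[Tuple[int, int]] = []
    for _score, i, j in candidates:
        # Each trajectory and each detection may be used at most once.
        if i in used_traj or j in used_det:
            continue
        matches.append((i, j))
        used_traj.add(i)
        used_det.add(j)

    unmatched_det = [j for j in range(num_detections) if j not in used_det]
    unmatched_traj = [i for i in range(num_trajectories) if i not in used_traj]
    return matches, unmatched_det, unmatched_traj
```

Running the same routine once on the camera stream and once on the LiDAR stream would give the dual-stream coarse output; an optimal assignment solver could replace the greedy loop without changing the three-way partition.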

Paper Structure

This paper contains 28 sections, 3 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Main difference between CrossTracker and its competitors. Prior methods [eagermot, deepfusionmot, sfmot] typically categorize input detections into three sets using data fusion, and then process these sets sequentially based on spatial geometric constraints (SGC, e.g., 3D-IoU). They start with LiDAR-camera detections, followed by LiDAR-only detections, and finally use camera-only detections to correct tracking failures caused by potential LiDAR detection problems. However, their single-stage architecture cannot correct tracking failures in the camera stream. In contrast, CrossTracker, the first online two-stage 3D MOT solution, addresses intricate tracking failures in both streams. It leverages multi-modal modeling (M$^3$), encompassing image, geometric, and point cloud feature modeling (IFM, GFM, and PFM), followed by a two-stage tracking pipeline consisting of coarse trajectory generation (C-TG) and trajectory refinement (TR).
  • Figure 2: The advantages and disadvantages of existing data fusion-based multi-modal 3D MOT methods [eagermot, deepfusionmot, sfmot]. These methods can effectively address (a) false detections in either the camera or LiDAR stream and (b) missing detections solely in the LiDAR stream. However, they are limited in their ability to handle (c) missing detections solely in the camera stream and (d) missing detections in both camera and LiDAR streams. In contrast, our CrossTracker draws on the coarse-to-fine concept, addressing all four of these challenges through a two-stage tracking scheme.
  • Figure 3: Overview of CrossTracker. It is the first two-stage multi-modal 3D MOT framework comprising three essential modules: a multi-modal modeling (M$^3$) module, a coarse trajectory generation (C-TG, i.e., stage-1) module, and a trajectory refinement (TR, i.e., stage-2) module. At each frame after the initial frame (e.g., frame $t-1$), trajectories within the camera or LiDAR stream can be categorized into matched trajectories ($T^{k}_{t-1}$), unmatched detections ($UD^{k}_{t-1}$), and unmatched trajectories ($UT^{k}_{t-1}$), where $k \in \{c, l\}$ denotes the camera or LiDAR stream, respectively. Based on this, CrossTracker streamlines the two-stage 3D MOT problem by addressing these sets from both streams. Specifically, M$^3$ is designed with an image feature model (IFM), a point cloud feature model (PFM), and a plane geometric feature model (GFM), coupled with a classifier, to establish a robust metric for subsequent trajectory generation. Given the input camera detections ($D^{c}_{t}$) at frame $t$, C-TG initially associates $D^{c}_{t}$ with these sets ($T^{c}_{t-1}$, $UD^{c}_{t-1}$, and $UT^{c}_{t-1}$) at frame $t-1$, respectively. The same procedure is applied to the input LiDAR detections ($D^{l}_{t}$) at frame $t$. Following this, both streams yield updated matched trajectories ($T^{c}_{t}$ and $T^{l}_{t}$), remaining unmatched detections ($UD^{c}_{t}$ and $UD^{l}_{t}$), and remaining unmatched trajectories ($UT^{c}_{t-1}$ and $UT^{l}_{t-1}$) at frame $t$. Subsequently, TR implements cross correction to address tracking failures arising from potential false detection problems in $UD^{c}_{t}$ and $UD^{l}_{t}$ as well as missed detection problems in $UT^{l}_{t-1}$ and $UT^{c}_{t-1}$. Finally, high-quality 3D trajectories ($T^{o}_{t}$) are output by matching the updated $T^{l}_{t}$ with the updated $T^{c}_{t}$, $UD^{c}_{t}$, and $UT^{c}_{t}$.
  • Figure 4: Overview of the M$^3$ module. It takes as input two consecutive frames of the image and point cloud, along with their corresponding detections. It independently outputs the consistency probabilities (similarity scores) of two objects for the camera ($S^{c}$) and LiDAR ($S^{l}$) scenarios. M$^3$ comprises four primary components: the image feature modeling (IFM) module for image features ($F_{img}$), the plane geometric feature modeling (GFM) module for plane geometry features ($F_{pg}$), the point cloud feature modeling (PFM) module for point cloud features ($F_{pc}$), and the camera- and LiDAR-modal classifiers for estimating consistency probabilities. Furthermore, SGC incorporates spatial geometric constraints, such as 3D centroid distance, during inference to further refine the consistency probability output by the classifiers.
  • Figure 5: Five cases of cross correction in TR. The camera and LiDAR streams each exhibit a trajectory. Solid shapes (squares for camera, cubes for LiDAR) represent detected objects, while dashed shapes indicate missed detections. (a) and (b) identify newly appearing objects from unmatched LiDAR detections in $UD^{l}_{t}$ using camera data as a reference, with a focus on minimizing false detections. (c) and (d) correct unmatched camera or LiDAR trajectories in $UT^{c}_{t-1}$ or $UT^{l}_{t-1}$ using the other modality. (e) corrects unmatched trajectories in $UT^{c}_{t-1}$ and $UT^{l}_{t-1}$ that may result from simultaneous misses in both modalities.
  • ...and 2 more figures
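
The Figure 4 caption notes that SGC refines the classifier's consistency probability at inference time using constraints such as 3D centroid distance. One simple way this could work is sketched below, assuming a hard distance gate. The function name `refine_with_sgc`, the gating rule, and the `max_dist` default are hypothetical and not taken from the paper, which may combine the two signals differently.

```python
import math
from typing import Sequence


def refine_with_sgc(
    prob: float,
    center_a: Sequence[float],
    center_b: Sequence[float],
    max_dist: float = 2.0,  # hypothetical gate in metres, not from the paper
) -> float:
    """Suppress a consistency probability when two 3D centroids are far apart.

    prob is the assumed classifier output in [0, 1]; center_a and center_b
    are (x, y, z) centroids of the two objects being compared.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must lie in [0, 1]")

    # Euclidean distance between the two 3D centroids.
    distance = math.dist(center_a, center_b)

    # Hard gate: keep the score only if the objects are spatially plausible.
    return prob if distance <= max_dist else 0.0


# Nearby objects keep their score; distant ones are rejected.
near = refine_with_sgc(0.9, (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
far = refine_with_sgc(0.9, (0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
```

A soft alternative would scale the probability by a decreasing function of distance instead of zeroing it; the hard gate is used here only because it is the easiest variant to reason about.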