
Track Initialization and Re-Identification for 3D Multi-View Multi-Object Tracking

Linh Van Ma, Tran Thien Dat Nguyen, Ba-Ngu Vo, Hyunsung Jang, Moongu Jeon

TL;DR

This work tackles 3D multi-object tracking from 2D detections produced by monocular cameras. The goal is to automatically initialize and terminate tracks, resolve track disappearance-reappearance, and handle occlusions, without retraining detectors when the camera setup changes. It introduces a Bayesian MV-MOT framework built on a GLMB-based approximation (MV-GLMB-AB) with a high-fidelity occlusion model and an adaptive birth process, which together realize track initiation, re-identification, and occlusion handling online. Experimental results on the WILDTRACK (WT) and CMC datasets show significant performance gains over existing MV-MOT methods, robust operation under on-the-fly camera reconfigurations, and competitive results against ideal detectors. The approach uses per-camera detections and appearance features to improve data association, which in turn reduces the number of terms in the filtering density and makes online 3D tracking from monocular video streams tractable. The work also demonstrates how TT-track relabeling and feature-based similarity can support re-identification, contributing to practical, scalable 3D MOT in multi-camera setups.
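To make the idea of an adaptive birth process concrete, the sketch below follows the adaptive-birth heuristic commonly used with labeled multi-Bernoulli filters: a measurement that was poorly explained by existing tracks at the previous step seeds a birth track with a larger existence probability. This is an illustrative assumption, not the MV-GLMB-AB birth model itself (which works on multi-camera detections and features). The function name and the parameters `expected_births` and `r_max` are hypothetical.

```python
from typing import List


def adaptive_birth_probabilities(
    assoc_probs: List[float],
    expected_births: float = 0.1,  # hypothetical expected number of new objects per frame
    r_max: float = 0.9,            # hypothetical cap on a single birth's existence probability
) -> List[float]:
    """Existence probability of the birth track seeded by each measurement.

    assoc_probs[i] is the probability that measurement i was associated with
    an existing track at the previous time step. Measurements that existing
    tracks explain poorly receive a larger share of the expected births.
    """
    # Probability that each measurement was NOT explained by an existing track.
    unassoc = [1.0 - p for p in assoc_probs]
    total = sum(unassoc)
    if total <= 0.0:
        # Every measurement is fully explained: no births are proposed.
        return [0.0 for _ in assoc_probs]
    # Distribute the expected births proportionally, capped at r_max.
    return [min(r_max, expected_births * u / total) for u in unassoc]
```

For example, `adaptive_birth_probabilities([1.0, 0.0, 0.5], expected_births=0.3)` gives roughly `[0.0, 0.2, 0.1]`: the fully associated measurement spawns no birth, while the unexplained one receives the largest share. Because births are driven by the data rather than fixed locations, new and reappearing objects can be initiated anywhere in the scene.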

Abstract

We propose a 3D multi-object tracking (MOT) solution using only 2D detections from monocular cameras, which automatically initiates/terminates tracks as well as resolves track disappearance-reappearance and occlusions. Moreover, this approach does not require detector retraining when cameras are reconfigured; only the camera matrices of the reconfigured cameras need to be updated. Our approach is based on a Bayesian multi-object formulation that integrates track initiation/termination, re-identification, occlusion handling, and data association into a single Bayes filtering recursion. However, the exact filter that incorporates all these functionalities is numerically intractable due to the exponentially growing number of terms in the (multi-object) filtering density, while existing approximations trade off some of these functionalities for speed. To this end, we develop a more efficient approximation suitable for online MOT by incorporating object features and kinematics into the measurement model, which improves data association and subsequently reduces the number of terms. Specifically, we exploit the 2D detections and extracted features from multiple cameras to provide a better approximation of the multi-object filtering density, realizing the track initiation/termination and re-identification functionalities. Further, a tractable geometric occlusion model based on 2D projections of 3D objects onto the camera planes realizes the occlusion handling functionality of the filter. Evaluation of the proposed solution on challenging datasets demonstrates significant improvements over existing multi-view MOT solutions, as well as robustness when camera configurations change on-the-fly. The source code is publicly available at https://github.com/linh-gist/mv-glmb-ab.
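Why is updating the camera matrices enough when a camera is moved? Because objects are tracked in 3D and only mapped into each view through that view's 3x4 matrix P. The sketch below shows this mapping with a standard pinhole model. As a simplifying assumption, it approximates an object's image footprint by projecting the eight corners of an axis-aligned 3D cuboid and taking their 2D extent. This is not necessarily how the paper projects its ellipsoids. The function names are hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple

Matrix34 = Sequence[Sequence[float]]  # 3x4 camera matrix P = K [R | t]
Point3 = Tuple[float, float, float]


def project_point(P: Matrix34, X: Point3) -> Tuple[float, float]:
    """Project a 3D world point to pixel coordinates with camera matrix P."""
    Xh = (X[0], X[1], X[2], 1.0)  # homogeneous world coordinates
    u, v, w = (sum(P[r][c] * Xh[c] for c in range(4)) for r in range(3))
    if w <= 0.0:
        raise ValueError("point is behind the camera")
    return u / w, v / w


def project_box(P: Matrix34, center: Point3, half_extents: Point3) -> Tuple[float, float, float, float]:
    """Image bounding box (x_min, y_min, x_max, y_max) of a 3D axis-aligned cuboid."""
    corners = [
        tuple(c + s * h for c, s, h in zip(center, signs, half_extents))
        for signs in product((-1.0, 1.0), repeat=3)
    ]
    pts = [project_point(P, X) for X in corners]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    return min(xs), min(ys), max(xs), max(ys)
```

With a camera at the origin looking along +Z, focal length 100, and principal point (50, 50), i.e. `P = [[100, 0, 50, 0], [0, 100, 50, 0], [0, 0, 1, 0]]`, the point `(0, 0, 10)` projects to `(50.0, 50.0)`. Reconfiguring that camera only means passing a different `P`. The 3D tracker and the 2D detector are untouched.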

Paper Structure

This paper contains 26 sections, 24 equations, 8 figures, 9 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Schematic of the proposed 3D MV-MOT solution. Multi-view detections (bounding boxes and visual features from all cameras) are supplied to the MV-MOT filter, which integrates multi-object dynamic and measurement models to realize all MOT functionalities.
  • Figure 2: Schematic of the proposed multi-view MOT filter, with Adaptive Birth Model and Occlusion Model that realize MOT functionalities.
  • Figure 3: For illustration, tracks are indexed from closest to furthest from the camera. Track 4 does not overlap with any other track and thus has the maximum detection probability. Tracks 1 and 2 overlap with other tracks but are closer to the camera (i.e., their bottom edges are lower in the image), hence they also have the maximum detection probability. Track 6 overlaps with track 5, but track 5 has the higher detection probability because it is closer to the camera.
  • Figure 4: 3D ellipsoid estimates from the proposed MV-GLMB-AB filter using CSTrack detection inputs. Top row: CMC5 dataset at frame 470 for cameras 2 and 3. Bottom row: WT dataset at frame 25 for cameras 2 and 5 (only objects inside the red boundary are considered). The first two columns show the projected 3D estimates on the respective camera planes, and the last column shows the 3D estimates. Each color corresponds to a unique object ID. Videos are also provided in the supplementary materials.
  • Figure 5: Track re-identification: 3D ellipsoid estimates from the MV-GLMB-AB filter using CSTrack detections. Object disappearance-reappearance (in CMC) is simulated by turning off all cameras mid-scene for 30 frames. Top row: CMC2, all cameras off from frames 130 to 160. Middle row: CMC3, all cameras off from frames 131 to 161. Bottom row: CMC5, all cameras off from frames 280 to 310. Columns 1 and 2 show the estimates, projected on camera 2 and in 3D, before all cameras are turned off. Columns 3 and 4 show the estimates 5 frames after all cameras are turned back on.
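The depth-ordering logic described in the Figure 3 caption can be sketched as follows. The sketch makes several assumptions. A box whose bottom edge is lower in the image is treated as closer to the camera. A track keeps the maximum detection probability unless a closer track's box covers part of it. In that case the probability is scaled by the uncovered fraction of its own box, using the largest single overlap. This scaling rule, the value of `p_max`, and the function names are hypothetical illustrations, not the paper's occlusion model.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max); image y grows downward


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> float:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def detection_probabilities(boxes: Dict[int, Box], p_max: float = 0.95) -> Dict[int, float]:
    """Occlusion-aware detection probability per track (hypothetical rule)."""
    # Order tracks from closest (lowest bottom edge, i.e. largest y_max) to furthest.
    order = sorted(boxes, key=lambda k: boxes[k][3], reverse=True)
    probs: Dict[int, float] = {}
    for i, k in enumerate(order):
        own = area(boxes[k])
        if own <= 0.0:
            probs[k] = 0.0  # degenerate box: treat as undetectable
            continue
        # Largest fraction of this box covered by any single closer box.
        covered = max(
            (intersection(boxes[k], boxes[j]) / own for j in order[:i]),
            default=0.0,
        )
        probs[k] = p_max * (1.0 - min(covered, 1.0))
    return probs
```

Mirroring the caption: a track with no overlap, or one that overlaps others but is closest, keeps `p_max`. A further track that is half covered by a closer one drops to `0.5 * p_max`.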