ModTrack: Sensor-Agnostic Multi-View Tracking via Identity-Informed PHD Filtering with Covariance Propagation

Aditya Iyer, Jack Roberts, Nora Ayanian

Abstract

Multi-View Multi-Object Tracking (MV-MOT) aims to localize and maintain consistent identities of objects observed by multiple sensors. This task is challenging, as viewpoint changes and occlusion disrupt identity consistency across views and time. Recent end-to-end approaches address this by jointly learning 2D Bird's Eye View (BEV) representations and identity associations, achieving high tracking accuracy. However, these methods offer no principled uncertainty accounting and remain tightly coupled to their training configuration, limiting generalization across sensor layouts, modalities, or datasets without retraining. We propose ModTrack, a modular MV-MOT system that matches end-to-end performance while providing cross-modal, sensor-agnostic generalization and traceable uncertainty. ModTrack confines learning methods to just the \textit{Detection and Feature Extraction} stage of the MV-MOT pipeline, performing all fusion, association, and tracking with closed-form analytical methods. Our design reduces each sensor's output to calibrated position-covariance pairs $(\mathbf{z}, R)$; cross-view clustering and precision-weighted fusion then yield unified estimates $(\hat{\mathbf{z}}, \hat{R})$ for identity assignment and temporal tracking. A feedback-coupled, identity-informed Gaussian Mixture Probability Hypothesis Density (GM-PHD) filter with HMM motion modes uses these fused estimates to maintain identities under missed detections and heavy occlusion. ModTrack achieves 95.5 IDF1 and 91.4 MOTA on \textit{WildTrack}, surpassing all prior modular methods by over 21 points and rivaling the state-of-the-art end-to-end methods while providing deployment flexibility they cannot. Specifically, the same tracker core transfers unchanged to \textit{MultiviewX} and \textit{RadarScenes}, with only perception-module replacement required to extend to new domains and sensor modalities.
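
The precision-weighted fusion step mentioned in the abstract can be illustrated with the standard information-form combination of Gaussian estimates: $\hat{R} = \big(\sum_i R_i^{-1}\big)^{-1}$ and $\hat{\mathbf{z}} = \hat{R} \sum_i R_i^{-1} \mathbf{z}_i$. The sketch below is a minimal illustration under that assumption, not the authors' implementation; the function name `fuse_precision_weighted` and the example numbers are hypothetical.

```python
import numpy as np

def fuse_precision_weighted(zs, Rs):
    """Fuse per-sensor estimates (z_i, R_i) of one object in information form.

    Assumption: errors are independent across sensors, so precisions add.
    """
    infos = [np.linalg.inv(R) for R in Rs]               # precision matrices R_i^{-1}
    R_hat = np.linalg.inv(sum(infos))                    # fused covariance
    z_hat = R_hat @ sum(P @ z for P, z in zip(infos, zs))  # precision-weighted mean
    return z_hat, R_hat

# Two hypothetical views of the same object: each is confident along a different axis.
zs = [np.array([1.0, 2.0]), np.array([1.4, 1.8])]
Rs = [np.diag([0.04, 0.25]), np.diag([0.25, 0.04])]
z_hat, R_hat = fuse_precision_weighted(zs, Rs)
```

In this example each fused coordinate is pulled toward the view that is more certain along that axis, and the fused covariance is smaller than either input covariance.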

Paper Structure

This paper contains 42 sections, 30 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: ModTrack pipeline. Per-camera detections and appearance features (Stage 1) are projected to BEV with Jacobian-propagated covariance (Stage 2), associated via $\chi^2$ graph clustering (Stage 3), and precision-weighted into fused $(\hat{\mathbf{z}},\hat{R})$ pairs (Stage 4). An identity-informed GM-PHD filter (Stage 5) maintains persistent identities, with track predictions feeding back to guide identity assignment (dashed). Neural modules (blue) are confined to Stage 1; all downstream stages are purely analytical.
  • Figure 2: Predicted BEV tracks on WildTrack test set colored by identity. Ground truth (left) vs. ModTrack (right).
  • Figure 3: Precision-weighted fusion on WildTrack (frame 383). Each panel shows BEV covariance ellipses (95% confidence) as cameras are incrementally added. The rightmost panel shows the 6-camera shared world-plane.
  • Figure 4: IDF1 sensitivity to hyperparameter perturbation on WildTrack (joint mode). Each parameter is swept individually; all others held at defaults. Extended results are in Supp. Material D.
  • Figure 5: MOTA sensitivity to continuous clustering hyperparameters on WildTrack.
  • ...and 9 more figures
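
Stages 2 and 3 in the Figure 1 caption (Jacobian-propagated covariance and $\chi^2$ graph clustering) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the paper's method: the covariance is propagated to first order as $J R J^\top$, two detections are linked when their squared Mahalanobis distance under the summed covariance falls inside a 2-DOF $\chi^2$ gate, and clusters are taken as connected components. The function names, the same-camera exclusion rule, and the 95% gate are all hypothetical choices.

```python
import numpy as np

def propagate_covariance(J, R_px):
    """First-order propagation of pixel covariance to BEV: R_bev = J R_px J^T."""
    return J @ R_px @ J.T

def chi2_gate_2dof(p=0.95):
    """Chi-square quantile for 2 degrees of freedom (closed form: -2 ln(1 - p))."""
    return -2.0 * np.log(1.0 - p)

def cluster_detections(zs, Rs, cams, p=0.95):
    """Group BEV detections from different cameras via chi-square gating.

    Assumption: detections from the same camera are never linked directly.
    Connected components may still join them transitively; a full system
    would need an extra constraint to prevent that.
    """
    gate = chi2_gate_2dof(p)
    n = len(zs)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cams[i] == cams[j]:
                continue
            d = zs[i] - zs[j]
            d2 = d @ np.linalg.solve(Rs[i] + Rs[j], d)  # squared Mahalanobis distance
            if d2 <= gate:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Hypothetical example: two people, each seen by cameras 0 and 1.
R_bev = propagate_covariance(np.diag([0.01, 0.02]), np.diag([4.0, 9.0]))
zs = [np.array(v) for v in ([0.0, 0.0], [0.1, 0.05], [5.0, 5.0], [5.2, 4.9])]
Rs = [0.04 * np.eye(2)] * 4
clusters = cluster_detections(zs, Rs, cams=[0, 1, 0, 1])
```

Each resulting cluster holds the indices of detections believed to come from one object; those members are the natural inputs to a precision-weighted fusion step.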