Pedestrian Tracking with Monocular Camera using Unconstrained 3D Motion Model

Jan Krejčí, Oliver Kost, Ondřej Straka, Jindřich Duník

TL;DR

This work tackles monocular pedestrian tracking without imposing a ground-plane constraint by introducing a 3D state $\boldsymbol{x}^{\text{3D}}$ that combines position, velocity, and 3D extents (width $\omega$ and height $h$). A nonlinear measurement model arises from perspective projection, and an unscented Kalman filter (UKF) is used to fuse monocular detections into 3D estimates via a carefully designed AR process for the bounding-box extents. The contributions include a complete 3D state-space formulation, interpretable process parameters (means and time constants for width/height), and a numerically stable UKF implementation with initialization strategies; evaluation on MOT-17 demonstrates consistent 2D projections and plausible 3D trajectories, with ANEES near one and RMSE improvements over 2D baselines. The approach enables depth-aware pedestrian tracking from a single camera and provides a foundation for estimating the scene ground plane from tracked trajectories in future work.
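To illustrate what "means and time constants" can look like for an extent variable, the following is a minimal sketch of a discrete-time first-order autoregressive (Gauss-Markov) process. It is not taken from the paper: the function names, the exponential discretization, and the numeric values (mean height 1.7 m, standard deviation 0.1 m, time constant 5 s, 30 frames per second) are assumptions chosen only for illustration.

```python
import math
import random


def ar1_coefficients(tau, sigma, dt):
    """Return (a, q) for x[k+1] = mean + a * (x[k] - mean) + w[k], w ~ N(0, q).

    Assumed discretization: a = exp(-dt / tau), and q is chosen so that the
    stationary variance of the process equals sigma**2.
    """
    a = math.exp(-dt / tau)
    q = sigma ** 2 * (1.0 - a ** 2)
    return a, q


def ar1_step(x, mean, a, q, rng):
    """Advance the process by one sampling period."""
    return mean + a * (x - mean) + rng.gauss(0.0, math.sqrt(q))


def simulate_height(n_steps, mean=1.7, sigma=0.1, tau=5.0, dt=1.0 / 30.0, seed=0):
    """Simulate a height trajectory in meters (all defaults are hypothetical)."""
    rng = random.Random(seed)
    a, q = ar1_coefficients(tau, sigma, dt)
    x = mean + 3.0 * sigma  # start away from the mean to show the pull-back
    trajectory = [x]
    for _ in range(n_steps):
        x = ar1_step(x, mean, a, q, rng)
        trajectory.append(x)
    return trajectory
```

In this parameterization the mean sets the value the extent reverts to, and the time constant sets how quickly deviations decay, which is one way such parameters can stay physically interpretable.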

Abstract

A first-principle single-object model is proposed for pedestrian tracking. It is assumed that the extent of the moving object can be described via known statistics in 3D, such as pedestrian height. The proposed model thus need not constrain the object motion in 3D to a common ground plane, which is usual in 3D visual tracking applications. A nonlinear filter for this model is implemented using the unscented Kalman filter (UKF) and tested using the publicly available MOT-17 dataset. The proposed solution yields promising results in 3D while maintaining perfect results when projected into the 2D image. Moreover, the estimation error covariance matches the true one. Unlike conventional methods, the introduced model parameters have convenient meaning and can readily be adjusted for a problem.
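The reason known height statistics can stand in for a ground-plane constraint can be seen from an idealized pinhole camera: the apparent height of an object in the image shrinks in proportion to its depth. The sketch below is an assumption-laden illustration rather than the paper's measurement model; the function names, the focal length in pixels `f`, the principal point `(cx, cy)`, and the upright fronto-parallel segment are all hypothetical simplifications.

```python
def project_point(x, y, z, f, cx, cy):
    """Project a 3D point in camera coordinates (meters) to pixel coordinates.

    Assumes an ideal pinhole camera with the optical axis along +z.
    """
    if z <= 0.0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return f * x / z + cx, f * y / z + cy


def box_height_px(height_m, z, f):
    """Pixel height of an upright segment of height_m meters at depth z."""
    if z <= 0.0:
        raise ValueError("depth must be positive")
    return f * height_m / z


def depth_from_height(height_px, height_m, f):
    """Invert the relation above: depth implied by an assumed metric height."""
    if height_px <= 0.0:
        raise ValueError("pixel height must be positive")
    return f * height_m / height_px


# Usage: a 1.7 m tall pedestrian at 10 m depth, seen with f = 1000 px.
h_px = box_height_px(1.7, 10.0, 1000.0)
z_est = depth_from_height(h_px, 1.7, 1000.0)
```

If the assumed height is off by some percentage, the implied depth is off by the same percentage, which is consistent with the uncertainty being largest along the line of sight.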

Paper Structure (31 sections, 46 equations, 9 figures)

Figures (9)

  • Figure 1: Results of the proposed filter, including error covariance ellipses, compared to annotations. Faster R-CNN detections were used; none were available toward the end of the scenario. Notice that the uncertainty is largest in the direction of the line of sight.
  • Figure 2: Illustration of the geometric transformations for a point $\mathrm{O}$ in 3D. Camera coordinates are denoted with blue, the focal plane with green, and the image plane with red. Units of measurement are meters.
  • Figure 3: Illustration of the geometric transformation of a line segment (a height) $s_{\mathrm{O}}$ and a velocity $\dot{\mathbf{r}}_{\mathrm{C,O}}$ in 3D to the corresponding variables $s_{\mathrm{F}}$ and $\dot{\mathbf{r}}_{\mathrm{F,P}}$ in the focal plane.
  • Figure 4: Scenario \ref{sec:results:MOT17-02:id2}: estimated individual state variables of the proposed filter with standard deviation error intervals, semi-annotations, and ANEES with $\mathbf{x}_k=[x_k\ y_k\ z_k\ \omega_k\ h^\mathrm{g}]^{\top}$ using Faster R-CNN detections.
  • Figure 5: Scenario \ref{sec:results:MOT17-02:id2}: RMSE and ANEES with $\mathbf{x}_k=[\mathsf{x}_k\ \mathsf{y}_k\ \upomega_k\ \mathsf{h}_k]^{\top}$ and Faster R-CNN detections.
  • ...and 4 more figures

Theorems & Definitions (1)

  • Definition 1: Planar BB in 3D