One Homography is All You Need: IMM-based Joint Homography and Multiple Object State Estimation

Paul Johannes Claasen, Johan Pieter de Villiers

TL;DR

IMM-JHSE is an online multiple object tracking (MOT) framework that jointly estimates ground-plane target states and a per-track homography with an interacting multiple model (IMM) filter, using a single initial homography estimate as its only additional 3D information. Modelling the homography inside the track state decouples camera motion from target dynamics. For association, ground-plane Mahalanobis distances are blended with image-plane bounding-box (BIoU) scores, and the emphasis shifts between the two as targets move away from the ground plane. Dynamic estimation of both process and measurement noise is also used. IMM-JHSE improves on related methods on the DanceTrack and KITTI-car datasets and is competitive on the MOT17, MOT20 and KITTI-pedestrian datasets, all using publicly available detections. Overall, it narrows the gap between 2D and 3D MOT and between tracking-by-detection and tracking-by-attention approaches, exploiting ground-plane information without requiring regular 3D measurements.

Abstract

A novel online MOT algorithm, IMM Joint Homography State Estimation (IMM-JHSE), is proposed. IMM-JHSE uses an initial homography estimate as the only additional 3D information, whereas other 3D MOT methods use regular 3D measurements. By jointly modelling the homography matrix and its dynamics as part of track state vectors, IMM-JHSE removes the explicit influence of camera motion compensation techniques on predicted track position states, which was prevalent in previous approaches. Expanding upon this, static and dynamic camera motion models are combined using an IMM filter. A simple bounding box motion model is used to predict bounding box positions to incorporate image plane information. In addition to applying an IMM to camera motion, a non-standard IMM approach is applied where bounding-box-based BIoU scores are mixed with ground-plane-based Mahalanobis distances in an IMM-like fashion to perform association only, making IMM-JHSE robust to motion away from the ground plane. Finally, IMM-JHSE makes use of dynamic process and measurement noise estimation techniques. IMM-JHSE improves upon related techniques, including UCMCTrack, OC-SORT, C-BIoU and ByteTrack on the DanceTrack and KITTI-car datasets, increasing HOTA by 2.64 and 2.11, respectively, while offering competitive performance on the MOT17, MOT20 and KITTI-pedestrian datasets. Using publicly available detections, IMM-JHSE outperforms almost all other 2D MOT methods and is outperformed only by 3D MOT methods -- some of which are offline -- on the KITTI-car dataset. Compared to tracking-by-attention methods, IMM-JHSE shows remarkably similar performance on the DanceTrack dataset and outperforms them on the MOT17 dataset. The code is publicly available: https://github.com/Paulkie99/imm-jhse.
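
The two association cues named in the abstract can be illustrated with a minimal sketch. It is not taken from the IMM-JHSE code base: the function names, the `(x1, y1, x2, y2)` box layout, the image-to-ground direction of the homography and the buffer value of 0.3 are all assumptions made for illustration, and the buffered IoU follows the usual definition of growing each box in proportion to its own width and height before computing IoU.

```python
import math


def project_to_ground(h_img_to_ground, u, v):
    """Map an image point (u, v) to ground-plane coordinates with a 3x3 homography.

    `h_img_to_ground` is a hypothetical name; it is assumed to map image to ground.
    """
    h = h_img_to_ground
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if abs(w) < 1e-12:
        raise ValueError("point maps to infinity under this homography")
    return x / w, y / w  # divide by the homogeneous coordinate


def mahalanobis_2d(residual, cov):
    """Return sqrt(r^T S^-1 r) for a 2D residual r and a 2x2 covariance S."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0.0 or det <= 0.0:
        raise ValueError("covariance must be positive definite")
    rx, ry = residual
    # S^-1 = (1 / det) * [[d, -b], [-c, a]]
    quad = (rx * (d * rx - b * ry) + ry * (a * ry - c * rx)) / det
    return math.sqrt(quad)


def buffered_iou(box_a, box_b, buffer=0.3):
    """IoU after growing both (x1, y1, x2, y2) boxes by `buffer` times their size per side."""

    def grow(box):
        x1, y1, x2, y2 = box
        bw, bh = buffer * (x2 - x1), buffer * (y2 - y1)
        return x1 - bw, y1 - bh, x2 + bw, y2 + bh

    ax1, ay1, ax2, ay2 = grow(box_a)
    bx1, by1, bx2, by2 = grow(box_b)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0.0 else 0.0


# Ground-plane cue: distance between a predicted and a measured ground position.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
measured = project_to_ground(identity, 3.0, 4.0)
distance = mahalanobis_2d(measured, [[1.0, 0.0], [0.0, 1.0]])

# Image-plane cue: two boxes that do not touch still overlap once buffered.
score = buffered_iou((0, 0, 10, 10), (12, 0, 22, 10))
```

In this sketch the ground-plane cue depends only on the projected position and its covariance, while the image-plane cue depends only on box geometry, which is why the two can disagree when a target leaves the ground plane.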

Paper Structure

This paper contains 34 sections, 51 equations, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Illustration of the potential benefit of using ground plane location estimates. While the blue track occludes the red track in the image plane at time $t$, their ground plane locations are still easily separable.
  • Figure 2: The graphical model usually employed in modern MOT approaches. $\mathbf{x}^I_t$ represents the state of a bounding box element (x/y position, width or height), and $\mathbf{y}^I_t$ denotes the corresponding measurement, i.e. an element of the detected bounding box.
  • Figure 3: The graphical model representing ground plane and camera motion as independent stochastic processes that become dependent when conditioned on the observed bounding box measurement.
  • Figure 4: Showing how coupling bounding box predictions to ground plane state estimates may benefit re-identification. Green bounding boxes represent confirmed tracks, while red represents coasted tracks. The track ID is shown at the top-left above each bounding box. Blue dots represent the ground plane position of the corresponding confirmed track as projected into the image plane, while a red dot represents the same for a coasted track. By coupling the bounding box prediction for track 3 to its predicted ground plane position, it can be re-identified with the BIoU score. The predicted bounding box goes off-screen without this coupling, and the track cannot be re-identified.
  • Figure 5: A graphical representation of the proposed method. Dynamic and static camera motion models are combined with an IMM filter. During association, the normalised Mahalanobis distances from this filter and the corresponding BIoU scores are weighted proportionally as described in Section \ref{assoc_alg}. Since the states of the image and ground plane filters are not mixed, the state transitions between them are depicted with dashed lines. The transition probabilities are determined as explained in Section \ref{sec:exps}.
  • ...and 7 more figures
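
The caption of Figure 5 refers to transition probabilities and to a proportional weighting of ground-plane and image-plane scores. The sketch below shows the textbook IMM mode-probability update for two models, followed by a convex combination of the two association cues. The update is the standard one; the blend in `mixed_association_cost` and every name and numeric value here are hypothetical illustrations, not the weighting the paper specifies in its association section.

```python
def imm_mode_update(mode_probs, transition, likelihoods):
    """Standard IMM mode-probability update.

    mode_probs[i]    : prior probability of model i
    transition[i][j] : probability of switching from model i to model j
    likelihoods[j]   : measurement likelihood under model j
    """
    n = len(mode_probs)
    # Predicted mode probabilities after one Markov transition step.
    predicted = [
        sum(transition[i][j] * mode_probs[i] for i in range(n)) for j in range(n)
    ]
    # Bayes update with the per-model measurement likelihoods.
    unnormalised = [likelihoods[j] * predicted[j] for j in range(n)]
    total = sum(unnormalised)
    if total <= 0.0:
        raise ValueError("all model likelihoods are zero")
    return [value / total for value in unnormalised]


def mixed_association_cost(weight_ground, normalised_mahalanobis, biou_score):
    """Hypothetical blend: weight the ground-plane cost against (1 - BIoU).

    Both inputs are assumed to lie in [0, 1]; lower cost means a better match.
    """
    if not 0.0 <= weight_ground <= 1.0:
        raise ValueError("weight_ground must lie in [0, 1]")
    image_cost = 1.0 - biou_score
    return weight_ground * normalised_mahalanobis + (1.0 - weight_ground) * image_cost


# Two models (for example ground-plane and image-plane), made-up numbers.
probs = imm_mode_update(
    mode_probs=[0.5, 0.5],
    transition=[[0.9, 0.1], [0.1, 0.9]],
    likelihoods=[0.8, 0.2],
)
cost = mixed_association_cost(probs[0], normalised_mahalanobis=0.25, biou_score=0.5)
```

With these inputs the ground-plane model explains the measurement better, so its probability rises and the blended cost leans on the Mahalanobis term; a target leaving the ground plane would shift the weight towards the bounding-box term instead.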