
CAMOT: Camera Angle-aware Multi-Object Tracking

Felix Limanta, Kuniaki Uto, Koichi Shinoda

TL;DR

CAMOT tackles occlusion and depth misestimation in multi-object tracking by estimating a camera elevation angle from object detections under a flat-plane assumption and computing per-object depths to enable pseudo-3D MOT. It jointly optimizes a common plane and the camera angle using an iterative, Nelder–Mead-based process, followed by temporal smoothing, and integrates depth-aware coordinates into a 3D-aware Kalman filter and a camera-angle–aware association metric. When plugged into 2D MOT systems like ByteTrack, CAMOT achieves state-of-the-art HOTA, MOTA, and IDF1 on MOT17 and MOT20 with real-time performance and substantially lower computational cost than monocular depth estimators. The method is lightweight, extensible to other trackers, and offers a practical path toward robust tracking in surveillance scenarios where depth cues are scarce or expensive to compute.

Abstract

This paper proposes CAMOT, a simple camera angle estimator for multi-object tracking (MOT) that tackles two problems: 1) occlusion and 2) inaccurate distance estimation in the depth direction. Under the assumption that multiple objects are located on a flat plane in each video frame, CAMOT estimates the camera angle using object detection. In addition, it gives the depth of each object, enabling pseudo-3D MOT. We evaluated its performance by adding it to various 2D MOT methods on the MOT17 and MOT20 datasets and confirmed its effectiveness. Applying CAMOT to ByteTrack, we obtained 63.8% HOTA, 80.6% MOTA, and 78.5% IDF1 on MOT17, which are state-of-the-art results. Its computational cost is significantly lower than that of existing deep-learning-based depth estimators for tracking.

Paper Structure (25 sections, 11 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: Illustration of the idea of CAMOT. Under the assumption that multiple objects are located on a flat plane, the camera angle is estimated using object detection. The scale of each bounding box indicates the depth of each object, whereas the distribution of the bounding boxes informs us of the camera angle.
  • Figure 2: 2D planar side view of the system. Black parts show the part of the system shared by all objects, whereas blue and red parts show different objects.
  • Figure 3: 2D planar side view for one object. Black parts show the part of the system shared by all objects, while blue parts show components unique to object $i$. Green parts show derived points, angles, etc., used for calculation.
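
The side-view geometry in Figures 2 and 3 can be illustrated with a minimal sketch. This is not the paper's formulation: it assumes a pinhole camera with a known height `cam_h`, focal length `f` (in pixels), and principal-point row `cy`, and it assumes every object has the same known height `obj_h` that is below the camera height. All function and parameter names are hypothetical. Instead of the joint Nelder–Mead optimization described in the TL;DR, it uses a plain 1D grid search followed by ternary-search refinement over the elevation angle, which is enough to show how the top and bottom rows of bounding boxes constrain the camera angle and how depth follows once the angle is known.

```python
import math

def project_row(theta, cam_h, f, cy, dist, height):
    """Image row of a point `height` above the plane at horizontal distance `dist`."""
    below_horizontal = math.atan2(cam_h - height, dist)  # ray angle below horizontal
    return cy + f * math.tan(below_horizontal - theta)

def depth_from_foot(theta, cam_h, f, cy, v_foot):
    """Horizontal distance to the plane point seen at image row `v_foot`."""
    below_horizontal = theta + math.atan2(v_foot - cy, f)
    if not 0.0 < below_horizontal < math.pi / 2:
        return math.inf  # the ray does not hit the plane in front of the camera
    return cam_h / math.tan(below_horizontal)

def residual(theta, boxes, cam_h, f, cy, obj_h):
    """Sum of squared errors between observed and predicted box-top rows."""
    total = 0.0
    for v_top, v_foot in boxes:
        dist = depth_from_foot(theta, cam_h, f, cy, v_foot)
        if math.isinf(dist):
            return math.inf
        total += (project_row(theta, cam_h, f, cy, dist, obj_h) - v_top) ** 2
    return total

def estimate_angle(boxes, cam_h, f, cy, obj_h, steps=900, iters=60):
    """Coarse grid over (0, 90) degrees, then ternary-search refinement."""
    cost = lambda t: residual(t, boxes, cam_h, f, cy, obj_h)
    step = (math.pi / 2) / steps
    best = min((i * step for i in range(1, steps)), key=cost)
    lo, hi = best - step, best + step
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# Synthetic check: three objects on the plane, camera tilted down by 30 degrees.
cam_h, f, cy, obj_h = 5.0, 1000.0, 540.0, 1.7  # assumed values for illustration
true_theta = math.radians(30.0)
boxes = [(project_row(true_theta, cam_h, f, cy, d, obj_h),   # box top row
          project_row(true_theta, cam_h, f, cy, d, 0.0))     # box bottom (foot) row
         for d in (5.0, 10.0, 20.0)]
est_theta = estimate_angle(boxes, cam_h, f, cy, obj_h)
depths = [depth_from_foot(est_theta, cam_h, f, cy, v_foot) for _, v_foot in boxes]
print(math.degrees(est_theta), depths)
```

With noise-free synthetic boxes, the recovered angle and depths should match the values used to generate them. Real detections are noisy and camera height is usually unknown, which is where a joint, smoothed estimate such as the one the paper describes becomes necessary.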