DepthMOT: Depth Cues Lead to a Strong Multi-Object Tracker
Jiapeng Wu, Yichen Liu
TL;DR
DepthMOT addresses occlusion and irregular camera motion in multi-object tracking by integrating a self-supervised monocular depth estimation branch and a 6-DoF camera pose branch into a FairMOT-based framework. It computes each object's depth as the average depth along the bottom edge of its bounding box, and uses a pose-driven transformation $p' = K R K^{-1} p + K^{-1} \tau$ to compensate for Kalman-filter errors. The whole model is trained end-to-end on a joint detection and depth loss with learned uncertainty weights: $\mathcal{L} = 0.5\big( e^{-w_1}\mathcal{L}_{det} + e^{-w_2}\gamma\mathcal{L}_{depth} + w_1 + w_2\big)$ with $\gamma=50$. Depth-based cascade matching disambiguates nearby objects, and the estimated pose stabilizes tracking under camera motion. The method yields strong results on VisDrone-MOT and competitive results on UAVDT, with ablations confirming the benefits of both depth cues and motion compensation. Although it adds computation and relies on self-supervised depth, DepthMOT demonstrates a practical pathway to 3D-aware MOT in aerial scenarios.
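To make two of the quantities in the TL;DR concrete, the following is a minimal sketch and not the authors' implementation. It assumes the depth map is a plain list of rows and that a box is given as pixel corners `(x1, y1, x2, y2)`; the function names `bottom_edge_depth` and `joint_loss` are hypothetical, and in the real model $w_1$ and $w_2$ would be learnable parameters rather than plain floats.

```python
import math


def bottom_edge_depth(depth_map, box):
    """Mean depth along the bottom edge of a box (x1, y1, x2, y2) in pixels.

    Assumption: depth_map is a list of rows of floats (height x width).
    """
    x1, _, x2, y2 = box
    height = len(depth_map)
    width = len(depth_map[0])
    # Clamp the bottom row and the column range to the image bounds.
    row = min(max(int(y2), 0), height - 1)
    left = min(max(int(x1), 0), width - 1)
    right = min(max(int(x2), left + 1), width)  # always at least one pixel
    values = depth_map[row][left:right]
    return sum(values) / len(values)


def joint_loss(loss_det, loss_depth, w1, w2, gamma=50.0):
    """L = 0.5 * (exp(-w1) * L_det + exp(-w2) * gamma * L_depth + w1 + w2)."""
    return 0.5 * (
        math.exp(-w1) * loss_det
        + math.exp(-w2) * gamma * loss_depth
        + w1
        + w2
    )


if __name__ == "__main__":
    toy_depth = [
        [1.0, 1.0, 1.0, 1.0],
        [2.0, 4.0, 6.0, 8.0],
        [3.0, 3.0, 3.0, 3.0],
    ]
    # The bottom edge of this box is row 1, columns 1 and 2.
    print(bottom_edge_depth(toy_depth, (1, 0, 3, 1)))
    print(joint_loss(1.0, 0.02, 0.0, 0.0))
```

Sampling only the bottom edge, rather than the whole box, plausibly ties the depth value to the point where the object touches the ground and avoids background pixels inside the box; the $w_1 + w_2$ term keeps the learned weights from growing without bound to shrink the loss.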
Abstract
Accurately distinguishing each object is a fundamental goal of multi-object tracking (MOT) algorithms. However, achieving this goal remains challenging for two main reasons. (i) In crowded scenes with occluded objects, the high overlap of object bounding boxes leads to confusion among closely located objects. Humans, however, naturally perceive the depth of elements in a scene when observing 2D videos. Inspired by this, even when the bounding boxes of objects are close on the camera plane, we can differentiate them in the depth dimension, thereby establishing a 3D perception of the objects. (ii) In videos with rapid, irregular camera motion, abrupt changes in object positions can result in ID switches. However, if the camera pose is known, we can compensate for the errors of linear motion models. In this paper, we propose \textit{DepthMOT}, which (i) detects objects and estimates the scene depth map \textit{end-to-end}, and (ii) compensates for irregular camera motion through camera pose estimation. Extensive experiments demonstrate the superior performance of DepthMOT on the VisDrone-MOT and UAVDT datasets. The code will be available at \url{https://github.com/JackWoo0831/DepthMOT}.
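The camera-motion compensation idea can be illustrated with the transformation quoted in the TL;DR, $p' = K R K^{-1} p + K^{-1} \tau$. The sketch below applies that formula literally to one pixel. It is an illustration under assumptions, not the released code: the helper names (`intrinsics`, `matvec`, `compensate`) are hypothetical, $K$ is assumed to be a standard pinhole intrinsic matrix, and the final division by the homogeneous coordinate is an assumption about how the result is mapped back to pixels.

```python
def matvec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a length-3 vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def intrinsics(fx, fy, cx, cy):
    """Return a pinhole intrinsic matrix K and its closed-form inverse."""
    k = [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]
    k_inv = [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]
    return k, k_inv


def compensate(point, k, k_inv, rotation, tau):
    """Apply p' = K R K^{-1} p + K^{-1} tau to a pixel (x, y)."""
    p = [point[0], point[1], 1.0]  # homogeneous pixel coordinates
    rotated = matvec(k, matvec(rotation, matvec(k_inv, p)))
    shift = matvec(k_inv, tau)
    q = [rotated[i] + shift[i] for i in range(3)]
    # Assumption: normalize by the homogeneous coordinate to get pixels back.
    return q[0] / q[2], q[1] / q[2]


if __name__ == "__main__":
    k, k_inv = intrinsics(100.0, 100.0, 50.0, 50.0)
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # With no rotation and no translation the pixel should not move.
    print(compensate((60.0, 40.0), k, k_inv, identity, [0.0, 0.0, 0.0]))
```

In a tracker, a transformation of this kind would be applied to the predicted box positions before association, so that the linear motion model only has to explain object motion and not camera motion.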
