DepthMOT: Depth Cues Lead to a Strong Multi-Object Tracker

Jiapeng Wu; Yichen Liu

TL;DR

DepthMOT addresses occlusion and irregular camera motion in multi-object tracking by integrating a self-supervised monocular depth estimation branch and a 6-DoF camera pose branch into a FairMOT-based framework. It takes the depth of each object to be the average depth along the bottom edge of its bounding box, and uses a pose-driven transformation $p' = K R K^{-1} p + K^{-1} \tau$ to compensate for Kalman-filter errors. Training is end-to-end and optimizes a joint detection and depth loss with learned uncertainty weights: $\mathcal{L} = 0.5\big( e^{-w_1}\mathcal{L}_{det} + e^{-w_2}\gamma\mathcal{L}_{depth} + w_1 + w_2\big)$ with $\gamma=50$. Depth-based cascade matching disambiguates nearby objects, and the estimated pose stabilizes tracking under camera motion, yielding strong results on VisDrone-MOT and competitive results on UAVDT; ablations confirm the benefits of both depth cues and motion compensation. Although it adds computation and relies on self-supervised depth, DepthMOT demonstrates a practical path to 3D-aware MOT in aerial scenarios.
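
As a rough illustration of two quantities named in this summary, the sketch below computes an object's depth as the mean of the depth map along the bottom edge of its box, and evaluates the uncertainty-weighted loss for given scalar inputs. It is not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, the list-of-rows depth map, and the clipping to image bounds are all assumptions made for the example, and in the real model $w_1$ and $w_2$ would be learned parameters rather than plain floats.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2) in pixels


def bottom_edge_depth(depth_map: List[List[float]], box: Box) -> float:
    """Average depth along the bottom edge of a box (hypothetical helper)."""
    height, width = len(depth_map), len(depth_map[0])
    x1, _, x2, y2 = box
    # Clip the bottom row and the column range to the image (an assumption).
    row = min(max(int(y2), 0), height - 1)
    left = min(max(int(x1), 0), width - 1)
    right = min(max(int(x2), left), width - 1)
    values = depth_map[row][left:right + 1]
    return sum(values) / len(values)


def joint_loss(det_loss: float, depth_loss: float,
               w1: float, w2: float, gamma: float = 50.0) -> float:
    """L = 0.5 * (exp(-w1) * L_det + exp(-w2) * gamma * L_depth + w1 + w2)."""
    return 0.5 * (math.exp(-w1) * det_loss
                  + math.exp(-w2) * gamma * depth_loss
                  + w1 + w2)


# Toy 4x4 depth map; the box spans columns 1..2 and ends on row 3.
toy_depth = [
    [1.0, 1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0, 3.0],
    [4.0, 5.0, 6.0, 7.0],
]
object_depth = bottom_edge_depth(toy_depth, (1, 0, 2, 3))
total = joint_loss(det_loss=2.0, depth_loss=0.01, w1=0.0, w2=0.0)
```

With $w_1 = w_2 = 0$ the loss reduces to half the sum of the detection loss and the $\gamma$-scaled depth loss; raising either weight shrinks the corresponding term while adding a linear penalty.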

Abstract

Accurately distinguishing each object is a fundamental goal of multi-object tracking (MOT) algorithms. However, achieving this goal remains challenging, primarily for two reasons. (i) In crowded scenes with occluded objects, the high overlap of object bounding boxes leads to confusion among closely located objects. Humans, however, naturally perceive the depth of elements in a scene when observing 2D videos. Inspired by this, even when the bounding boxes of objects are close on the camera plane, we can differentiate them in the depth dimension, thereby establishing a 3D perception of the objects. (ii) In videos with rapid, irregular camera motion, abrupt changes in object positions can result in ID switches. However, if the camera pose is known, we can compensate for the errors in linear motion models. In this paper, we propose \textit{DepthMOT}, which achieves: (i) detecting objects and estimating the scene depth map \textit{end-to-end}, and (ii) compensating for irregular camera motion through camera pose estimation. Extensive experiments demonstrate the superior performance of DepthMOT on the VisDrone-MOT and UAVDT datasets. The code will be available at \url{https://github.com/JackWoo0831/DepthMOT}.
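
To make the idea of pose-based compensation concrete, here is a minimal sketch that applies the transformation $p' = K R K^{-1} p + K^{-1} \tau$ exactly as written in the TL;DR to a homogeneous pixel $p = (u, v, 1)$. Several things here are assumptions rather than details from the paper: $K$ is taken to be a pinhole intrinsics matrix with focal lengths `fx`, `fy` and principal point `cx`, `cy`, $R$ a 3x3 rotation, $\tau$ a translation vector, and all function names are hypothetical. The result is left as an unnormalized 3-vector, since the summary does not say how it is converted back to pixel coordinates.

```python
from typing import List

Vec3 = List[float]
Mat3 = List[List[float]]


def matvec(m: Mat3, v: Vec3) -> Vec3:
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def intrinsics(fx: float, fy: float, cx: float, cy: float) -> Mat3:
    """Pinhole intrinsics K (assumed form, no skew)."""
    return [[fx, 0.0, cx], [0.0, fy, cy], [0.0, 0.0, 1.0]]


def intrinsics_inv(fx: float, fy: float, cx: float, cy: float) -> Mat3:
    """Closed-form inverse of the pinhole K above."""
    return [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]


def warp_point(p: Vec3, R: Mat3, tau: Vec3,
               fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Evaluate p' = K R K^{-1} p + K^{-1} tau for a homogeneous pixel p."""
    K = intrinsics(fx, fy, cx, cy)
    K_inv = intrinsics_inv(fx, fy, cx, cy)
    rotated = matvec(K, matvec(R, matvec(K_inv, p)))
    shift = matvec(K_inv, tau)
    return [a + b for a, b in zip(rotated, shift)]


identity: Mat3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel: Vec3 = [320.0, 240.0, 1.0]

# No camera motion: the pixel should map to itself.
unchanged = warp_point(pixel, identity, [0.0, 0.0, 0.0], 500.0, 500.0, 320.0, 240.0)
# Pure sideways translation of the camera.
shifted = warp_point(pixel, identity, [5.0, 0.0, 0.0], 500.0, 500.0, 320.0, 240.0)
```

In a tracker, a warp of this kind could be applied to the Kalman-predicted box coordinates before association, so that the prediction accounts for how the camera moved between frames.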

Paper Structure (24 sections, 14 equations, 6 figures, 3 tables)

Introduction
Related Work
Multi-Object Tracking
Depth Estimation
Depth cues in MOT
Methodology
Preliminaries
DepthMOT
Depth Branch
Pose Branch
Training
Detection Loss
Depth Loss
Overall Loss
Inference
...and 9 more sections

Figures (6)

Figure 1: Motivation of our work. (a) When objects occlude each other, we can distinguish them by the depth information. (b) Since depth estimation requires information about the camera pose, concurrently with depth estimation, we can also correct errors in the linear motion model (such as Kalman Filter) under irregular camera motion by changes in camera pose.
Figure 2: Flowchart of training process of self-supervised monocular depth estimation.
Figure 3: Diagram of our proposed DepthMOT.
Figure 4: Visualization results on challenging scenes in the VisDrone test set. The number in the upper-left corner of each bounding box indicates the object ID.
Figure 5: Visualization results on challenging scenes in the UAVDT test set. The number in the upper-left corner of each bounding box indicates the object ID.
...and 1 more figure
