Enhanced Object Tracking by Self-Supervised Auxiliary Depth Estimation Learning

Zhenyu Wei, Yujie He, Zhanchuan Cai

TL;DR

MDETrack addresses two limitations of RGB-D tracking, its reliance on real depth input and the cost of multi-modal fusion, by training an auxiliary monocular depth estimation task jointly with tracking. The framework uses a unified, lightweight feature extractor and discards the depth branch at inference to preserve speed, while supervised or self-supervised depth signals shape depth-aware representations during training. Across LaSOT, GOT-10K, DepthTrack, and VOT-RGBD2022, self-supervised auxiliary depth learning improves tracking with minimal or no loss in efficiency, which positions depth estimation as a training signal rather than an inference burden. The work shows that depth perception can improve 2D tracking in diverse and data-constrained scenarios, broadening the practical applicability of RGB-based trackers.

Abstract

RGB-D tracking significantly improves the accuracy of object tracking. However, its dependency on real depth inputs and the complexity of multi-modal fusion limit its applicability across scenarios. The use of depth information in RGB-D tracking inspired us to propose a new method, named MDETrack, which trains a tracking network with the additional capability of understanding scene depth through supervised or self-supervised auxiliary Monocular Depth Estimation learning. The output of MDETrack's unified feature extractor is fed to two parallel heads: the tracking head and the auxiliary depth estimation head. The auxiliary module is discarded at inference, so the inference speed is unchanged. We evaluated our models with various training strategies on multiple datasets, and the results show improved tracking accuracy even without real depth. These findings highlight the potential of depth estimation for enhancing object tracking performance.
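
The train-time and inference-time arrangement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `AuxDepthTracker`, the method `strip_auxiliary`, the function `joint_loss`, and the balancing factor `aux_weight` are all hypothetical names introduced here, and the "networks" are toy functions on lists of floats. The sketch only shows the structural idea: one shared extractor feeds a tracking head and an optional depth head, and removing the depth head leaves the tracking output unchanged.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vector = List[float]


@dataclass
class AuxDepthTracker:
    """Toy stand-in: a shared extractor, a tracking head, an optional depth head."""
    extractor: Callable[[Vector], Vector]
    track_head: Callable[[Vector], Vector]
    depth_head: Optional[Callable[[Vector], Vector]] = None

    def forward(self, image: Vector) -> Tuple[Vector, Optional[Vector]]:
        feats = self.extractor(image)  # shared features, computed once
        box = self.track_head(feats)   # tracking output, always produced
        # The depth output exists only while the auxiliary head is attached.
        depth = self.depth_head(feats) if self.depth_head is not None else None
        return box, depth

    def strip_auxiliary(self) -> "AuxDepthTracker":
        """Return an inference-time copy without the auxiliary depth head."""
        return AuxDepthTracker(self.extractor, self.track_head, None)


def joint_loss(track_loss: float, depth_loss: float, aux_weight: float = 0.5) -> float:
    # aux_weight is an assumed balancing factor, not a value from the paper.
    return track_loss + aux_weight * depth_loss


# Toy components (assumptions for illustration only).
def toy_extractor(img: Vector) -> Vector:
    return [2.0 * v for v in img]


def toy_track_head(feats: Vector) -> Vector:
    return [sum(feats), max(feats)]


def toy_depth_head(feats: Vector) -> Vector:
    return [v + 1.0 for v in feats]


train_model = AuxDepthTracker(toy_extractor, toy_track_head, toy_depth_head)
infer_model = train_model.strip_auxiliary()

box_train, depth_train = train_model.forward([0.5, 1.0, 1.5])
box_infer, depth_infer = infer_model.forward([0.5, 1.0, 1.5])
```

Because the tracking head reads only the shared features, `box_infer` equals `box_train`, while `depth_infer` is `None`: the auxiliary branch influences training through the loss and adds no work at inference.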

Paper Structure

This paper contains 39 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Comparison between existing multi-modal paradigm and MDETrack. (a) RGB-D tracking (using real depth) and monocular 3D detection (using estimated depth). (b) MDETrack learns monocular depth estimation through an auxiliary module during training. (c) MDETrack discards the auxiliary module to achieve faster inference.
  • Figure 2: Overview. The adjacent frame pair $X=\{I_{t'},I_{t}\}$ generated by data preprocessing is sent for camera pose prediction, and only $I_t$ is sent to the patch embedding module. The dark grey region #1 is used as an auxiliary learning branch in supervised training, while region #2 is further enabled for self-supervised auxiliary training.
  • Figure 3: Left: The first row demonstrates the way to compose a sequence of frames $X=\{I_{t'},I_{t}\}$ by random sampling, with $R$ as the sampling range. The second row shows the padding mask generation process. Right: Relationship between the padded image and padding mask.
  • Figure 4: Visualization of the attention maps in tracking head. Networks with auxiliary depth estimation possess an improved capability to discern the 3D structure of the scene, thereby enabling a more concentrated focus on the expected target.
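
The frame-pair sampling and padding-mask generation mentioned in the Figure 3 caption can be sketched as follows. This is a hedged illustration, not the paper's preprocessing code: the function names `sample_frame_pair` and `pad_with_mask`, the uniform sampling rule, the top-left placement of the image, and the mask convention (1 for real pixels, 0 for padding) are all assumptions made here.

```python
import random
from typing import List, Tuple

Image = List[List[float]]


def sample_frame_pair(num_frames: int, sampling_range: int,
                      rng: random.Random) -> Tuple[int, int]:
    """Pick an index t, then a different index t' with |t' - t| <= sampling_range."""
    if num_frames < 2 or sampling_range < 1:
        raise ValueError("need at least two frames and a sampling range >= 1")
    t = rng.randrange(num_frames)
    lo = max(0, t - sampling_range)
    hi = min(num_frames - 1, t + sampling_range)
    # Exclude t itself so the pair always holds two distinct frames.
    candidates = [i for i in range(lo, hi + 1) if i != t]
    t_prime = rng.choice(candidates)
    return t_prime, t


def pad_with_mask(image: Image, out_h: int, out_w: int,
                  fill: float = 0.0) -> Tuple[Image, List[List[int]]]:
    """Pad an image to (out_h, out_w); the mask marks real pixels with 1 (assumed convention)."""
    h, w = len(image), len(image[0])
    if h > out_h or w > out_w:
        raise ValueError("output size must not be smaller than the image")
    padded = [[fill] * out_w for _ in range(out_h)]
    mask = [[0] * out_w for _ in range(out_h)]
    for r in range(h):
        for c in range(w):
            padded[r][c] = image[r][c]  # image is placed at the top-left corner
            mask[r][c] = 1
    return padded, mask


rng = random.Random(0)
t_prime, t = sample_frame_pair(num_frames=100, sampling_range=5, rng=rng)
padded, mask = pad_with_mask([[1.0, 2.0]], out_h=2, out_w=3)
```

A mask of this kind lets a loss ignore padded regions, for example by averaging a per-pixel depth error only where the mask is 1.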