Stereo-Matching Knowledge Distilled Monocular Depth Estimation Filtered by Multiple Disparity Consistency

Woonghyun Ka, Jae Young Lee, Jaehyun Choi, Junmo Kim

TL;DR

Self-supervised monocular depth estimation often suffers from errors in the pseudo-depth generated by stereo-matching networks. This work proposes a GT-free filtering mechanism based on consistency across multiple disparity maps obtained via a disparity plane sweep, producing a weight map that down-weights unreliable regions during training. The weight map modulates the depth regression loss, enabling the monocular network to learn from accurate pseudo-depth without additional GT or stereo-confidence training. Experiments on the KITTI Eigen split and Cityscapes demonstrate improved accuracy and robustness across backbone and stereo-network configurations, with qualitative gains at object boundaries and in challenging regions.

Abstract

In stereo-matching knowledge distillation methods for self-supervised monocular depth estimation, the stereo-matching network's knowledge is distilled into a monocular depth network through pseudo-depth maps. These methods generally rely on a learning-based stereo-confidence network to identify errors in the pseudo-depth maps so that the errors are not transferred. However, learning-based stereo-confidence networks must be trained with ground truth (GT), which is not feasible in a self-supervised setting. In this paper, we propose a method that identifies and filters errors in the pseudo-depth map by checking the consistency of multiple disparity maps, requiring neither GT nor a training process. Experimental results show that the proposed method outperforms previous methods and works well across various configurations by filtering out erroneous areas where stereo matching is vulnerable, such as textureless regions, occlusion boundaries, and reflective surfaces.
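
The filtering idea can be illustrated with a small, dependency-free sketch. It is not the paper's formulation: the exponential weighting, the `tau` parameter, the sign convention for the shifts, and the function names `consistency_weight` and `weighted_l1_loss` are all assumptions made for illustration. The sketch assumes that each of the K disparity maps was predicted with a known disparity offset, so that adding the offset back should reproduce the reference prediction wherever stereo matching is reliable.

```python
import math


def consistency_weight(disparities, shifts, tau=1.0):
    """Return an H x W weight map in (0, 1] from K disparity maps.

    disparities: list of K maps (each a list of rows); the first is the
                 unshifted reference prediction.
    shifts:      list of K known disparity offsets. Assumed convention:
                 adding the offset maps a prediction back to the reference.
    tau:         hypothetical softness parameter, in pixels.
    """
    if len(disparities) < 2 or len(disparities) != len(shifts):
        raise ValueError("need at least two disparity maps, one shift per map")
    ref, ref_shift = disparities[0], shifts[0]
    height, width = len(ref), len(ref[0])
    weight = []
    for y in range(height):
        row = []
        for x in range(width):
            center = ref[y][x] + ref_shift
            # Mean absolute disagreement of the shift-corrected predictions.
            dev = sum(abs(d[y][x] + s - center)
                      for d, s in zip(disparities[1:], shifts[1:]))
            dev /= len(disparities) - 1
            # Assumed soft weighting: consistent pixels get weights near 1.
            row.append(math.exp(-dev / tau))
        weight.append(row)
    return weight


def weighted_l1_loss(pred, pseudo, weight, eps=1e-8):
    """Weighted mean absolute error between predicted and pseudo depth."""
    num = den = 0.0
    for p_row, t_row, w_row in zip(pred, pseudo, weight):
        for p, t, w in zip(p_row, t_row, w_row):
            num += w * abs(p - t)
            den += w
    return num / (den + eps)
```

With this weighting, a pixel whose shifted predictions disagree by many pixels contributes almost nothing to the loss, while consistent pixels are counted at full weight. A hard threshold on the disagreement would be an equally plausible choice under the same assumptions.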

Paper Structure

This paper contains 11 sections, 3 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Principle to obtain depth from a stereo camera system.
  • Figure 2: Overall framework of the proposed method.
  • Figure 3: Disparity profile observation.
  • Figure 4: Qualitative comparison with the previous methods on the KITTI Eigen split dataset. $d_{s}^{0}$ and $W$ denote the predicted disparity map of the stereo-matching network (Watson et al., 2020) and the weight map of the proposed method (Ours), respectively.