Self-supervised Monocular Depth Estimation Robust to Reflective Surface Leveraged by Triplet Mining

Wonhyeok Choi, Kyumin Hwang, Wei Peng, Minwoo Choi, Sunghoon Im

TL;DR

This work tackles the failure of self-supervised monocular depth estimation on reflective/non-Lambertian surfaces by introducing a reflection-aware training paradigm. It localizes reflective regions via triplet mining that leverages cross-view photometric discrepancies, and uses a reflection-aware loss to suppress contaminated gradients in those regions. A two-teacher distillation scheme further preserves high-frequency depth details in non-reflective areas while enhancing reflective-surface accuracy. Across indoor and outdoor benchmarks, including ScanNet, KITTI, NYU-v2, and cross-dataset tests, the proposed approach yields robust depth estimates on reflective surfaces with minimal sacrifice to non-reflective regions, marking a notable step toward reliable, plug-in depth learning for real-world scenes.

Abstract

Self-supervised monocular depth estimation (SSMDE) aims to predict a dense depth map from a single image by learning depth from RGB image sequences, which eliminates the need for ground-truth depth labels. Although this approach simplifies data acquisition compared to supervised methods, it struggles with reflective surfaces: they violate the Lambertian reflectance assumption, which leads to inaccurate training on such surfaces. To tackle this problem, we propose a novel training strategy for SSMDE that leverages triplet mining to pinpoint reflective regions at the pixel level, guided by the camera geometry between different viewpoints. The proposed reflection-aware triplet mining loss specifically penalizes inappropriate photometric error minimization in the localized reflective regions while preserving depth accuracy in non-reflective areas. We also incorporate a reflection-aware knowledge distillation method that enables a student model to selectively learn pixel-level knowledge from reflective and non-reflective regions. This results in robust depth estimation across both reflective and non-reflective areas. Evaluation results on multiple datasets demonstrate that our method effectively enhances depth quality on reflective surfaces and outperforms state-of-the-art SSMDE baselines.
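
To make the idea of localizing reflective pixels and keeping them out of the photometric objective more concrete, here is a minimal NumPy sketch. It is not the paper's triplet mining loss, whose equations are not reproduced on this page. As a simplified stand-in, it thresholds the discrepancy between two view-synthesized images to obtain a mask, then averages the photometric error only over unmasked pixels. The function names, the L1 error, the averaging of the two source views, and the threshold `tau` are all assumptions made for illustration.

```python
import numpy as np


def photometric_error(a, b):
    """Per-pixel L1 error between two HxWx3 images (assumed error measure)."""
    return np.abs(a - b).mean(axis=-1)


def masked_photometric_loss(target, warped_a, warped_b, tau=0.2):
    """Threshold-based stand-in for reflection-aware training.

    target:   HxWx3 target frame
    warped_a: HxWx3 first source view synthesized into the target frame
    warped_b: HxWx3 second source view synthesized into the target frame
    tau:      hypothetical threshold on cross-view discrepancy

    Returns (reflective_mask, loss). The mask is True where the two
    synthesized views disagree by more than tau.
    """
    # Cross-view discrepancy: how much the two synthesized views differ.
    cross_view = photometric_error(warped_a, warped_b)
    reflective_mask = cross_view > tau

    # Ordinary photometric error against the target, averaged over views.
    per_pixel = 0.5 * (
        photometric_error(target, warped_a)
        + photometric_error(target, warped_b)
    )

    # Exclude masked pixels so they contribute nothing to the loss.
    keep = ~reflective_mask
    loss = float(per_pixel[keep].mean()) if keep.any() else 0.0
    return reflective_mask, loss
```

In a real training loop the same masking would be applied to a differentiable loss inside a deep learning framework; NumPy is used here only to keep the sketch self-contained.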

Paper Structure

This paper contains 33 sections, 8 equations, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Photometric constancy violation on reflective surfaces. The projected non-reflective surface point satisfies photometric constancy, so the model can recover accurate depth by minimizing photometric error. In contrast, the projected reflective surface point violates photometric constancy, so photometric error minimization yields a wrong disparity. To simplify the illustration, the figure depicts cameras whose relative positions shift horizontally, akin to rectified stereo.
  • Figure 2: The effect of the proposed method on reflective/non-reflective surfaces. The figure marks the projected non-reflective and reflective surface points, along with the location of the reflection lobe in view-synthesized image coordinates. The proposed method cancels out the wrong photometric error minimization in reflective areas by contrasting the negative pair samples.
  • Figure 3: Qualitative results of the proposed methods on ScanNet. We visualize the predicted depth of Monodepth2 (Godard et al., 2019) trained with three different methods, including the proposed ones: Self-supervised, Ours, and Ours$^{\dagger}$. The error map shows the absolute difference between the predicted and ground-truth depth, normalized to the range 0 to 255.
  • Figure 4: Qualitative results of the proposed methods w.r.t. reflective region mask $M_r$.
  • Figure 5: Qualitative results of the proposed methods on the 7-Scenes and Booster datasets. The error map shows the absolute difference between the predicted and ground-truth depth, normalized to the range 0 to 255.
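
The error maps described in the captions of Figures 3 and 5 can be approximated with a few lines of NumPy. The captions only state that the absolute difference between prediction and ground truth is normalized to 0 to 255; the sketch below assumes per-image normalization by the maximum error and treats non-positive ground-truth values as missing, neither of which is confirmed by the source. The function name `depth_error_map` is hypothetical.

```python
import numpy as np


def depth_error_map(pred, gt):
    """Absolute depth error scaled to uint8 values in [0, 255].

    Assumptions (not stated in the captions): pixels with gt <= 0 are
    treated as missing and set to 0, and scaling uses the per-image
    maximum error.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Ignore pixels without valid ground-truth depth.
    valid = gt > 0
    err = np.where(valid, np.abs(pred - gt), 0.0)

    max_err = err.max()
    if max_err == 0:
        # Perfect prediction (or no valid pixels): all-zero map.
        return np.zeros(err.shape, dtype=np.uint8)

    # Scale so the largest error maps to 255.
    return np.round(err / max_err * 255.0).astype(np.uint8)
```

The resulting array can be passed to any image viewer or colormap; a fixed error ceiling shared across images would be a reasonable alternative when comparing several methods side by side.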