Intrinsic Image Decomposition for Robust Self-supervised Monocular Depth Estimation on Reflective Surfaces
Wonhyeok Choi, Kyumin Hwang, Minwoo Choi, Kiljoon Han, Wonjoon Choi, Mingyu Shin, Sunghoon Im
TL;DR
The paper addresses self-supervised monocular depth estimation (SSMDE) on reflective, non-Lambertian surfaces. It introduces an end-to-end framework with two synergistic branches, intrinsic image decomposition and depth estimation: the intrinsic branch identifies diffuse and residual components, and the depth branch uses this information to exclude reflective regions from supervision. A Mahalanobis-based masking strategy and a contrastive intrinsic loss term stabilize training and improve depth in reflective areas, while a distillation component further boosts performance and reduces computational cost. Across indoor datasets featuring reflective content, the method yields substantial depth-accuracy gains on reflective surfaces and maintains strong generalization on non-reflective scenes, offering a practical advance over traditional photometric-based SSMDE and multi-stage distillation approaches.
Abstract
Self-supervised monocular depth estimation (SSMDE) has gained attention in the field of deep learning as it estimates depth without requiring ground truth depth maps. This approach typically uses a photometric consistency loss between a synthesized image, generated from the estimated depth, and the original image, thereby reducing the need for extensive dataset acquisition. However, the conventional photometric consistency loss relies on the Lambertian assumption, which often leads to significant errors when dealing with reflective surfaces that deviate from this model. To address this limitation, we propose a novel framework that incorporates intrinsic image decomposition into SSMDE. Our method synergistically trains for both monocular depth estimation and intrinsic image decomposition. The accurate depth estimation facilitates multi-image consistency for intrinsic image decomposition by aligning different view coordinate systems, while the decomposition process identifies reflective areas and excludes corrupted gradients from the depth training process. Furthermore, our framework introduces a pseudo-depth generation and knowledge distillation technique to further enhance the performance of the student model across both reflective and non-reflective surfaces. Comprehensive evaluations on multiple datasets show that our approach significantly outperforms existing SSMDE baselines in depth prediction, especially on reflective surfaces.
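
The idea of excluding reflective areas from the photometric supervision can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `masked_photometric_loss`, the plain L1 error, the grayscale list-of-lists image format, and the hand-written reflective mask are all assumptions made for illustration. In the framework described above the mask would come from the intrinsic decomposition branch, and the synthesized image from warping a neighboring view with the estimated depth.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of intensities in [0, 1] (assumed format)
Mask = List[List[bool]]    # True marks a pixel flagged as reflective (assumed convention)


def masked_photometric_loss(target: Image, synthesized: Image, reflective: Mask) -> float:
    """Mean absolute intensity error over pixels NOT flagged as reflective.

    Hypothetical helper: a plain L1 error stands in for whatever photometric
    term a real SSMDE pipeline would use.
    """
    total = 0.0
    count = 0
    for row_t, row_s, row_m in zip(target, synthesized, reflective):
        for t, s, is_reflective in zip(row_t, row_s, row_m):
            if is_reflective:
                continue  # excluded pixel: contributes nothing to the loss
            total += abs(t - s)
            count += 1
    # Return 0.0 when every pixel is masked, to avoid dividing by zero.
    return total / count if count else 0.0


# Toy 2x2 example: the bottom-right pixel violates photometric consistency,
# as a view-dependent specular highlight might.
target = [[0.2, 0.4], [0.6, 0.8]]
synthesized = [[0.2, 0.4], [0.6, 0.1]]
no_mask = [[False, False], [False, False]]
highlight_mask = [[False, False], [False, True]]

loss_unmasked = masked_photometric_loss(target, synthesized, no_mask)
loss_masked = masked_photometric_loss(target, synthesized, highlight_mask)
print(loss_unmasked, loss_masked)
```

In this toy case the unmasked loss is dominated by the single inconsistent pixel, whereas the masked loss ignores it, so the depth network would receive no misleading signal from that location.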
