Self-supervised Monocular Depth Estimation on Water Scenes via Specular Reflection Prior
Zhengyang Lu, Ying Chen
TL;DR
This work tackles monocular depth estimation in water scenes by exploiting specular reflections as intra-frame priors, which lets depth estimation be reformulated as multi-view synthesis within a single image. It introduces a two-stage framework: water surface segmentation, followed by self-supervised depth estimation guided by a photometric re-projection loss that combines SmoothL1 with a photometric adaptive SSIM (PASSIM), plus a plane-depth complement that derives the water surface from the real and virtual camera poses. A new Water Reflection Scene (WRS) dataset rendered in Unreal Engine 4 supports training and evaluation. Experiments on WRS show substantial improvements over prior self-supervised methods, particularly on reflective water regions, while keeping the model compact. The approach extends self-supervised depth estimation to reflective scenes and opens avenues for leveraging unlabelled web data, though it relies on clearly visible specular reflections and accurate plane-depth reasoning for water surfaces.
Abstract
Monocular depth estimation from a single image is an ill-posed problem in computer vision because a single view offers few reliable cues to serve as prior knowledge. Besides inter-frame supervision, namely stereo pairs and adjacent frames, extensive prior information is available within a single frame. Reflections from specular surfaces are informative intra-frame priors that enable us to reformulate the ill-posed depth estimation task as multi-view synthesis. This paper proposes the first self-supervised deep-learning approach to depth estimation on water scenes via intra-frame priors, namely reflection supervision and geometrical constraints. In the first stage, a water segmentation network separates the reflection components from the rest of the image. Next, we construct a self-supervised framework that predicts the target appearance from the reflections, which are treated as views from other perspectives. The photometric re-projection error, which incorporates SmoothL1 and a novel photometric adaptive SSIM, is formulated to optimize pose and depth estimation by aligning the transformed virtual depths with the source ones. As a supplement, the water surface is determined from the real and virtual camera positions, which complements the depth of the water area. Furthermore, to avoid laborious ground-truth annotation, we introduce a large-scale Water Reflection Scene (WRS) dataset rendered in Unreal Engine 4. Extensive experiments on the WRS dataset demonstrate the feasibility of the proposed method in comparison with state-of-the-art depth estimation techniques.
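The plane-depth step described above can be illustrated with elementary mirror geometry. The following is a minimal sketch, not the authors' implementation. It assumes a perfectly planar water surface, a pinhole camera with intrinsics K, and a virtual camera centre already expressed in the real camera's coordinate frame (real camera at the origin, looking along +z). The function names water_plane_from_cameras and plane_depth are hypothetical. Under these assumptions the water plane is the perpendicular bisector of the segment joining the two camera centres, and the depth of each water pixel follows from a ray-plane intersection.

```python
import numpy as np


def water_plane_from_cameras(c_real, c_virtual):
    """Return (n, d) of the plane n.x + d = 0 that mirrors c_real onto c_virtual.

    Assumption: the virtual camera is the exact mirror image of the real
    camera, so the plane is the perpendicular bisector of the two centres.
    """
    c_real = np.asarray(c_real, dtype=float)
    c_virtual = np.asarray(c_virtual, dtype=float)
    diff = c_real - c_virtual
    n = diff / np.linalg.norm(diff)        # unit normal, pointing toward the real camera
    midpoint = 0.5 * (c_real + c_virtual)  # a point that lies on the plane
    d = -float(n @ midpoint)
    return n, d


def plane_depth(K, n, d, height, width):
    """Per-pixel z-depth of the plane n.x + d = 0 seen from a camera at the origin.

    Pixels whose viewing ray is parallel to the plane, or hits it behind the
    camera, are returned as NaN.
    """
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(float)  # (H, W, 3)
    rays = pixels @ np.linalg.inv(K).T  # back-projected rays with z = 1
    denom = rays @ n                    # n . ray for every pixel, shape (H, W)
    with np.errstate(divide="ignore", invalid="ignore"):
        z = -d / denom                  # solve n . (z * ray) + d = 0 for z
    return np.where((denom != 0) & (z > 0), z, np.nan)
```

As a quick check under the image convention of y pointing down: a camera 2 units above the water has its mirror image at (0, 4, 0), which gives the plane y = 2. With a focal length of 100 pixels and principal point (50, 50), the pixel at row 100, column 50 looks along the ray (0, 0.5, 1) and meets the water at depth 4. In practice the resulting depth map would only be used inside the segmented water mask.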
