Self-supervised Monocular Depth Estimation on Water Scenes via Specular Reflection Prior

Zhengyang Lu, Ying Chen

TL;DR

This work tackles monocular depth estimation in water scenes by exploiting intra-frame priors from specular reflections, which lets depth estimation be reformulated as multi-view synthesis within a single image. It introduces a two-stage framework: water surface segmentation, followed by self-supervised depth estimation guided by a photometric re-projection loss augmented with a photometric adaptive SSIM (PASSIM), plus a plane-depth complement that uses the real and virtual camera poses. A new Water Reflection Scene (WRS) dataset rendered in Unreal Engine 4 supports training and evaluation, and experiments show substantial improvements over prior self-supervised methods, particularly on reflective water regions, while the model remains compact. The approach extends self-supervised depth estimation to reflective scenes and opens avenues for using unlabelled web data, though it relies on clear specular reflections and accurate plane-depth reasoning for water surfaces.
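To make the loss structure concrete, here is a minimal NumPy sketch of a photometric error that mixes SmoothL1 with an SSIM term. It does not reproduce PASSIM: plain SSIM stands in for the adaptive variant, and the weight `alpha=0.85`, the 3x3 window, and all function names are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def _local_mean(x, k=3):
    # k x k box filter with reflect padding, so the output keeps the input shape.
    xp = np.pad(x, k // 2, mode="reflect")
    return sliding_window_view(xp, (k, k)).mean(axis=(-2, -1))


def ssim_map(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    # Standard (non-adaptive) SSIM per pixel for grayscale images in [0, 1].
    mu_a, mu_b = _local_mean(a), _local_mean(b)
    var_a = _local_mean(a * a) - mu_a ** 2
    var_b = _local_mean(b * b) - mu_b ** 2
    cov = _local_mean(a * b) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def smooth_l1(a, b, beta=1.0):
    # Quadratic near zero, linear for large residuals.
    d = np.abs(a - b)
    return np.where(d < beta, 0.5 * d * d / beta, d - 0.5 * beta)


def photometric_loss(pred, target, alpha=0.85):
    # alpha is an assumed mixing weight, not a value from the paper.
    ssim_term = np.clip((1.0 - ssim_map(pred, target)) / 2.0, 0.0, 1.0)
    per_pixel = alpha * ssim_term + (1.0 - alpha) * smooth_l1(pred, target)
    return float(per_pixel.mean())
```

In a full pipeline, `pred` would be the view synthesized from the reflection and `target` the source region; here both are simply 2-D float arrays of the same shape.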

Abstract

Monocular depth estimation from a single image is an ill-posed computer vision problem because a single view offers few reliable cues to serve as prior knowledge. Beyond inter-frame supervision from stereo pairs and adjacent frames, extensive prior information is available within a single frame. Reflections from specular surfaces are informative intra-frame priors that allow the ill-posed depth estimation task to be reformulated as multi-view synthesis. This paper proposes the first self-supervised deep-learning approach to depth estimation on water scenes based on intra-frame priors, namely reflection supervision and geometric constraints. In the first stage, a water segmentation network separates the reflection components from the rest of the image. Next, we construct a self-supervised framework that predicts the target appearance from the reflections, which are treated as views from another perspective. The photometric re-projection error, which combines SmoothL1 with a novel photometric adaptive SSIM, is formulated to optimize pose and depth estimation by aligning the transformed virtual depths with the source depths. As a supplement, the water surface is determined from the real and virtual camera positions, which complements the depth of the water area. Furthermore, to avoid laborious ground-truth annotation, we introduce a large-scale Water Reflection Scene (WRS) dataset rendered in Unreal Engine 4. Extensive experiments on the WRS dataset demonstrate the feasibility of the proposed method compared with state-of-the-art depth estimation techniques.
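The plane-depth step can be illustrated with basic mirror geometry: if the virtual camera is the reflection of the real camera in the water surface, that surface is the perpendicular bisector of the segment joining the two camera centres. The sketch below is one way to express this, assuming a pinhole camera with intrinsics `K`, the real camera at the origin of its own frame, and hypothetical function names; it is not the authors' implementation.

```python
import numpy as np


def water_plane_from_cameras(c_real, c_virtual):
    # The mirror plane passes through the midpoint of the two camera centres
    # and has the direction between them as its normal.
    c_real = np.asarray(c_real, dtype=float)
    c_virtual = np.asarray(c_virtual, dtype=float)
    diff = c_virtual - c_real
    dist = np.linalg.norm(diff)
    if dist == 0.0:
        raise ValueError("real and virtual camera centres coincide")
    return diff / dist, 0.5 * (c_real + c_virtual)


def plane_depth(pixels, K, normal, point):
    # Assumes the real camera sits at the origin with identity rotation.
    # pixels: (N, 2) array of (u, v) coordinates inside the water mask.
    pixels = np.asarray(pixels, dtype=float)
    uv1 = np.concatenate([pixels, np.ones((len(pixels), 1))], axis=1)
    rays = uv1 @ np.linalg.inv(K).T  # back-projected rays with z = 1
    denom = rays @ normal
    with np.errstate(divide="ignore", invalid="ignore"):
        t = (point @ normal) / denom  # z-depth, because each ray has z = 1
        valid = np.isfinite(t) & (t > 0)
    # Rays parallel to the plane or pointing away from it get NaN.
    return np.where(valid, t, np.nan)
```

For example, with the camera 2 units above the water (y pointing down), the virtual centre is at `(0, 4, 0)`, the recovered plane is `y = 2`, and a pixel below the principal point maps to a finite depth while a pixel above the horizon yields NaN.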

Paper Structure (18 sections, 11 equations, 8 figures, 6 tables, 1 algorithm)

Figures (8)

  • Figure 1: Schematic of depth estimation in a reflective scene. The simultaneous appearance of the inverted and raw images reformulates the ill-posed depth estimation task as an interpretable multi-view synthesis problem.
  • Figure 2: Differences in inter- and intra-frame supervision methods. Reflection information enables self-supervised depth estimation in single frames.
  • Figure 3: Depth from a single image in water reflection scenes. DORN, an end-to-end depth estimation method, produces blurred, misleading results.
  • Figure 4: Complete framework of the proposed depth estimation method. The water segmentation network employs a standard convolutional U-Net for water area prediction. Similarly, the depth network employs a standard U-Net for depth prediction. The independent pose network predicts the perspectives between real and virtual images.
  • Figure 5: Colour distributions of the inverted and source images. It can be seen that the distribution of the inverted image is roughly a proportionally reduced version of the source distribution.
  • ...and 3 more figures