DCPI-Depth: Explicitly Infusing Dense Correspondence Prior to Unsupervised Monocular Depth Estimation

Mengtan Zhang, Yi Feng, Qijun Chen, Rui Fan

TL;DR

This paper tackles unsupervised monocular depth estimation, targeting weaknesses in RGB-context learning and edge-aware smoothness as well as weak supervision in textureless or dynamic regions. It proposes DCPI-Depth, a two-stream framework that couples a photometric consistency-guided (PCG) stream with a correspondence prior-guided (CPG) stream through two geometric losses: the contextual-geometric depth consistency (CGDC) loss, which aligns a depth map triangulated from dense correspondences with the depth inferred from image context, and the differential property correlation (DPC) loss, which ties optical flow divergence to depth gradient. A bidirectional stream co-adjustment (BSCA) strategy further reconciles rigid and optical flows to improve depth in dynamic scenes. The approach reports state-of-the-art results on six public datasets and generalizes well to unseen environments; the authors plan to release the code publicly.

Abstract

There has been a recent surge of interest in learning to perceive depth from monocular videos in an unsupervised fashion. A key challenge in this field is achieving robust and accurate depth estimation in difficult scenarios, particularly in regions with weak textures or where dynamic objects are present. This study makes three major contributions by delving deeply into dense correspondence priors to provide existing frameworks with explicit geometric constraints. The first novelty is a contextual-geometric depth consistency loss, which employs depth maps triangulated from dense correspondences based on estimated ego-motion to guide the learning of depth perception from contextual information, since explicitly triangulated depth maps capture accurate relative distances among pixels. The second novelty arises from the observation that there exists an explicit, deducible relationship between optical flow divergence and depth gradient. A differential property correlation loss is therefore designed to refine depth estimation with a specific emphasis on local variations. The third novelty is a bidirectional stream co-adjustment strategy that enhances the interaction between rigid and optical flows, encouraging the former towards more accurate correspondence and making the latter more adaptable across various scenarios under the static-scene hypothesis. DCPI-Depth, a framework that incorporates all these components and couples two bidirectional and collaborative streams, achieves state-of-the-art performance and generalizability across multiple public datasets, outperforming all existing methods. Specifically, it demonstrates accurate depth estimation in texture-less and dynamic regions, and shows more reasonable smoothness. Our source code will be publicly available at mias.group/DCPI-Depth upon publication.
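
To make the first contribution more concrete, the following is a minimal sketch of how a depth value can be triangulated from a single correspondence once ego-motion is known. It is not the authors' implementation: it uses a standard two-view least-squares ray triangulation, and every name here (`triangulate_depth`, `project`, the `(fx, fy, cx, cy)` intrinsics tuple) is a hypothetical choice made for illustration. It assumes a pinhole camera, a rotation `R` and translation `t` that map target-frame points into the source frame, and a flow vector pointing from the target pixel to its match in the source image.

```python
def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[i][j] * v[j] for j in range(3)) for i in range(3)]

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(K, R, t, pixel, depth):
    """Map a target pixel with known depth to its location in the source image."""
    fx, fy, cx, cy = K
    u, v = pixel
    point = [depth * (u - cx) / fx, depth * (v - cy) / fy, depth]
    ps = [a + b for a, b in zip(matvec(R, point), t)]
    return (fx * ps[0] / ps[2] + cx, fy * ps[1] / ps[2] + cy)

def triangulate_depth(K, R, t, pixel, flow):
    """Least-squares depth of a target pixel from its flow correspondence.

    Solves D * (R @ ray_t) + t = D_s * ray_s for D by taking the cross
    product with ray_s, which eliminates the unknown source depth D_s.
    Returns None when the rays are parallel (no parallax).
    """
    fx, fy, cx, cy = K
    u, v = pixel
    us, vs = u + flow[0], v + flow[1]
    ray_t = [(u - cx) / fx, (v - cy) / fy, 1.0]
    ray_s = [(us - cx) / fx, (vs - cy) / fy, 1.0]
    a = cross(ray_s, matvec(R, ray_t))
    b = cross(ray_s, t)
    denom = dot(a, a)
    if denom < 1e-12:
        return None
    return -dot(a, b) / denom

# Illustrative values only (assumed, not taken from the paper).
K = (500.0, 500.0, 320.0, 240.0)
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.5, 0.0, 0.0]
pixel = (400.0, 260.0)
match = project(K, R, t, pixel, depth=10.0)
flow = (match[0] - pixel[0], match[1] - pixel[1])
recovered = triangulate_depth(K, R, t, pixel, flow)
```

In this sketch the depth depends only on the correspondence and the ego-motion, not on image appearance. The `None` branch marks the degenerate case where a correspondence carries no parallax and therefore no depth information.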

Paper Structure

This paper contains 17 sections, 16 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: The overall architecture of our proposed DCPI-Depth framework, which consists of two collaborative and bidirectional streams: PCG and CPG. The input image pairs, PoseNet, and the estimated ego-motion are depicted separately in each stream.
  • Figure 2: Illustrations of optical flow divergence and auto-masking result: (a) optical flow divergence for pixels with similar intensities yet being spatially discontinuous; (b) optical flow divergence for pixels with significantly different intensities yet being spatially continuous; (c) auto-masking result for the given source and target images. A given pixel and its four neighbors in the target image are utilized for visualization in (a) and (b), where it can be observed that their correspondences in the source image are widely separated in (a) but similarly distributed in (b). The auto-masking algorithm tends to overly mask static regions, particularly in low-texture areas or when overexposed, and cannot effectively mask dynamic objects.
  • Figure 3: An illustration of the interaction between PCG and CPG streams through the proposed BSCA strategy to address the challenges posed by dynamic objects.
  • Figure 4: Qualitative comparisons among Monodepth2, Lite-Mono-8M, and our proposed DCPI-Depth on the KITTI [geiger2012we] dataset. (a)-(b), (c)-(d), (e)-(f), and (g)-(h) demonstrate the robustness of DCPI-Depth in texture-less regions, in texture-rich regions, at static object boundaries, and on dynamic objects, respectively.
  • Figure 5: Qualitative comparisons between Lite-Mono and our proposed DCPI-Depth on the DDAD [guizilini20203d], nuScenes [caesar2020nuscenes], and Waymo Open [mei2022waymo] datasets.
  • ...and 3 more figures