Refinement of Monocular Depth Maps via Multi-View Differentiable Rendering

Laura Fink, Linus Franke, Bernhard Egger, Joachim Keinert, Marc Stamminger

TL;DR

The paper addresses the challenge of recovering accurate, view-consistent depth from monocular predictions by fusing strong monocular priors with multi-view information. It introduces an analysis-by-synthesis framework that refines a monocular depth map through a two-stage process on a meshed representation, guided by a differentiable renderer to enforce geometric and photometric consistency across views. The global scale is anchored by an SfM point cloud, and refinement proceeds from a coarse neural-field alignment to a fine, per-vertex local adjustment, augmented with a tonemapping module. Empirically, the method yields dense, high-quality depth maps that surpass several state-of-the-art multi-view approaches on indoor datasets, while highlighting limitations related to specular materials and input quality.

Abstract

Accurate depth estimation is at the core of many applications in computer graphics, vision, and robotics. Current state-of-the-art monocular depth estimators, trained on extensive datasets, generalize well but lack the 3D consistency needed for many applications. In this paper, we combine the strengths of these generalizing monocular depth estimation techniques with multi-view data by framing this as an analysis-by-synthesis optimization problem that lifts and refines such relative depth maps into accurate, error-free depth maps. After an initial global scale estimation through structure-from-motion point clouds, we further refine the depth map through an optimization that enforces multi-view consistency via photometric and geometric losses with differentiable rendering of the meshed depth map. In a two-stage optimization, the scaling is refined first, and afterwards artifacts and errors in the depth map are corrected via nearby-view photometric supervision. Our evaluation shows that our method generates detailed, high-quality, view-consistent, accurate depth maps, even in challenging indoor scenarios, and outperforms state-of-the-art multi-view depth reconstruction approaches on such datasets. Project page and source code can be found at https://lorafib.github.io/ref_depth/.
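
To make the idea of anchoring a relative depth map to a structure-from-motion point cloud more concrete, here is a minimal sketch of a generic closed-form least-squares scale-and-shift fit. It is not taken from the paper or its source code: the function names, the toy values, and the assumption that the relative prediction is already depth-like (rather than inverse depth) are all illustrative. The figure captions further down mention such a least-squares alignment only as a point of comparison, so this is best read as background rather than as the authors' procedure.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift.

    `relative`: relative depths sampled at pixels where sparse SfM points project.
    `metric`:   depths of those SfM points in the camera frame.
    Both names are hypothetical; this is ordinary 1D linear regression.
    """
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two paired samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0.0:
        raise ValueError("relative depths are constant; scale is undetermined")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def apply_scale_shift(depth_map, scale, shift):
    """Apply the fitted affine map to every pixel of a 2D depth map (list of rows)."""
    return [[scale * d + shift for d in row] for row in depth_map]


# Toy data (made up for illustration): metric = 2 * relative + 1
relative_at_points = [0.1, 0.4, 0.7, 0.9]
sfm_depths = [1.2, 1.8, 2.4, 2.8]
scale, shift = fit_scale_shift(relative_at_points, sfm_depths)
aligned = apply_scale_shift([[0.1, 0.4], [0.7, 0.9]], scale, shift)
```

A single global scale and shift can only correct an affine mismatch. Spatially varying errors remain, which is the kind of residual the two-stage refinement described in the abstract is meant to address.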

Paper Structure

This paper contains 22 sections, 18 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Our novel algorithm refines a depth map from a monocular estimator with multi-view differentiable rendering based on a meshed representation of the target depth map. This results in dense and accurate depth maps, especially in challenging indoor scenarios. Note that the ground truth here is obtained from a meshed LiDAR scan, so uncertain areas are left blank.
  • Figure 2: Warping with depth maps reveals inconsistencies and artifacts. Absolute depth maps [yang2024depth] estimated with least squares show misalignments upon warping. In contrast, our approach, which combines estimation and optimization, effectively preserves and enforces multi-view consistency.
  • Figure 3: Overview of our method. We employ monocular depth estimation to obtain a relative but topologically complete depth map. Results from Structure-from-Motion are used to scale the depth map to absolute space. Next, we convert the depth map to a surface mesh for refinement via differentiable rendering. The refinement is done in two consecutive steps: first, we learn a mapping function that smoothly aligns the depth map to the sparse point cloud; second, we refine per-vertex positions, yielding accurate depth maps.
  • Figure 4: Histograms of different states of depth maps during our optimization. The initial estimate (a) is in relative space (red) and taken as input to be refined (orange). Our absolute estimation (b) brings this to absolute space by aligning the medians and 0.1th percentiles. Then, we optimize (c) to closely capture the depth distribution of the ground truth (red). Missing values, like windows, are counted as 0 m.
  • Figure 5: Visualization of our depth mesh $\mathbf{M}$, where triangle size varies with scene complexity. Numbers in lower left corner indicate decimation ratio $r$.
  • ...and 8 more figures
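
The Figure 4 caption describes bringing the relative estimate to absolute space by aligning medians and a low percentile. A minimal sketch of such a statistics-based alignment follows. The function names, the linear-interpolation percentile, and the value `low_q=10.0` used in the example are assumptions made for illustration only; the percentile the authors actually use is not resolved here, and the reference depths would come from the sparse SfM points.

```python
def percentile(values, q):
    """Percentile with linear interpolation between order statistics, q in [0, 100]."""
    if not values:
        raise ValueError("empty input")
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def align_by_statistics(relative, reference, low_q):
    """Find scale and shift so that the median and the `low_q` percentile of
    scale * relative + shift match those of `reference`.

    `low_q` is a free parameter in this sketch, not a value from the paper.
    """
    med_rel, med_ref = percentile(relative, 50.0), percentile(reference, 50.0)
    low_rel, low_ref = percentile(relative, low_q), percentile(reference, low_q)
    spread_rel = med_rel - low_rel
    if spread_rel == 0.0:
        raise ValueError("median and low percentile coincide; scale is undetermined")
    scale = (med_ref - low_ref) / spread_rel
    shift = med_ref - scale * med_rel
    return scale, shift


# Toy data (made up for illustration): reference = 3 * relative + 2
relative_depths = [0.0, 0.25, 0.5, 0.75, 1.0]
reference_depths = [2.0, 2.75, 3.5, 4.25, 5.0]
scale, shift = align_by_statistics(relative_depths, reference_depths, low_q=10.0)
```

Matching order statistics instead of minimizing squared error makes the fit less sensitive to a few outlying points, at the cost of using only two summary values of each distribution.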