DoubleTake: Geometry Guided Depth Estimation

Mohamed Sayed, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Guillermo Garcia-Hernando, Gabriel Brostow, Sara Vicente, Michael Firman

TL;DR

This work tackles interactive depth estimation from sequences of posed RGB images by injecting prior geometric information into a depth predictor. It introduces a Hint MLP that fuses a multi-view stereo cost volume with a rendered depth map and a confidence map, both derived from a continually updated TSDF-based 3D reconstruction, enabling robust depth predictions even when hints are incomplete or absent. The persistent geometry is maintained via TSDF fusion and rendered on demand, with a training regime that exposes the model to varied hint availability and a two-pass evaluation that leverages full scene geometry. Empirically, the method achieves state-of-the-art depth and 3D reconstruction results on ScanNetV2, 7-Scenes, and 3RScan while delivering interactive runtimes and resilience to pose errors and scene changes, though it remains limited to observed geometry and struggles with transparent or reflective surfaces.
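The training regime above varies how much of the geometry hint the network sees, so the model stays robust when hints are incomplete or absent. A minimal sketch of such a sampling step (probabilities and names are hypothetical, not the authors' values):

```python
import random

def sample_hint_availability(p_no_hint=0.25, p_partial=0.25, rng=random):
    """Randomly pick a hint regime for one training sample:
    'none'    -> no rendered geometry hint is given,
    'partial' -> only part of the hint is kept (e.g. masked regions),
    'full'    -> the complete rendered depth hint is used.
    """
    r = rng.random()
    if r < p_no_hint:
        return "none"
    if r < p_no_hint + p_partial:
        return "partial"
    return "full"
```

Exposing the network to all three regimes is what lets it gracefully fall back to cost-volume-only matching when no prior geometry exists.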

Abstract

Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning, and beyond. Prior work typically makes use of previous frames in a multi-view stereo framework, relying on matching textures in a local neighborhood. In contrast, our model leverages historical predictions by giving the latest 3D geometry data as an extra input to our network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes, and it is more regularized than individual depth maps predicted for previous frames. We introduce a Hint MLP which combines cost volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry. We demonstrate that our method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
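Per the abstract (and Figure 3), the Hint MLP sees three things at every cost-volume location: the visual matching score, the absolute difference between the rendered depth hint and that location's depth plane, and the per-pixel hint confidence. A minimal sketch of assembling those inputs (shapes and names are hypothetical, not the authors' code):

```python
import numpy as np

def hint_mlp_inputs(matching_score, rendered_depth, confidence, plane_depths):
    """Stack the three Hint MLP inputs for every cost-volume location.

    matching_score: (D, H, W) visual matching scores from the cost volume
    rendered_depth: (H, W)   depth hint rendered from the prior TSDF geometry
    confidence:     (H, W)   per-pixel confidence of that hint
    plane_depths:   (D,)     depth of each cost-volume plane
    Returns a (D, H, W, 3) tensor of per-location MLP inputs.
    """
    # Geometry hint: |rendered depth - depth plane| at each volume position.
    geometry_hint = np.abs(rendered_depth[None] - plane_depths[:, None, None])
    conf = np.broadcast_to(confidence[None], geometry_hint.shape)
    return np.stack([matching_score, geometry_hint, conf], axis=-1)
```

The small per-location input lets the MLP down-weight the hint wherever confidence is low, rather than trusting prior geometry unconditionally.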

Paper Structure

This paper contains 28 sections, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: A depth hint and confidence are rendered from a prior estimate of geometry and given as input to our method. This enables it to correctly predict the depth for ambiguous parts of the scene.
  • Figure 2: Method overview. Like other MVS methods, we take as input a sequence of RGB frames. Optionally, our network can additionally ingest a depth map rendered from the 3D representation of the scene built up so far, encoded in a TSDF volume. Alongside the rendered depth map, we include an estimate of how confident the global geometry is at each point, visualized as vertex colors (purple is higher confidence) on the mesh here. The depth predicted by our model is fused back into the TSDF to update the 3D geometry incrementally. When no such rendered depth map is available, our network gracefully falls back to our baseline model's performance.
  • Figure 3: Method detail. Our feature volume is reduced to a cost volume via a matching MLP. Our Hint MLP then combines the multi-view-stereo cost volume with an estimate of previously predicted geometry. For every location in the cost volume, the Hint MLP takes as input (i) the visual matching score, (ii) the geometry hint, formed as the absolute difference between the rendered depth hint and the depth plane at that cost volume position, and (iii) an estimate of the confidence of the hint at that pixel.
  • Figure 4: We introduce a more accurate mesh evaluation. (a) shows the ground truth mesh, which contains many holes. (b) shows an example predicted mesh, here from Stier et al. (2023). This is punished for being too complete, as the visibility mask of Bozic et al. (2021) (c) extends beyond the ground truth, giving high Acc error on their prediction (d). Our new masking (e) is tighter to the ground truth mesh, giving a more meaningful error (f).
  • Figure 5: Qualitative depth results on ScanNetV2. All methods are run incrementally, where we run at interactive speeds with access only to previous frames. We compare with SimpleRecon (Sayed et al., 2022) and DeepVideoMVS (Duzceker et al., 2021), the two closest-performing baselines. Our depth maps are more accurate, with better small details (e.g. top) and overall geometry (e.g. bottom).
  • ...and 3 more figures
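As Figure 2 describes, each predicted depth map is fused back into a TSDF volume, which serves as the persistent geometry that later hints are rendered from. A minimal sketch of the standard per-voxel weighted running-average TSDF update (truncation and weight cap are hypothetical values, and this is generic TSDF fusion rather than the authors' implementation):

```python
import numpy as np

def tsdf_update(tsdf, weight, sdf_obs, trunc=0.08, max_weight=64.0):
    """Fuse one observation into a batch of voxels.

    tsdf, weight: current truncated SDF values in [-1, 1] and fusion weights
    sdf_obs:      signed distance (metres) from each voxel to the observed
                  surface along the camera ray for the new depth map
    """
    tsdf_obs = np.clip(sdf_obs / trunc, -1.0, 1.0)
    # Skip voxels far behind the observed surface (occluded space).
    valid = sdf_obs > -trunc
    fused = np.where(valid, (tsdf * weight + tsdf_obs) / (weight + 1.0), tsdf)
    new_w = np.where(valid, np.minimum(weight + 1.0, max_weight), weight)
    return fused, new_w
```

Because the running average smooths noise across many frames, depth hints rendered from this volume are more regularized than any single previously predicted depth map, which is the property the abstract highlights.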