DoubleTake: Geometry Guided Depth Estimation
Mohamed Sayed, Filippo Aleotti, Jamie Watson, Zawar Qureshi, Guillermo Garcia-Hernando, Gabriel Brostow, Sara Vicente, Michael Firman
TL;DR
This work tackles interactive depth estimation from sequences of posed RGB images by injecting prior geometric information into a depth predictor. It introduces a Hint MLP that fuses a multi-view stereo cost volume with a rendered depth map and a confidence map derived from a continually updated TSDF-based 3D reconstruction, enabling robust depth predictions even when hints are incomplete or absent. The persistent geometry is maintained via TSDF fusion and rendered on demand, with a training regime that exposes the model to varied hint availability and a two-pass evaluation that leverages full scene geometry. Empirically, the method achieves state-of-the-art depth and 3D reconstruction on ScanNetV2, 7-Scenes, and 3RScan, while delivering interactive runtimes and resilience to pose errors and scene changes, though it remains limited to observed geometry and faces challenges with transparent or reflective surfaces.
Abstract
Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality and path planning, among others. Prior work typically makes use of previous frames in a multi-view stereo framework, relying on matching textures in a local neighborhood. In contrast, our model leverages historical predictions by giving the latest 3D geometry data as an extra input to our network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes, and it is more regularized than the individual depth maps predicted for previous frames. We introduce a Hint MLP which combines cost volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry. We demonstrate that our method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
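To make the fusion idea concrete, the sketch below shows a minimal per-pixel Hint MLP in NumPy: a cost-volume feature vector is concatenated with the rendered depth hint and its confidence, then passed through a small MLP. All shapes, layer sizes, and the zero-confidence convention for missing hints are illustrative assumptions, not the paper's actual architecture or dimensions.

```python
import numpy as np

def hint_mlp(cost_feat, hint_depth, hint_conf, weights):
    """Hypothetical per-pixel Hint MLP sketch.

    cost_feat:  1-D cost-volume feature vector for one pixel (size assumed).
    hint_depth: depth rendered from the persistent TSDF geometry.
    hint_conf:  confidence in that hint (assumed 0 when no hint exists,
                so the MLP can learn to fall back on cost-volume matching).
    weights:    list of (W, b) pairs defining the MLP layers.
    """
    # Fuse matching evidence with the geometric hint by concatenation.
    x = np.concatenate([cost_feat, [hint_depth, hint_conf]])
    for W, b in weights[:-1]:
        x = np.maximum(W @ x + b, 0.0)  # ReLU hidden layers
    W, b = weights[-1]
    return W @ x + b  # fused per-pixel output (e.g. a matching score)

# Tiny example: 8-dim cost feature + 2 hint channels -> 1 scalar output.
rng = np.random.default_rng(0)
dims = [10, 16, 1]  # illustrative layer sizes
weights = [(rng.standard_normal((dims[i + 1], dims[i])) * 0.1,
            np.zeros(dims[i + 1])) for i in range(len(dims) - 1)]
out = hint_mlp(rng.standard_normal(8), hint_depth=2.3, hint_conf=0.9,
               weights=weights)
print(out.shape)
```

Passing `hint_conf=0.0` for pixels where the TSDF renders no geometry lets the same network handle complete, partial, or absent hints, which mirrors the varied-hint-availability training regime described above.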
