Scale-Invariant Monocular Depth Estimation via SSI Depth
S. Mahdi H. Miangoleh, Mahesh Reddy, Yağız Aksoy
TL;DR
This work tackles scale-invariant (SI) monocular depth estimation with a two-stage pipeline that feeds scale-and-shift-invariant (SSI) depth into an SI network. A high-resolution SSI depth module produces two inputs, O^L and O^H, for the SI network, which is trained only on synthetic data; a novel sparse ordinal loss sharpens depth discontinuities in the SSI estimates. The key contributions are a high-resolution SSI depth estimator, a compatible sparse ordinal loss, and a training recipe that stabilizes scale using the SSI input. Together they yield strong zero-shot performance on unseen in-the-wild data and highly detailed 3D scene reconstructions for computational photography. The results indicate that SSI inputs simplify the SI task enough for a network trained on synthetic data to generalize to real-world scenes, making SI depth maps more practical for high-fidelity 3D rendering and editing.
Abstract
Existing methods for scale-invariant monocular depth estimation (SI MDE) often struggle because of the complexity of the task and the limited, non-diverse datasets available, which hinders generalization to real-world scenarios. In contrast, scale-and-shift-invariant (SSI) depth estimation, which simplifies the task and can be trained on abundant stereo datasets, achieves high performance. We present a novel approach that leverages SSI inputs to enhance SI depth estimation, streamlining the network's role and facilitating in-the-wild generalization while using only a synthetic dataset for training. To emphasize high-resolution details, we introduce a novel sparse ordinal loss that substantially improves detail generation in SSI MDE, addressing critical limitations of existing approaches. Through in-the-wild qualitative examples and zero-shot evaluation, we substantiate the practical utility of our approach in computational photography applications, showcasing its ability to generate highly detailed SI depth maps and to generalize across diverse scenarios.
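The distinction between SI and SSI depth can be illustrated with a small NumPy sketch. It is not the paper's implementation: the function names `align_scale` and `align_scale_shift` and the toy values are assumptions made for illustration. An SI prediction is treated as correct up to one unknown scale factor, while an SSI prediction is correct up to an unknown scale and an unknown shift, so the two need different least-squares alignments before they can be compared with a reference.

```python
import numpy as np

def align_scale(pred, ref):
    """Scale-only alignment: find s minimizing ||s * pred - ref||^2."""
    s = np.sum(pred * ref) / np.sum(pred * pred)
    return s * pred

def align_scale_shift(pred, ref):
    """Scale-and-shift alignment: find s, t minimizing ||s * pred + t - ref||^2."""
    # Design matrix with one column for the prediction and one for the offset.
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, ref.ravel(), rcond=None)
    return s * pred + t

# Toy reference map (hypothetical values, arbitrary units).
ref = np.array([[1.0, 2.0], [3.0, 4.0]])

si_pred = 0.5 * ref          # correct up to an unknown scale
ssi_pred = 0.5 * ref + 2.0   # correct up to an unknown scale and shift

# Maximum absolute error after each alignment.
si_err = np.abs(align_scale(si_pred, ref) - ref).max()
ssi_scale_only_err = np.abs(align_scale(ssi_pred, ref) - ref).max()
ssi_err = np.abs(align_scale_shift(ssi_pred, ref) - ref).max()

print(si_err, ssi_scale_only_err, ssi_err)
```

In this toy case the scale-only fit recovers the reference from the SI prediction but leaves a clear residual on the SSI prediction, which needs the additional shift term. That residual is the ambiguity an SI network has to resolve when it takes SSI depth as input.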
