Scale-Invariant Monocular Depth Estimation via SSI Depth

S. Mahdi H. Miangoleh, Mahesh Reddy, Yağız Aksoy

TL;DR

This work tackles scale-invariant monocular depth estimation by introducing a two-stage pipeline that leverages scale-and-shift-invariant (SSI) depth as input to a scale-invariant (SI) network. A high-resolution SSI depth module produces two inputs, O^L and O^H, enabling detailed geometric reconstruction when combined with a synthetic-training regime and a novel sparse ordinal loss to sharpen depth discontinuities. Key contributions include a high-resolution SSI depth estimator, a compatible sparse ordinal loss, and a training recipe that stabilizes scale using the SSI input, achieving strong zero-shot performance on unseen in-the-wild data and enabling highly detailed 3D scene reconstructions for computational photography. The approach demonstrates that SSI inputs simplify the SI task sufficiently to generalize from synthetic data to real-world scenes, substantially advancing the practicality of SI depth maps for high-fidelity 3D rendering and editing.

Abstract

Existing methods for scale-invariant monocular depth estimation (SI MDE) often struggle due to the complexity of the task and the limited, non-diverse datasets available, which hinders generalization to real-world scenarios. In contrast, shift-and-scale-invariant (SSI) depth estimation, which simplifies the task and enables training on abundant stereo datasets, achieves high performance. We present a novel approach that leverages SSI inputs to enhance SI depth estimation, streamlining the network's role and facilitating in-the-wild generalization while using only a synthetic dataset for training. Emphasizing the generation of high-resolution details, we introduce a novel sparse ordinal loss that substantially improves detail generation in SSI MDE, addressing critical limitations of existing approaches. Through in-the-wild qualitative examples and zero-shot evaluation, we substantiate the practical utility of our approach in computational photography applications, showcasing its ability to generate highly detailed SI depth maps and generalize to diverse scenarios.
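
The difference between SI and SSI predictions can be illustrated with a small alignment example. The following is a minimal sketch, not the paper's evaluation code: it assumes depth values stored as plain Python lists, and the function names (align_scale, align_scale_shift, rmse) are hypothetical. An SI prediction matches the ground truth after multiplying by a single scale, while an SSI prediction also needs an unknown shift, so a scale-only fit leaves a residual error.

```python
import math


def align_scale(pred, gt):
    """Least-squares scale s minimizing sum((s * pred - gt)^2)."""
    num = sum(p * g for p, g in zip(pred, gt))
    den = sum(p * p for p in pred)
    return num / den


def align_scale_shift(pred, gt):
    """Least-squares (s, t) minimizing sum((s * pred + t - gt)^2)."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    var = sum((p - mean_p) ** 2 for p in pred)
    s = cov / var
    t = mean_g - s * mean_p
    return s, t


def rmse(a, b):
    """Root-mean-square error between two equally long lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


# Toy data (assumed for illustration only).
gt = [1.0, 2.0, 4.0]
si_pred = [0.5, 1.0, 2.0]     # ground truth up to an unknown scale
ssi_pred = [0.0, 0.25, 0.75]  # ground truth up to an unknown scale and shift

# A single scale is enough to recover the ground truth from the SI prediction.
s = align_scale(si_pred, gt)
print("SI, scale only:", rmse([s * p for p in si_pred], gt))

# The SSI prediction needs both scale and shift.
s2, t2 = align_scale_shift(ssi_pred, gt)
print("SSI, scale and shift:", rmse([s2 * p + t2 for p in ssi_pred], gt))

# Scale alone cannot remove the shift ambiguity of the SSI prediction.
s3 = align_scale(ssi_pred, gt)
print("SSI, scale only:", rmse([s3 * p for p in ssi_pred], gt))
```

The unresolved shift is what distorts geometry when an SSI map is projected to 3D, which is why the abstract treats SI depth as the harder but more useful target.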

Paper Structure (19 sections, 7 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 2: Qualitative comparison of scale-invariant networks on the Middlebury dataset [scharstein2014high]. Our scale-invariant network captures intricate objects with more depth detail than the state of the art.
  • Figure 3: Plot of our ordinal loss and the ranking loss [chen2016single]. The ranking loss assigns a high penalty even to correctly ordered pairs, while we apply a penalty only to incorrectly ordered pairs.
  • Figure 4: In-the-wild performance of our model compared to LeReS [leres]. Our model captures the 3D shape of scenes with different depth distributions at high resolution and with precise boundaries. As highlighted by the insets, the absence of details in LeReS causes geometric distortions in the projected point clouds. Our accurate boundary localization enables precise shape representation, even for complex in-the-wild scenes. Image credits: Death to the Stock Photo
  • Figure 5: 3D point clouds generated by our SI depth and by LeReS [leres], shown from various views. By leveraging our crisp SSI depth, our SI depth produces finer details, which yields a more precise representation of shape than the less detailed and inaccurate results of LeReS. The missing details in LeReS lead to distortion and blending of details into the background (see the flowers in the first row, the monitors in the second row, the objects on the table in the third row, and the tray in the last row, as emphasized by the insets). Image credits: [scharstein2014high], [koch2018evaluation], Death to the Stock Photo
  • Figure 6: Qualitative comparison of scale-and-shift-invariant networks in the wild. Our SSI network produces crisper depth boundaries than other methods, and the results of our high-resolution boosted model exhibit even more refined depth boundaries. Image credits: https://unsplash.com/photos/white-concrete-building-with-fountain-bNEaIT3HIMk, https://unsplash.com/photos/a-cafe-with-a-brick-building-Kl3yDaIY8nk
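
The contrast described in the Figure 3 caption can be sketched numerically. The exact form of the paper's sparse ordinal loss is not given in this summary, so the hinge-style penalty below is only an assumed stand-in with the stated property (zero for correctly ordered pairs). The logistic form is the one commonly used for pairwise ranking losses such as [chen2016single]. The function names and the sign convention for r are hypothetical.

```python
import math


def ranking_loss(z_i, z_j, r):
    """Logistic pairwise ranking loss.

    Convention assumed here: r = +1 means z_i should be larger than z_j,
    r = -1 means z_i should be smaller than z_j.
    """
    return math.log(1.0 + math.exp(-r * (z_i - z_j)))


def hinge_ordinal_penalty(z_i, z_j, r):
    """Assumed stand-in for an ordinal loss: zero when the pair is
    correctly ordered, growing linearly with the size of the violation."""
    return max(0.0, -r * (z_i - z_j))


# Sweep the signed margin r * (z_i - z_j) from violated to satisfied.
for margin in (-2.0, -1.0, 0.0, 1.0, 2.0):
    z_i, z_j, r = margin, 0.0, 1
    print(
        f"margin={margin:+.1f}",
        f"ranking={ranking_loss(z_i, z_j, r):.4f}",
        f"hinge={hinge_ordinal_penalty(z_i, z_j, r):.4f}",
    )
```

For positive margins the logistic loss stays above zero and keeps pushing already-correct pairs apart, while the hinge-style penalty contributes nothing, which matches the behavior the caption attributes to the two curves.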