Depth Map Prediction from a Single Image using a Multi-Scale Deep Network

David Eigen, Christian Puhrsch, Rob Fergus

TL;DR

This paper tackles monocular depth estimation, where the overall scale is inherently ambiguous and both global scene structure and local detail must be integrated. It uses a two-stack CNN: a coarse network predicts a global depth map from the entire image, and a fine-scale network refines that prediction locally, taking the coarse output as an additional input. Training uses a scale-invariant loss that emphasizes depth relations rather than absolute scale. The approach achieves state-of-the-art results on NYU Depth v2 and KITTI, outperforming baselines on both scale-dependent and scale-invariant metrics and producing crisper depth boundaries without superpixelation. The work shows that using the raw datasets as large sources of training data, combined with two-stage refinement, yields robust depth predictions from single RGB images.

Abstract

Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.

Paper Structure

This paper contains 17 sections, 4 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Model architecture.
  • Figure 2: Weight vectors from layer Coarse 7 (coarse output), for (a) KITTI and (b) NYUDepth. Red is positive (farther) and blue is negative (closer); black is zero. Weights are selected uniformly and shown in descending order by $l_2$ norm. KITTI weights often show changes in depth on either side of the road. NYUDepth weights often show wall positions and doorways.
  • Figure 3: Qualitative comparison of Make3D, our method trained with $l_2$ loss ($\lambda=0$), and our method trained with both $l_2$ and scale-invariant loss ($\lambda=0.5$).
  • Figure 4: Example predictions from our algorithm. NYUDepth on left, KITTI on right. For each image, we show (a) input, (b) output of coarse network, (c) refined output of fine network, (d) ground truth. The fine scale network edits the coarse-scale input to better align with details such as object boundaries and wall edges. Examples are sorted from best (top) to worst (bottom).
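
The abstract mentions a scale-invariant error, and the Figure 3 caption contrasts $\lambda=0$ (plain $l_2$ loss) with $\lambda=0.5$ (a mix of $l_2$ and scale-invariant loss). The following is a minimal sketch of how such a loss could be computed. It assumes the loss takes the form $\frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\left(\sum_i d_i\right)^2$ with $d_i = \log y_i - \log y_i^*$, which is not spelled out in this summary. The function name `scale_invariant_loss` and its flat-list inputs are illustrative choices, not the authors' code.

```python
import math
from typing import Sequence


def scale_invariant_loss(
    pred: Sequence[float], target: Sequence[float], lam: float = 0.5
) -> float:
    """Sketch of a log-space depth loss with a scale-invariance term.

    Assumed form (not stated in this summary):
        mean(d^2) - lam * mean(d)^2,  where d = log(pred) - log(target).
    lam = 0 gives a plain l2 loss in log space; lam = 1 ignores any
    global rescaling of the prediction.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and equal length")
    # Per-pixel difference in log depth; depths must be strictly positive.
    d = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(d)
    mean_sq = sum(x * x for x in d) / n   # (1/n) * sum(d_i^2)
    sq_mean = (sum(d) / n) ** 2           # (1/n^2) * (sum(d_i))^2
    return mean_sq - lam * sq_mean


# Hypothetical depths: the prediction is the ground truth scaled by 2.
target = [1.0, 2.0, 4.0, 8.0]
doubled = [2.0 * t for t in target]

loss_l2 = scale_invariant_loss(doubled, target, lam=0.0)
loss_mixed = scale_invariant_loss(doubled, target, lam=0.5)
loss_invariant = scale_invariant_loss(doubled, target, lam=1.0)
```

Under this assumed form, a prediction that is off only by a global factor is penalized fully at `lam=0.0`, half as much at `lam=0.5`, and not at all at `lam=1.0`, which matches the caption's description of $\lambda=0.5$ as a combination of the two losses.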