Deeper Depth Prediction with Fully Convolutional Residual Networks

Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, Nassir Navab

TL;DR

The paper tackles monocular depth estimation from a single RGB image with a fully convolutional residual network. Novel up-projection blocks and a faster up-convolution scheme let the network produce dense depth maps at higher resolution while using fewer parameters and less training data than the prior state of the art. Training uses the reverse Huber (BerHu) loss, which better handles the heavy-tailed depth distributions common in real-world scenes. Results on NYU Depth v2 and Make3D show state-of-the-art accuracy at real-time speed, and ablations illustrate the benefits of up-projection and the BerHu loss. The approach is also validated in a SLAM context, showing practical utility for real-time 3D reconstruction without post-processing refinements.

Abstract

This paper addresses the problem of estimating the depth map of a scene given a single RGB image. We propose a fully convolutional architecture, encompassing residual learning, to model the ambiguous mapping between monocular images and depth maps. In order to improve the output resolution, we present a novel way to efficiently learn feature map up-sampling within the network. For optimization, we introduce the reverse Huber loss that is particularly suited for the task at hand and driven by the value distributions commonly present in depth maps. Our model is composed of a single architecture that is trained end-to-end and does not rely on post-processing techniques, such as CRFs or other additional refinement steps. As a result, it runs in real-time on images or videos. In the evaluation, we show that the proposed model contains fewer parameters and requires fewer training data than the current state of the art, while outperforming all approaches on depth estimation. Code and models are publicly available.
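
The reverse Huber (BerHu) loss named in the abstract can be sketched as follows. The abstract gives only its name, so the formula here is an assumption based on the commonly cited definition: the penalty is $|x|$ for residuals up to a threshold $c$ and $(x^2 + c^2)/(2c)$ beyond it, with $c$ set to one fifth of the largest absolute residual in the batch. The function name `berhu_loss`, the flat-list inputs, and the mean reduction are illustrative choices, not the authors' code.

```python
def berhu_loss(pred, target):
    """Mean reverse Huber (BerHu) loss over two equal-length sequences of depths.

    Assumed definition: L1 for small residuals, scaled L2 for large ones,
    with the threshold c tied to the largest residual in the batch.
    """
    residuals = [abs(p - t) for p, t in zip(pred, target)]
    c = 0.2 * max(residuals)  # assumed threshold: 1/5 of the max residual
    if c == 0.0:
        return 0.0  # perfect prediction; avoids dividing by zero below
    total = 0.0
    for r in residuals:
        if r <= c:
            total += r  # L1 branch
        else:
            total += (r * r + c * c) / (2.0 * c)  # L2 branch, continuous at r == c
    return total / len(residuals)


# Hypothetical toy values, only to exercise both branches.
loss = berhu_loss([1.0, 2.0, 5.0], [1.0, 1.5, 0.0])
```

Both branches give the value $c$ at $|x| = c$, so the loss is continuous. Large residuals are penalised quadratically, while small ones keep a constant-magnitude L1 gradient.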

Paper Structure

This paper contains 15 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Network architecture. The proposed architecture builds upon ResNet-50. We replace the fully-connected layer, which was part of the original architecture, with our novel up-sampling blocks, yielding an output of roughly half the input resolution
  • Figure 2: From up-convolutions to up-projections. (a) Standard up-convolution. (b) The equivalent but faster up-convolution. (c) Our novel up-projection block, following residual logic. (d) The faster equivalent version of (c)
  • Figure 3: Faster up-convolutions. Top row: the common up-convolutional steps: unpooling doubles a feature map's size, filling the holes with zeros, and a $5\times5$ convolution filters this map. Depending on the position of the filter, only certain parts of it (A,B,C,D) are multiplied with non-zero values. This motivates convolving the original feature map with the 4 differently composed filters (bottom part) and interleaving them to obtain the same output, while avoiding zero multiplications. A,B,C,D only mark locations and the actual weight values will differ
  • Figure 4: Depth Prediction on NYU Depth. Qualitative results showing predictions using AlexNet, VGG, and the fully-connected ResNet compared to our model and the predictions of Eigen15. All colormaps are scaled equally for better comparison
  • Figure 5: Depth Prediction on Make3D. Displayed are RGB images (first row), ground truth depth maps (middle row) and our predictions (last row). Pixels that correspond to distances $>70m$ in the ground truth are masked out
  • ...and 1 more figure
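
The equivalence described in the Figure 3 caption can be checked numerically. The sketch below is a minimal pure-Python illustration, not the authors' implementation. It assumes that unpooling writes each value to the top-left corner of its 2×2 cell and that the $5\times5$ filter is applied as a zero-padded cross-correlation. The helper names `unpool2x`, `conv_same`, and `fast_upconv` are hypothetical.

```python
import random


def unpool2x(x):
    """Double the map size; each value goes to the top-left of a 2x2 cell."""
    h, w = len(x), len(x[0])
    up = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            up[2 * i][2 * j] = x[i][j]
    return up


def conv_same(img, k):
    """Zero-padded 'same' cross-correlation with an odd-sized square kernel."""
    h, w, n = len(img), len(img[0]), len(k)
    rad = n // 2
    out = [[0.0] * w for _ in range(h)]
    for p in range(h):
        for q in range(w):
            acc = 0.0
            for a in range(n):
                for b in range(n):
                    u, v = p + a - rad, q + b - rad
                    if 0 <= u < h and 0 <= v < w:
                        acc += k[a][b] * img[u][v]
            out[p][q] = acc
    return out


def fast_upconv(x, k):
    """Same output as conv_same(unpool2x(x), k) for a 5x5 kernel,
    computed on the small map with four sub-filters and interleaved."""
    assert len(k) == 5 and all(len(row) == 5 for row in k)
    h, w = len(x), len(x[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for r in (0, 1):          # output row parity
        for s in (0, 1):      # output column parity
            # Kernel entries that can meet non-zero inputs at this parity:
            # sizes 3x3, 3x2, 2x3 and 2x2 (the A, B, C, D parts).
            sub = [row[s::2] for row in k[r::2]]
            for i in range(h):
                for j in range(w):
                    acc = 0.0
                    for a, sub_row in enumerate(sub):
                        for b, wt in enumerate(sub_row):
                            u, v = i + a - (1 - r), j + b - (1 - s)
                            if 0 <= u < h and 0 <= v < w:
                                acc += wt * x[u][v]
                    out[2 * i + r][2 * j + s] = acc  # interleave the results
    return out


# Compare the two routes on random data (sizes chosen arbitrarily).
random.seed(0)
x = [[random.random() for _ in range(4)] for _ in range(3)]
k = [[random.random() for _ in range(5)] for _ in range(5)]
slow = conv_same(unpool2x(x), k)
fast = fast_upconv(x, k)
max_diff = max(abs(a - b) for rs, rf in zip(slow, fast) for a, b in zip(rs, rf))
```

The fast route needs 25 multiplications (9 + 6 + 6 + 4) for every four output pixels, against 100 for the unpool-then-convolve route. The saved multiplications are exactly the ones by zero.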