Depth Estimation using Weighted-loss and Transfer Learning

Muhammad Adeel Hafeez, Michael G. Madden, Ganesh Sistu, Ihsan Ullah

TL;DR

The paper tackles monocular depth estimation by combining transfer learning with a weighted loss made up of MAE, edge, and SSIM terms. The authors evaluate DenseNet and EfficientNet encoders within a uniform encoder-decoder framework and show that an EfficientNet encoder with a simple upsampling decoder yields the best RMSE, REL, and $\log_{10}$ performance on NYU Depth Dataset v2, using a loss $L_{combined}$ whose weights are tuned via grid and random searches. Key contributions include the optimized loss formulation, systematic encoder comparisons, and qualitative analyses showing robustness to ground-truth imperfections. The approach offers a practical, adaptable way to improve depth estimation accuracy with readily available pre-trained classifiers.
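The three reported metrics can be illustrated with a minimal NumPy sketch. It uses the definitions commonly applied in depth estimation benchmarks; the function name `depth_metrics`, the masking of non-positive ground-truth pixels, and the `eps` guard are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-8):
    """Compute RMSE, REL and log10 error between predicted and true depth.

    Pixels with non-positive ground truth are ignored (an assumption:
    such pixels are treated here as missing measurements).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    p, g = pred[valid], gt[valid]

    # Root mean squared error
    rmse = np.sqrt(np.mean((p - g) ** 2))
    # Mean absolute relative error
    rel = np.mean(np.abs(p - g) / g)
    # Mean absolute error in log10 space; eps guards against log10(0)
    log10 = np.mean(np.abs(np.log10(np.maximum(p, eps)) - np.log10(g)))
    return {"rmse": rmse, "rel": rel, "log10": log10}

if __name__ == "__main__":
    gt = np.array([[1.0, 2.0], [4.0, 0.0]])    # 0.0 marks a missing pixel
    pred = np.array([[1.0, 2.0], [2.0, 5.0]])
    print(depth_metrics(pred, gt))
```

Lower values are better for all three metrics. REL and log10 normalize the error by the true depth, so they are less dominated by distant pixels than RMSE.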

Abstract

Depth estimation from 2D images is a common computer vision task that has applications in many fields including autonomous vehicles, scene understanding and robotics. The accuracy of a supervised depth estimation method mainly relies on the chosen loss function, the model architecture, quality of data and performance metrics. In this study, we propose a simplified and adaptable approach to improve depth estimation accuracy using transfer learning and an optimized loss function. The optimized loss function is a combination of weighted losses that enhance robustness and generalization: Mean Absolute Error (MAE), Edge Loss and Structural Similarity Index (SSIM). We use a grid search and a random search method to find optimized weights for the losses, which leads to an improved model. We explore multiple encoder-decoder-based models including DenseNet121, DenseNet169, DenseNet201, and EfficientNet for the supervised depth estimation model on NYU Depth Dataset v2. We observe that the EfficientNet model, pre-trained on ImageNet for classification, when used as an encoder with a simple upsampling decoder, gives the best results in terms of RMSE, REL and log10: 0.386, 0.113 and 0.049, respectively. We also perform a qualitative analysis which illustrates that our model produces depth maps that closely resemble ground truth, even in cases where the ground truth is flawed. The results indicate significant improvements in accuracy and robustness, with EfficientNet being the most successful architecture.
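One way to read the weighted loss is as a weighted sum of the three terms. The abstract does not give the exact formulation or the tuned weights, so the following NumPy sketch rests on stated assumptions: the weights `w_mae`, `w_edge`, `w_ssim` are hypothetical parameters, the edge term compares image gradients, the SSIM term is written as (1 - SSIM) / 2, and SSIM is computed over the whole image as a single window (standard SSIM averages over local windows). The `data_range` default of 10.0 is also an assumption.

```python
import numpy as np

def mae_loss(pred, gt):
    # Mean absolute error between predicted and true depth
    return np.mean(np.abs(pred - gt))

def edge_loss(pred, gt):
    # Mean absolute difference of vertical and horizontal depth gradients
    dy_p, dx_p = np.gradient(pred)
    dy_g, dx_g = np.gradient(gt)
    return np.mean(np.abs(dy_p - dy_g) + np.abs(dx_p - dx_g))

def global_ssim(pred, gt, data_range):
    # Simplified SSIM using image-wide statistics (single window)
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = np.mean((pred - mu_p) * (gt - mu_g))
    num = (2 * mu_p * mu_g + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2)
    return num / den

def combined_loss(pred, gt, w_mae, w_edge, w_ssim, data_range=10.0):
    """Weighted sum of MAE, edge and SSIM terms (hypothetical weights)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    ssim_term = (1.0 - global_ssim(pred, gt, data_range)) / 2.0
    return (w_mae * mae_loss(pred, gt)
            + w_edge * edge_loss(pred, gt)
            + w_ssim * ssim_term)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(0.5, 10.0, size=(8, 8))
    pred = gt + rng.normal(0.0, 0.1, size=gt.shape)
    # Illustrative weights only; the paper selects its weights by search
    print(combined_loss(pred, gt, w_mae=0.1, w_edge=1.0, w_ssim=1.0))
```

A grid or random search over the weights would call `combined_loss` as the training objective for each candidate weight triple and keep the triple whose trained model scores best on validation metrics.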

Paper Structure (13 sections, 9 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overview of the network. We implemented a simple encoder-decoder-based network with skip connections. We changed the encoder between different models while keeping the decoder constant. The depth maps produced at the output were half the size (1/2×) of the ground-truth maps.
  • Figure 2: Training and Validation Loss for EfficientNet (50 epochs).
  • Figure 3: Qualitative comparison: (a) original RGB images; (b) ground-truth depth maps; (c) depth maps predicted by DenseNet-169; (d) depth maps predicted by EfficientNet.