Enhanced Encoder-Decoder Architecture for Accurate Monocular Depth Estimation

Dabbrata Das, Argho Deb Das, Farhan Sadaf

TL;DR

This paper addresses monocular depth estimation from a single image by proposing an encoder–decoder network that uses Inception-ResNet-v2 as the encoder to achieve multi-scale feature extraction. It introduces a composite loss combining depth, gradient-edge, and SSIM terms to balance accuracy and structural fidelity, achieving state-of-the-art results on NYU Depth V2 and highly efficient inference on KITTI (~0.019 s). The approach demonstrates strong indoor/outdoor generalization, with ARE ≈ 0.064, RMSE ≈ 0.228, and δ<1.25 ≈ 0.893 on NYU Depth V2, while maintaining a significant efficiency advantage over vision transformers, making it suitable for real-time applications. The work also provides ablation insights into layer-wise IRv2 features and loss-component contributions, and discusses deployment considerations and avenues for further reducing computational cost.
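The three figures quoted above (ARE, RMSE, and δ<1.25) are standard depth-estimation metrics. A minimal NumPy sketch of how they are commonly computed follows; the function name `depth_metrics`, the valid-pixel mask (ground truth > 0), and the small clamp on predictions are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def depth_metrics(pred, gt, threshold=1.25):
    """Return (ARE, RMSE, delta accuracy) over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)

    # Assumption: non-positive ground-truth values mark missing depth.
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    # Assumption: clamp predictions so the ratio below is always defined.
    pred = np.maximum(pred, 1e-6)

    are = np.mean(np.abs(pred - gt) / gt)        # Absolute Relative Error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))    # Root Mean Square Error
    ratio = np.maximum(pred / gt, gt / pred)     # symmetric depth ratio
    delta = np.mean(ratio < threshold)           # fraction of pixels within threshold
    return are, rmse, delta
```

Lower ARE and RMSE are better, while a higher δ accuracy is better; a δ<1.25 value of 0.893 means roughly 89.3% of valid pixels have a predicted-to-true depth ratio within a factor of 1.25.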

Abstract

Estimating depth from a single 2D image is a challenging task due to the lack of stereo or multi-view data, which are typically required for depth perception. In state-of-the-art architectures, the main challenge is to efficiently capture complex objects and fine-grained details, which are often difficult to predict. This paper introduces a novel deep learning-based approach using an enhanced encoder-decoder architecture, where the Inception-ResNet-v2 model serves as the encoder. This is the first instance of utilizing Inception-ResNet-v2 as an encoder for monocular depth estimation, demonstrating improved performance over previous models. It incorporates multi-scale feature extraction to enhance depth prediction accuracy across various object sizes and distances. We propose a composite loss function comprising depth loss, gradient edge loss, and Structural Similarity Index Measure (SSIM) loss, with fine-tuned weights to optimize the weighted sum, ensuring a balance across different aspects of depth estimation. Experimental results on the KITTI dataset show that our model achieves a significantly faster inference time of 0.019 seconds, outperforming vision transformers in efficiency while maintaining good accuracy. On the NYU Depth V2 dataset, the model establishes state-of-the-art performance, with an Absolute Relative Error (ARE) of 0.064, a Root Mean Square Error (RMSE) of 0.228, and an accuracy of 89.3% for $\delta < 1.25$. These metrics demonstrate that our model can accurately and efficiently predict depth even in challenging scenarios, providing a practical solution for real-time applications.
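To make the weighted-sum structure of the composite loss concrete, the sketch below combines a depth term, a gradient-edge term, and an SSIM term in NumPy. It is an illustration under stated assumptions rather than the paper's formulation: the L1 form of the depth and gradient terms, the single-window (global) SSIM, the `(1 - SSIM) / 2` scaling, and the default weights `w_depth`, `w_grad`, `w_ssim` are all hypothetical choices, since the abstract gives neither the exact terms nor the tuned weights.

```python
import numpy as np

def depth_l1(pred, gt):
    """Mean absolute difference between predicted and true depth."""
    return np.mean(np.abs(pred - gt))

def gradient_edge(pred, gt):
    """Mean absolute difference of the vertical and horizontal depth gradients."""
    dy_p, dx_p = np.gradient(pred)
    dy_g, dx_g = np.gradient(gt)
    return np.mean(np.abs(dy_p - dy_g) + np.abs(dx_p - dx_g))

def global_ssim(pred, gt, data_range=1.0):
    """SSIM computed over the whole map as one window (a simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = np.mean((pred - mu_p) * (gt - mu_g))
    num = (2 * mu_p * mu_g + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2)
    return num / den

def composite_loss(pred, gt, w_depth=1.0, w_grad=1.0, w_ssim=1.0):
    """Weighted sum of the three terms; the weights here are placeholders."""
    ssim_term = (1.0 - global_ssim(pred, gt)) / 2.0
    return (w_depth * depth_l1(pred, gt)
            + w_grad * gradient_edge(pred, gt)
            + w_ssim * ssim_term)
```

The three terms penalize different failure modes: the depth term targets per-pixel error, the gradient term targets blurred or misplaced edges, and the SSIM term targets loss of local structure. Practical SSIM implementations use sliding local windows rather than one global window.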

Paper Structure

This paper contains 27 sections, 6 equations, 14 figures, 8 tables, 3 algorithms.

Figures (14)

  • Figure 1: An outline of our network (encoder-decoder) architecture. The encoder uses a pre-trained Inception-ResNet-v2 (IRv2) network, consisting of several Inception-ResNet blocks (A, B, and C) and reduction layers. The decoder consists of convolutional layers that process the upsampled output from the previous layer, combined with the corresponding feature maps from the encoder.
  • Figure 2: The process of generating a color map by mapping depth information from a grayscale image through an encoder-decoder network.
  • Figure 3: Schematic representation of the Inception-ResNet A (IR-A), Reduction A (R-A), Inception-ResNet B (IR-B), Reduction B (R-B), and Inception-ResNet C (IR-C) blocks.
  • Figure 4: Illustration of multi-scale feature extraction in Inception-ResNet v2, showcasing parallel branches with varying filter sizes and combining features across different scales.
  • Figure 5: Layer-by-layer feature map representation within the encoder-decoder network architecture, with Inception-ResNet-v2 (IRv2) as the encoder, designed for depth map generation.
  • ...and 9 more figures
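
The decoder behaviour described in the Figure 1 caption (upsample the previous output, combine it with the matching encoder feature map, then convolve) can be sketched as a single step in NumPy. This is only an illustration: nearest-neighbour upsampling, the 1x1 convolution expressed as a matrix product, the ReLU activation, and the names `upsample2x` and `decoder_step` are assumptions, and the paper's actual kernel sizes and activations are not given in this section.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decoder_step(prev, skip, weights):
    """One decoder step: upsample, concatenate the encoder skip, 1x1 conv, ReLU.

    prev:    (H, W, C_prev) output of the previous decoder layer
    skip:    (2H, 2W, C_skip) encoder feature map at the target resolution
    weights: (C_prev + C_skip, C_out) 1x1 convolution kernel (hypothetical)
    """
    up = upsample2x(prev)
    if up.shape[:2] != skip.shape[:2]:
        raise ValueError("skip feature map must match the upsampled resolution")
    merged = np.concatenate([up, skip], axis=-1)  # stack along channels
    out = merged @ weights                        # 1x1 conv as a per-pixel matmul
    return np.maximum(out, 0.0)                   # ReLU (assumed activation)
```

For example, an input `prev` of shape (4, 4, 8) combined with a `skip` of shape (8, 8, 4) and `weights` of shape (12, 6) yields an (8, 8, 6) feature map, doubling spatial resolution while reusing fine detail from the encoder.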