Enhanced Encoder-Decoder Architecture for Accurate Monocular Depth Estimation
Dabbrata Das, Argho Deb Das, Farhan Sadaf
TL;DR
This paper addresses monocular depth estimation from a single image by proposing an encoder–decoder network that uses Inception-ResNet-v2 (IRv2) as the encoder for multi-scale feature extraction. It introduces a composite loss combining depth, gradient-edge, and SSIM terms to balance accuracy and structural fidelity, achieving state-of-the-art results on NYU Depth V2 and highly efficient inference on KITTI (~0.019 s). The approach demonstrates strong indoor/outdoor generalization, with ARE ≈ 0.064, RMSE ≈ 0.228, and δ < 1.25 accuracy ≈ 0.893 on NYU Depth V2, while maintaining a significant efficiency advantage over vision transformers, making it suitable for real-time applications. The work also provides ablation insights into layer-wise IRv2 features and loss-component contributions, and discusses deployment considerations and avenues for further reducing computational cost.
Abstract
Estimating depth from a single 2D image is a challenging task due to the lack of stereo or multi-view data, which are typically required for depth perception. In state-of-the-art architectures, the main challenge is to efficiently capture complex objects and fine-grained details, which are often difficult to predict. This paper introduces a novel deep learning-based approach using an enhanced encoder-decoder architecture, where the Inception-ResNet-v2 model serves as the encoder. This is the first instance of utilizing Inception-ResNet-v2 as an encoder for monocular depth estimation, and it demonstrates improved performance over previous models. The architecture incorporates multi-scale feature extraction to enhance depth prediction accuracy across various object sizes and distances. We propose a composite loss function, a weighted sum of depth loss, gradient edge loss, and Structural Similarity Index Measure (SSIM) loss, with the weights tuned to balance the different aspects of depth estimation. Experimental results on the KITTI dataset show that our model achieves an inference time of 0.019 seconds, significantly faster than vision transformers, while maintaining good accuracy. On the NYU Depth V2 dataset, the model establishes state-of-the-art performance, with an Absolute Relative Error (ARE) of 0.064, a Root Mean Square Error (RMSE) of 0.228, and an accuracy of 89.3% for $\delta < 1.25$. These metrics demonstrate that our model can accurately and efficiently predict depth even in challenging scenarios, providing a practical solution for real-time applications.
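For readers unfamiliar with the three metrics quoted above, the following is a minimal sketch of how ARE, RMSE, and threshold accuracy are conventionally computed in the depth-estimation literature. It is not taken from the paper's code: the function name `depth_metrics`, the flattened-list inputs, and the masking of non-positive values are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def depth_metrics(
    pred: Sequence[float],
    gt: Sequence[float],
    threshold: float = 1.25,
) -> Dict[str, float]:
    """Compute ARE, RMSE, and threshold accuracy on flattened depth maps.

    `pred` and `gt` are hypothetical inputs: predicted and ground-truth
    depths, flattened to one value per pixel and given in the same unit.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")

    # Assumption: pixels with non-positive depth are treated as invalid
    # and skipped, which also keeps the ratios below well defined.
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)

    # Absolute Relative Error: mean of |pred - gt| / gt.
    are = sum(abs(p - g) / g for p, g in pairs) / n

    # Root Mean Square Error: square root of the mean squared difference.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)

    # Threshold accuracy: fraction of pixels whose ratio
    # max(pred / gt, gt / pred) is strictly below the threshold.
    accuracy = sum(1 for p, g in pairs if max(p / g, g / p) < threshold) / n

    return {"are": are, "rmse": rmse, "accuracy": accuracy}


if __name__ == "__main__":
    # Toy values chosen only to exercise the function.
    predicted = [1.0, 2.2, 3.0, 5.0]
    ground_truth = [1.0, 2.0, 4.0, 4.0]
    print(depth_metrics(predicted, ground_truth))
```

Lower ARE and RMSE indicate better predictions, while higher threshold accuracy is better; under this definition, the reported 89.3% would mean that roughly nine in ten evaluated pixels fall within a factor of 1.25 of the ground truth.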
