High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior
Wencheng Han, Jianbing Shen
TL;DR
This work tackles the practicality gap in self-supervised monocular depth estimation by introducing Rich-resource Prior Depth (RPrDepth), which uses offline rich-resource priors to guide a low-resolution single-image depth estimator. The method employs a two-branch training pipeline with a ref-dataset of rich-resource data, a Prior Depth Fusion Module to fuse prior information, a Rich-resource Guided Loss to exploit pseudo-label guidance and viewpoint consistency, and an Attention Guided Feature Selection strategy to shrink the reference-data search space. Empirically, RPrDepth achieves state-of-the-art or competitive performance on KITTI Eigen Split, Make3D, and Cityscapes, outperforming strong baselines that rely on rich-resource inputs during inference while itself using only low-resolution single-image inputs at test time. By leveraging structured priors, the approach improves robustness to moving objects and texture ambiguities, making high-accuracy depth estimation more practical for real-world deployment.
Abstract
In self-supervised monocular depth estimation, models that use rich-resource inputs, such as high-resolution and multi-frame inputs, typically achieve better performance than models that use an ordinary single-image input. However, these rich-resource inputs may not always be available, which limits the applicability of these methods in general scenarios. In this paper, we propose the Rich-resource Prior Depth estimator (RPrDepth), which requires only a single input image during the inference phase but can still produce highly accurate depth estimates comparable to those of rich-resource-based methods. Specifically, we treat rich-resource data as prior information and extract features from it offline as reference features. When estimating depth for a single image, we search for similar pixels among the rich-resource features and use them as prior information to estimate the depth. Experimental results demonstrate that our model outperforms other single-image models and achieves comparable or even better performance than models with rich-resource inputs, while using only a low-resolution single-image input.
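The retrieval step described in the abstract (searching for similar pixels among offline reference features) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `retrieve_prior_features`, the use of cosine similarity, the top-k averaging, and the feature shapes are all assumptions made for illustration, and the random arrays stand in for real network features.

```python
import numpy as np

def retrieve_prior_features(query, reference, k=1):
    """Hypothetical helper: for each query pixel feature, average its k most
    similar reference features (cosine similarity is an assumed metric).

    query:     (N, C) features from the low-resolution single image
    reference: (M, C) features extracted offline from rich-resource data
    returns:   prior (N, C) and the indices (N, k) of the matched references
    """
    eps = 1e-8  # avoids division by zero for all-zero feature vectors
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + eps)
    r = reference / (np.linalg.norm(reference, axis=1, keepdims=True) + eps)

    sim = q @ r.T                               # (N, M) cosine similarities
    top_idx = np.argsort(-sim, axis=1)[:, :k]   # (N, k) best matches first
    prior = reference[top_idx].mean(axis=1)     # (N, k, C) averaged to (N, C)
    return prior, top_idx

# Toy data: two queries that are slightly perturbed copies of reference rows.
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 64))
query = reference[[3, 42]] + 0.01 * rng.normal(size=(2, 64))

prior, idx = retrieve_prior_features(query, reference, k=1)
print(idx[:, 0], prior.shape)
```

The similarity matrix in this sketch costs O(N·M) in time and memory, which hints at why reducing the number of reference features, as the Attention Guided Feature Selection strategy is said to do, matters in practice.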
