OrchardDepth: Precise Metric Depth Estimation of Orchard Scene from Monocular Camera Images
Zhichao Zheng, Henry Williams, Bruce A MacDonald
TL;DR
OrchardDepth addresses the lack of monocular metric depth estimation methods for orchard/vineyard scenes by introducing a dense-sparse supervision framework that enforces consistency between a predicted dense depth map and sparse LiDAR points. The method projects sparse depths into a mean canonical camera space, uses a ViT-Base encoder with a DPT decoder, and optimizes a composite loss that couples dense and sparse information via the L_silog and L_con terms. Key contributions include the Mean Camera Space Transform, cross-dataset training on KITTI and orchard data, and a loss formulation with learnable weights. Together these reduce the RMSE in orchard environments from $1.5337$ to $0.6738$, which suggests good cross-domain generalization. The approach is relevant to agricultural robotics because it enables accurate depth perception from monocular cameras in rural settings, with the potential to improve obstacle avoidance and autonomous operation in orchards and vineyards.
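The coupling of dense and sparse supervision can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the standard scale-invariant log form for L_silog, an L1 form for L_con evaluated at the sparse LiDAR pixels, and fixed scalar weights standing in for the learnable weights. The names `silog_loss`, `consistency_loss`, `total_loss`, `variance_focus`, `w_silog`, and `w_con` are hypothetical.

```python
import math

def silog_loss(pred, target, variance_focus=0.85):
    """Scale-invariant log loss over pixels with valid depth (> 0).

    variance_focus is an assumed hyperparameter, not a value from the paper.
    """
    diffs = [math.log(p) - math.log(t)
             for p, t in zip(pred, target) if p > 0 and t > 0]
    if not diffs:
        return 0.0
    mean_sq = sum(d * d for d in diffs) / len(diffs)
    mean = sum(diffs) / len(diffs)
    # Clamp at zero to absorb tiny negative values caused by rounding.
    return math.sqrt(max(mean_sq - variance_focus * mean * mean, 0.0))

def consistency_loss(dense_pred, sparse_points):
    """Mean absolute gap between the dense map and the sparse depths.

    dense_pred is a 2D list indexed [row][col]; sparse_points is a list of
    (row, col, depth) tuples. The L1 form is assumed for illustration only.
    """
    if not sparse_points:
        return 0.0
    gaps = [abs(dense_pred[r][c] - d) for r, c, d in sparse_points]
    return sum(gaps) / len(gaps)

def total_loss(dense_pred, dense_target, sparse_points,
               w_silog=1.0, w_con=1.0):
    """Weighted sum of both terms; fixed weights replace learnable ones here."""
    flat_pred = [v for row in dense_pred for v in row]
    flat_target = [v for row in dense_target for v in row]
    return (w_silog * silog_loss(flat_pred, flat_target)
            + w_con * consistency_loss(dense_pred, sparse_points))
```

With `variance_focus=1.0` the first term ignores any global scale factor between prediction and target. That is why a second term tied to metric LiDAR depths is a natural complement when metric accuracy is the goal.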
Abstract
Monocular depth estimation is a fundamental task in robotic perception. With the development of more accurate and robust neural network models and a wider variety of datasets, monocular depth estimation has recently improved significantly in both performance and efficiency. However, most research in this area focuses on a narrow set of domains. In particular, most outdoor benchmarks cover urban environments and are aimed at autonomous driving. These benchmarks differ greatly from orchard/vineyard environments, so they are of little help for research in the primary industries. We therefore propose OrchardDepth, which fills the gap in metric depth estimation from a monocular camera in orchard/vineyard environments. In addition, we present a new retraining method that improves the training result by enforcing consistency regularization between dense depth maps and sparse points. Our method reduces the RMSE of depth estimation in the orchard environment from 1.5337 to 0.6738, demonstrating its effectiveness.
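For reference, the RMSE figures quoted above can be read against a minimal sketch of the metric. The sketch assumes that pixels without ground truth are marked with a depth of 0 and are excluded, a common convention for sparse LiDAR ground truth that is not confirmed by this text. The function names `rmse` and `relative_reduction` are hypothetical.

```python
import math

def rmse(pred, target):
    """Root mean squared error over pixels with valid ground truth (> 0)."""
    sq_errors = [(p - t) ** 2 for p, t in zip(pred, target) if t > 0]
    if not sq_errors:
        raise ValueError("no valid ground-truth depths")
    return math.sqrt(sum(sq_errors) / len(sq_errors))

def relative_reduction(before, after):
    """Fraction by which an error metric decreased."""
    return (before - after) / before

# The reported change from 1.5337 to 0.6738 is a reduction of roughly 56%.
reported = relative_reduction(1.5337, 0.6738)
```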
