OrchardDepth: Precise Metric Depth Estimation of Orchard Scene from Monocular Camera Images

Zhichao Zheng, Henry Williams, Bruce A MacDonald

TL;DR

OrchardDepth addresses the lack of monocular metric depth estimation methods for orchard/vineyard scenes by introducing a dense-sparse supervision framework that enforces consistency between a dense depth map and sparse LiDAR points. The method projects sparse depths into a mean canonical camera space, uses a ViT-Base encoder with a DPT decoder, and optimizes a composite loss that couples dense and sparse supervision through the $L_{silog}$ and $L_{con}$ terms. Key contributions include the Mean Camera Space Transform, cross-dataset training on KITTI and orchard data, and a loss formulation with learnable weights. Together these reduce RMSE in orchard environments from 1.5337 to 0.6738, suggesting that the model generalizes across domains. The approach is aimed at agricultural robotics, where accurate depth perception from a monocular camera could improve obstacle avoidance and autonomous operation in orchards and vineyards.

Abstract

Monocular depth estimation is a fundamental task in robotic perception. Recently, with the development of more accurate and robust neural network models and more varied datasets, monocular depth estimation has improved significantly in both performance and efficiency. However, most research in this area focuses on a narrow set of domains. In particular, most outdoor benchmarks cover urban environments and target autonomous driving; these benchmarks differ greatly from orchard/vineyard environments and are therefore of little help for research in the primary industry. We therefore propose OrchardDepth, which fills the gap in monocular metric depth estimation for orchard/vineyard environments. In addition, we present a new retraining method that improves training results by applying a consistency regularization between dense depth maps and sparse points. Our method improves the RMSE of depth estimation in the orchard environment from 1.5337 to 0.6738, demonstrating its validity.

Paper Structure

This paper contains 13 sections, 12 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Scene Display - Illustrates the disparity between orchard and city scenes. (Top) Image sample from the KITTI depth dataset. (Bottom) Image captured in an orchard in the US.
  • Figure 2: Data Preparation - An illustration of the calibration and projection of different sensors. (a) Combined LiDAR points acquired from three LiDAR sensors. (b) Image captured from the center camera. (c) Combined LiDAR points projected into camera space.
  • Figure 3: Pipeline - Top: Training with the custom dataset. The depth ground truth is obtained from the combined points of three LiDAR sensors; these points are projected into the camera coordinate system and then transformed into the mean camera space. Bottom: Training with the KITTI dataset. We first generate dense depth from the stereo camera images. Then, we use the left image as input to predict the depth map, convert the prediction back to the acquisition camera coordinates, compute the loss of the predicted depth against the dense depth and against the KITTI ground-truth points separately, and use $L_{con}$ to supervise the dense-sparse consistency.
  • Figure 4: Visual results of metric depth estimation. The images are shown in the order: RGB input, prediction with SiLog loss, prediction with consistency loss. The two predictions differ visibly: the model trained with the consistency loss produces smoother depth between points.
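
The pipeline in Figure 3 relies on two ingredients that can be sketched briefly: rescaling depth into a shared canonical camera space and back, and the scale-invariant log (SiLog) loss evaluated only where ground truth exists. The NumPy sketch below is illustrative and is not the authors' implementation. The focal-ratio rescaling, the function names, and the default `lam=0.5` are assumptions; in particular, it assumes the mean camera space is defined by a single canonical focal length (for example, the mean focal length over the training cameras). The exact form of $L_{con}$ and the learnable loss weights are not given in this summary, so they are not reproduced here.

```python
import numpy as np

def to_canonical_depth(depth, focal_px, canonical_focal_px):
    """Rescale metric depth into a canonical camera space.

    Assumption: the transform is a focal-length ratio, so a camera with a
    shorter focal length than the canonical one has its depths scaled up.
    """
    return depth * (canonical_focal_px / focal_px)

def from_canonical_depth(depth_c, focal_px, canonical_focal_px):
    """Inverse of to_canonical_depth: back to the acquisition camera."""
    return depth_c * (focal_px / canonical_focal_px)

def silog_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant log loss over pixels with valid ground truth.

    Pixels with target <= 0 are treated as missing (typical for sparse
    LiDAR depth maps) and are excluded from the loss.
    """
    mask = target > 0
    d = np.log(pred[mask] + eps) - np.log(target[mask] + eps)
    # Clamp at zero to guard against tiny negative values from round-off.
    var_term = np.maximum(np.mean(d ** 2) - lam * np.mean(d) ** 2, 0.0)
    return float(np.sqrt(var_term))

# Hypothetical example: a sparse 2x3 ground-truth map (0 = no LiDAR return).
gt = np.array([[0.0, 4.0, 0.0],
               [8.0, 0.0, 2.0]])
pred = np.array([[3.0, 4.0, 5.0],
                 [8.0, 6.0, 2.0]])
loss_at_valid_points = silog_loss(pred, gt)  # only the 3 valid pixels count
```

With `lam=1` the loss ignores any global scale error in the prediction, while `lam=0` reduces it to a plain RMSE in log space; intermediate values trade off the two, which matters for metric (rather than relative) depth.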