Accuracy Does Not Guarantee Human-Likeness in Monocular Depth Estimators

Yuki Kubota, Taiki Fukiage

TL;DR

This work questions the assumption that higher accuracy in monocular depth estimation yields more human-like perception. By collecting absolute-depth judgments from humans on KITTI and evaluating 69 diverse DNNs, the authors show an inverted-U relationship: human-likeness rises with accuracy until models reach near-human accuracy, after which further accuracy gains move model errors away from human biases. An affine decomposition reveals consistent bias patterns across humans and models, indicating shared perceptual priors, yet high-performing models often adopt strategies that differ from human vision. The study argues for multidimensional, human-centered evaluation to assess robustness and interpretability in outdoor 3D vision systems and provides datasets and code to catalyze further research.

Abstract

Monocular depth estimation is a fundamental capability for real-world applications such as autonomous driving and robotics. Although deep neural networks (DNNs) have achieved superhuman accuracy on physically based benchmarks, a key challenge remains: aligning model representations with human perception, a promising strategy for enhancing model robustness and interpretability. Research in object recognition has revealed a complex trade-off between model accuracy and human-like behavior, raising the question of whether a similar divergence exists in depth estimation, particularly for natural outdoor scenes where benchmarks rely on sensor-based ground truth rather than human perceptual estimates. In this study, we systematically investigated the relationship between model accuracy and human similarity across 69 monocular depth estimators using the KITTI dataset. To dissect the structure of error patterns on a factor-by-factor basis, we applied affine fitting to decompose prediction errors into interpretable components. Intriguingly, our results reveal that, while humans and DNNs share certain estimation biases (positive error correlations), there are distinct trade-off relationships between model accuracy and human similarity. This finding indicates that improving accuracy does not necessarily lead to more human-like behavior, underscoring the necessity of developing multifaceted, human-centric evaluations beyond traditional accuracy.
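
The affine fitting mentioned in the abstract can be illustrated with a short least-squares sketch. This is not the authors' code. It assumes that, for each image, the estimated depth at a probe point is modeled as an affine function of the image coordinates and the ground-truth depth, with a horizontal shear $a_x$, a vertical shear $a_y$, a depth scale $a_z$, and a shift $b$, and that whatever the fit cannot explain is the residual error. The function name `fit_affine`, its arguments, and the coordinate convention are hypothetical.

```python
import numpy as np

def fit_affine(x, y, z_true, z_est):
    """Least-squares fit of z_est ~ a_x*x + a_y*y + a_z*z_true + b.

    Assumed model form (not confirmed by the source): x and y are image
    coordinates of the probe points, z_true is the ground-truth depth,
    and z_est is the depth reported by a human or a DNN.
    """
    x, y, z_true, z_est = (np.asarray(v, dtype=float)
                           for v in (x, y, z_true, z_est))
    # One column per affine component, plus a constant column for the shift.
    design = np.column_stack([x, y, z_true, np.ones_like(z_true)])
    coef, *_ = np.linalg.lstsq(design, z_est, rcond=None)
    # The residual is the part of the estimate the affine model leaves unexplained.
    residual = z_est - design @ coef
    a_x, a_y, a_z, b = coef
    return {"a_x": a_x, "a_y": a_y, "a_z": a_z, "b": b, "residual": residual}
```

Fitting the same model separately to human and DNN estimates would give one coefficient set per image and per observer, which is what allows the bias components to be compared one at a time.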

Paper Structure

This paper contains 25 sections, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Overview of our research. We first constructed a human depth evaluation dataset based on an absolute depth estimation task. Next, we collected 69 pre-trained deep neural networks (DNNs) for monocular depth estimation and compared their depth estimation errors with human evaluations using partial correlation analysis. We found trade-off relationships between model accuracy (horizontal axis) and human similarity of error patterns (vertical axis). To further disentangle the sources of these errors, we conducted an affine decomposition, evaluating systematic, per-image distortions of humans and DNNs ($^{*}: p<0.05$, $^{**}: p<0.01$, and $^{***}: p<0.001$).
  • Figure 2: Comparison of human and DNN error profiles for absolute (left) and scale-recovered (right) data. (A) RMSE of raw errors against the physical ground-truth depth. (B) Pearson partial correlations used to quantify similarity between humans and DNNs.
  • Figure 3: Scatter plots showing the relationship between model accuracy, quantified by scale-shift invariant RMSE, and human similarity for different components of the affine model. The analysis is presented separately for (A) 36 models that predict absolute depth values and (B) 69 models that predict scale-invariant depth ($^{*}: p<0.05$, $^{**}: p<0.01$, and $^{***}: p<0.001$).
  • Figure 4: Comparison of human and DNN depth estimation biases, illustrated with examples of characteristic horizontal ($a_x$) and vertical ($a_y$) shear patterns in human depth estimates. The top two rows show images with the largest positive and negative $a_x$ values ($6.67$ and $-6.55$), while the bottom two rows show images with the maximum and minimum $a_y$ values ($-5.54$ and $-37.7$).
  • Figure B1: Affine coefficients and residual error for humans and 36 DNNs that output absolute depth. The figure consists of five subplots: scale component ($a_\textrm{z}$), shift component ($b$), horizontal shear component ($a_\textrm{x}$), vertical shear component ($a_\textrm{y}$), and the residual error component.
  • ...and 7 more figures
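
The two quantities plotted against each other in Figure 3, scale-shift invariant RMSE and human similarity measured by Pearson partial correlation, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the alignment is done by ordinary least squares in linear depth, and the partial correlation controls for ground-truth depth only. The captions do not specify either choice, and the function names `ssi_rmse` and `partial_corr` are hypothetical.

```python
import numpy as np

def ssi_rmse(pred, gt):
    """RMSE after the best least-squares scale and shift of pred onto gt.

    Assumption: alignment is performed in linear depth.
    """
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    design = np.column_stack([pred, np.ones_like(pred)])
    (scale, shift), *_ = np.linalg.lstsq(design, gt, rcond=None)
    return float(np.sqrt(np.mean((scale * pred + shift - gt) ** 2)))

def partial_corr(a, b, control):
    """Pearson correlation of a and b after regressing out control.

    Assumption: the control variable is the ground-truth depth, so the
    result reflects shared error patterns rather than shared accuracy.
    """
    a, b, control = (np.asarray(v, dtype=float) for v in (a, b, control))
    design = np.column_stack([control, np.ones_like(control)])
    # Residualize each variable on the control (with an intercept).
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(res_a, res_b)[0, 1])
```

Under these assumptions, a model whose estimates differ from the ground truth only by a global scale and shift gets an `ssi_rmse` of zero, and a model that shares a human observer's deviations from the ground truth gets a `partial_corr` close to one with that observer.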