ViewpointDepth: A New Dataset for Monocular Depth Estimation Under Viewpoint Shifts

Aurel Pjetri, Stefano Caprasecca, Leonardo Taccari, Matteo Simoncini, Henrique Piñeiro Monteagudo, Wallace Walter, Douglas Coimbra de Andrade, Francesco Sambo, Andrew David Bagdanov

TL;DR

This work tackles monocular depth estimation under viewpoint shifts. It introduces a ground-truth strategy that combines homography estimation with object detection to estimate distances without LIDAR, computed as $d=\sqrt{x^2+y^2+h^2}$. It also releases a new multi-view road-scene dataset collected with two calibrated dashcams across 10 viewpoints, enabling analysis of how camera position and orientation affect depth predictions. The ground truth is validated on KITTI, where the homography-based GT correlates strongly with LIDAR (Spearman correlation of $0.97$) and achieves competitive abs-rel relative to LIDAR-based GT. Experiments with the MonoViT depth model show that certain viewpoint configurations, especially those combining pitch with yaw or roll, significantly degrade performance, and that depth-scale distortion correlates with accuracy loss, which suggests inference-time scaling as a promising mitigation.
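
As a rough illustration of the distance formula above, the following sketch maps an image point to ground-plane coordinates with a 3x3 homography and then evaluates $d=\sqrt{x^2+y^2+h^2}$. It assumes that $x$ and $y$ are metric ground-plane coordinates produced by the homography and that $h$ is the camera height above that plane; the summary does not define these symbols, so this reading is an assumption. The function names and the example matrix are hypothetical and are not taken from the paper.

```python
import numpy as np


def apply_homography(H, u, v):
    """Map pixel (u, v) to plane coordinates (x, y) using a 3x3 homography H."""
    p = np.asarray(H, dtype=float) @ np.array([u, v, 1.0])
    # Divide by the homogeneous coordinate to get Euclidean plane coordinates.
    return p[0] / p[2], p[1] / p[2]


def distance_from_plane_point(x, y, h):
    """Evaluate d = sqrt(x^2 + y^2 + h^2).

    Assumption: (x, y) are metric ground-plane coordinates and h is the
    camera height above the ground plane.
    """
    return float(np.sqrt(x ** 2 + y ** 2 + h ** 2))


# Hypothetical homography: a pure scaling from pixels to metres.
H_example = np.diag([0.01, 0.01, 1.0])
x, y = apply_homography(H_example, 300.0, 400.0)
d = distance_from_plane_point(x, y, h=1.2)
print(f"x={x:.2f} m, y={y:.2f} m, d={d:.2f} m")
```

In practice the homography would be estimated from manually labeled correspondences rather than written by hand; the scaling matrix here only keeps the example self-contained.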

Abstract

Monocular depth estimation is a critical task for autonomous driving and many other computer vision applications. While significant progress has been made in this field, the effects of viewpoint shifts on depth estimation models remain largely underexplored. This paper introduces a novel dataset and evaluation methodology to quantify the impact of different camera positions and orientations on monocular depth estimation performance. We propose a ground truth strategy based on homography estimation and object detection, eliminating the need for expensive LIDAR sensors. We collect a diverse dataset of road scenes from multiple viewpoints and use it to assess the robustness of a modern depth estimation model to geometric shifts. After assessing the validity of our strategy on a public dataset, we provide valuable insights into the limitations of current models and highlight the importance of considering viewpoint variations in real-world applications.

Paper Structure

This paper contains 18 sections, 2 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: (a) Manually labeled source points for the homography. (b) Target points for the homography with metric distances.
  • Figure 2: (a) Bounding box of a vehicle with the point $\hat{X}$ used for GT distance. (b) Depth prediction. The original box was resized to $\alpha=75\%$ to extract the percentile $\beta$ of the prediction.
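
The box-based readout described in the Figure 2 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that "resized to $\alpha$" means scaling each side of the box by $\alpha$ about its center, and that the predicted distance for the object is the $\beta$-th percentile of the depth values inside the shrunken box. The caption does not state the value of $\beta$, so the default below is a placeholder, and the function name `box_depth` is hypothetical.

```python
import numpy as np


def box_depth(depth, box, alpha=0.75, beta=50.0):
    """Return the beta-th percentile of `depth` inside a box shrunk by alpha.

    depth: 2D array of predicted depths, shape (height, width).
    box:   (x1, y1, x2, y2) in pixel coordinates.
    alpha: side-length scale factor applied about the box center (assumption).
    beta:  percentile in [0, 100]; the default is a placeholder, not a
           value taken from the paper.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = alpha * (x2 - x1) / 2.0
    half_h = alpha * (y2 - y1) / 2.0

    # Clip the shrunken box to the image bounds.
    height, width = depth.shape
    xa = max(int(round(cx - half_w)), 0)
    xb = min(int(round(cx + half_w)), width)
    ya = max(int(round(cy - half_h)), 0)
    yb = min(int(round(cy + half_h)), height)

    crop = depth[ya:yb, xa:xb]
    if crop.size == 0:
        raise ValueError("shrunken box does not overlap the depth map")
    return float(np.percentile(crop, beta))


# Toy depth map: a near object (5 m) surrounded by far background (100 m).
toy_depth = np.full((100, 100), 100.0)
toy_depth[25:75, 25:75] = 5.0
print(box_depth(toy_depth, (20, 20, 80, 80)))
```

Shrinking the box before taking the percentile reduces the influence of background pixels near the box border, which is presumably why the caption mentions resizing the original box.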