DepthCues: Evaluating Monocular Depth Perception in Large Vision Models

Duolikun Danier, Mehmet Aygün, Changjian Li, Hakan Bilen, Oisin Mac Aodha

TL;DR

This work investigates whether large, pre-trained vision models implicitly grasp human monocular depth cues without explicit depth supervision. It introduces DepthCues, a six-cue benchmark (elevation, light-shadow, occlusion, perspective, size, texture gradient) with datasets and probing protocols to evaluate cue understanding via frozen-model features. Across 20 diverse models, the study finds that depth-cue awareness tends to improve with model scale and newer pre-training methods, and that fine-tuning on DepthCues with LoRA can enhance downstream depth estimation even with sparse supervision. The findings highlight a pathway to boost depth perception in vision models by injecting human-like depth priors and provide a public benchmark to drive future research in depth-aware perception.

Abstract

Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, in particular in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision provided during pre-training. To investigate this, we examine whether monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent, larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.

Paper Structure

This paper contains 31 sections, 4 equations, 20 figures, and 8 tables.

Figures (20)

  • Figure 1: Human-like monocular depth cues emerge in large vision models. We present DepthCues, a comprehensive benchmark suite designed to probe the understanding of human monocular depth cues in vision models. We analyse a diverse set of vision models and find that recent self-supervised and geometry estimation models demonstrate a notably stronger grasp of these cues, even in cases where models (e.g., DINOv2) were not explicitly pre-trained on any depth-related tasks.
  • Figure 2: Overview of DepthCues. Monocular depth cues, the associated tasks, and example instances from our proposed benchmark.
  • Figure 3: DepthCues Benchmark Results. We evaluate 20 vision models with diverse pre-training settings (indicated by color) on the DepthCues benchmark, which assesses six monocular depth cues (one per row) that are ubiquitous in human vision. The models are ranked by their average performance across the six cues. We include an end-to-end trained baseline (blue dotted line) as an oracle and a trivial baseline (red dotted line) to mark floor performance. Additionally, linear probing results for depth estimation on NYUv2 are shown in the bottom row.
  • Figure 4: Task correlation. We measure the Spearman rank-order correlation between each pair of tasks in the DepthCues benchmark, and how they correlate with the depth estimation and image classification performance of the models. A correlation score of one indicates that two tasks rank the models identically.
  • Figure A1: Performance of vision models on DepthCues vs. NYUv2 depth estimation. A strong correlation is observed between depth cue understanding and depth estimation.
  • ...and 15 more figures
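
The rank correlation behind Figure 4 can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names and the per-model scores below are hypothetical and chosen only to show how a Spearman rank-order correlation compares two tasks' rankings of the same set of models. Tied scores receive the average of the ranks they span.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


def spearman(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(scores_a), average_ranks(scores_b))


# Hypothetical scores for five models on two tasks (illustrative values only).
cue_task_scores = [0.61, 0.58, 0.80, 0.74, 0.90]
depth_task_scores = [55.0, 52.0, 70.0, 66.0, 82.0]
rho = spearman(cue_task_scores, depth_task_scores)
```

Here both hypothetical tasks order the five models identically, so `rho` equals one, matching the caption's reading that a score of one means the same ranking of models on the two tasks.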