DepthCues: Evaluating Monocular Depth Perception in Large Vision Models
Duolikun Danier, Mehmet Aygün, Changjian Li, Hakan Bilen, Oisin Mac Aodha
TL;DR
This work investigates whether large pre-trained vision models implicitly learn the monocular depth cues used by human vision, without explicit depth supervision. It introduces DepthCues, a benchmark covering six cues (elevation, light-shadow, occlusion, perspective, size, texture gradient), with datasets and probing protocols that evaluate cue understanding from frozen-model features. Across 20 diverse models, the study finds that depth-cue awareness tends to improve with model scale and newer pre-training methods, and that fine-tuning on DepthCues with LoRA can improve downstream depth estimation even with only sparse supervision. The findings suggest that injecting human-like depth priors is one way to improve depth perception in vision models, and the benchmark is to be released publicly to support future research on depth-aware perception.
Abstract
Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, particularly in the context of depth perception. However, it remains unclear how depth perception arises in these models when no explicit depth supervision is provided during pre-training. To investigate this, we examine whether monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent, larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that this improves depth estimation even without dense depth supervision. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.
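To make the idea of probing frozen-model features concrete, the following is a minimal sketch of a linear probe, not the protocol used in the paper. It assumes the features have already been extracted from a frozen backbone; here they are replaced by synthetic vectors, and the binary label stands in for a cue-specific target (for example, which of two objects occludes the other). The function names `fit_linear_probe` and `probe_accuracy` are hypothetical, and the closed-form ridge classifier is a simplification chosen so the example needs only NumPy.

```python
import numpy as np


def fit_linear_probe(features, labels, l2=1e-3):
    """Fit a ridge-regularised linear classifier on fixed (frozen) features.

    Hypothetical helper: the backbone is never updated, only the weights W.
    """
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    # One-hot encode the integer class labels.
    Y = np.eye(int(labels.max()) + 1)[labels]
    # Closed-form ridge solution: (X^T X + l2 * I) W = X^T Y.
    return np.linalg.solve(X.T @ X + l2 * np.eye(X.shape[1]), X.T @ Y)


def probe_accuracy(W, features, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return float((np.argmax(X @ W, axis=1) == labels).mean())


# Synthetic stand-in for frozen features (assumption: real features would
# come from a pre-trained vision model evaluated with gradients disabled).
rng = np.random.default_rng(0)
n, d = 400, 16
labels = rng.integers(0, 2, size=n)
direction = rng.normal(size=d)
features = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, direction)

# Train the probe on the first 300 samples and evaluate on the remaining 100.
W = fit_linear_probe(features[:300], labels[:300])
test_acc = probe_accuracy(W, features[300:], labels[300:])
print(f"held-out probe accuracy: {test_acc:.2f}")
```

Under this setup, a high held-out accuracy indicates that the label is linearly recoverable from the features, which is the sense in which a probe measures what a frozen representation already encodes.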
