Photorealistic Phantom Roads in Real Scenes: Disentangling 3D Hallucinations from Physical Geometry
Hoang Nguyen, Xiaohao Xu, Xiaonan Huang
TL;DR
Monocular depth foundation models often hallucinate non-existent 3D structures when faced with perceptual ambiguity, posing safety risks for real-world perception. The authors introduce 3D-Mirage, a real-image benchmark with precise planar region-of-interest (ROI) annotations and context-restricted crops, and two Laplacian-based metrics to quantify these failures: the Deviation Composite Score (DCS) for geometric hallucinations and the Confusion Composite Score (CCS) for contextual instability. They propose Grounded Self-Distillation, which uses LoRA adapters to surgically enforce planarity in illusion ROIs while preserving background knowledge, and report substantial reductions in both DCS and CCS with minimal parameter updates. In their experiments, the approach tames 3D mirages while preserving general depth understanding, highlighting the need to shift monocular depth estimation (MDE) evaluation toward structural and contextual robustness for safety-critical deployment.
Abstract
Monocular depth foundation models achieve remarkable generalization by learning large-scale semantic priors, but this creates a critical vulnerability: they hallucinate illusory 3D structures from geometrically planar but perceptually ambiguous inputs. We term this failure the 3D Mirage. This paper introduces the first end-to-end framework to probe, quantify, and tame this previously unquantified safety risk. To probe, we present 3D-Mirage, the first benchmark of real-world illusions (e.g., street art) with precise planar-region annotations and context-restricted crops. To quantify, we propose a Laplacian-based evaluation framework with two metrics: the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. To tame, we introduce Grounded Self-Distillation, a parameter-efficient strategy that surgically enforces planarity on illusion regions of interest (ROIs) while using a frozen teacher to preserve background knowledge, thus avoiding catastrophic forgetting. Our work provides the essential tools to diagnose and mitigate this phenomenon, urging a necessary shift in monocular depth estimation (MDE) evaluation from pixel-wise accuracy to structural and contextual robustness. Our code and benchmark will be publicly available to foster this exciting research direction.
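To make the Laplacian idea concrete: any map that is affine in pixel coordinates has a zero discrete Laplacian, so nonzero Laplacian response inside a region annotated as planar signals spurious curvature. The sketch below is only an illustration of that principle and is not the DCS or CCS definition, which the abstract does not spell out. The function names `laplacian_5pt` and `roi_nonplanarity`, the mean-absolute aggregation, and the synthetic maps are all assumptions made for this example. It also uses an affine map as a stand-in for a planar prediction; under perspective projection it is inverse depth, rather than depth, that is affine in pixel coordinates for a plane, and the abstract does not say which quantity the metrics operate on.

```python
import numpy as np


def laplacian_5pt(z):
    """5-point discrete Laplacian on interior pixels; border pixels are left at 0."""
    z = np.asarray(z, dtype=np.float64)
    lap = np.zeros_like(z)
    lap[1:-1, 1:-1] = (
        z[:-2, 1:-1] + z[2:, 1:-1] + z[1:-1, :-2] + z[1:-1, 2:]
        - 4.0 * z[1:-1, 1:-1]
    )
    return lap


def roi_nonplanarity(z, roi_mask):
    """Mean absolute Laplacian over ROI pixels.

    Hypothetical score for illustration only; it is NOT the paper's DCS.
    """
    roi_mask = np.asarray(roi_mask, dtype=bool)
    lap = laplacian_5pt(z)
    # Ignore image-border pixels, where the stencil is undefined.
    interior = np.zeros_like(roi_mask, dtype=bool)
    interior[1:-1, 1:-1] = True
    selected = roi_mask & interior
    if not selected.any():
        return 0.0
    return float(np.abs(lap[selected]).mean())


# Synthetic stand-ins for predicted maps (assumed data, not from the benchmark).
ys, xs = np.mgrid[0:32, 0:32]
plane = 0.02 * xs + 0.01 * ys + 1.0  # affine map: zero Laplacian
bump = plane + 0.5 * np.exp(-((xs - 16) ** 2 + (ys - 16) ** 2) / 20.0)  # "phantom" bulge

roi = np.zeros((32, 32), dtype=bool)
roi[8:24, 8:24] = True  # region annotated as planar

flat_score = roi_nonplanarity(plane, roi)
bump_score = roi_nonplanarity(bump, roi)
print(flat_score, bump_score)
```

In this toy setting the affine map scores at floating-point zero while the bulged map scores clearly above it, which is the kind of separation a planarity metric restricted to annotated ROIs is meant to expose.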
