Photorealistic Phantom Roads in Real Scenes: Disentangling 3D Hallucinations from Physical Geometry

Hoang Nguyen, Xiaohao Xu, Xiaonan Huang

TL;DR

Monocular depth foundation models often hallucinate non-existent 3D structures when faced with perceptual ambiguity, posing safety risks for real-world perception. The authors introduce 3D-Mirage, a real-image benchmark with precise planar ROIs and context-restricted crops, and two Laplacian-based metrics (DCS and CCS) to quantify geometric hallucinations and contextual instability. They propose Grounded Self-Distillation using LoRA adapters to surgically enforce planarity in illusion ROIs while preserving background knowledge, achieving substantial reductions in both DCS and CCS with minimal parameter updates. Across experiments, this approach demonstrates robust taming of 3D mirages and preserves general depth understanding, highlighting the need to shift MDE evaluation toward structural and contextual robustness for safety-critical deployment.

Abstract

Monocular depth foundation models achieve remarkable generalization by learning large-scale semantic priors, but this creates a critical vulnerability: they hallucinate illusory 3D structures from geometrically planar but perceptually ambiguous inputs. We term this failure the 3D Mirage. This paper introduces the first end-to-end framework to probe, quantify, and tame this unquantified safety risk. To probe, we present 3D-Mirage, the first benchmark of real-world illusions (e.g., street art) with precise planar-region annotations and context-restricted crops. To quantify, we propose a Laplacian-based evaluation framework with two metrics: the Deviation Composite Score (DCS) for spurious non-planarity and the Confusion Composite Score (CCS) for contextual instability. To tame this failure, we introduce Grounded Self-Distillation, a parameter-efficient strategy that surgically enforces planarity on illusion ROIs while using a frozen teacher to preserve background knowledge, thus avoiding catastrophic forgetting. Our work provides the essential tools to diagnose and mitigate this phenomenon, urging a necessary shift in MDE evaluation from pixel-wise accuracy to structural and contextual robustness. Our code and benchmark will be publicly available to foster this exciting research direction.
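
To illustrate why a Laplacian is a natural probe for spurious non-planarity, the sketch below applies a discrete 5-point Laplacian to a synthetic map that varies affinely across the pixel grid and to the same map with a bump added. This is not the paper's DCS or CCS definition, which is not given in this section; the helper names, the synthetic maps, and the mean-absolute-value summary are all assumptions made for illustration.

```python
import numpy as np

def laplacian_5pt(depth):
    """Discrete 5-point Laplacian evaluated on the interior pixels of a 2D map."""
    d = np.asarray(depth, dtype=np.float64)
    return (d[:-2, 1:-1] + d[2:, 1:-1] + d[1:-1, :-2] + d[1:-1, 2:]
            - 4.0 * d[1:-1, 1:-1])

def mean_abs_laplacian(depth):
    """Hypothetical summary statistic: average Laplacian magnitude (NOT the paper's DCS)."""
    return float(np.mean(np.abs(laplacian_5pt(depth))))

# Synthetic "planar" map: values vary affinely with pixel coordinates.
ys, xs = np.mgrid[0:32, 0:32]
plane = 2.0 + 0.1 * xs + 0.05 * ys

# Same map with a Gaussian bump, standing in for a hallucinated 3D structure.
bump = plane + 0.5 * np.exp(-((xs - 16) ** 2 + (ys - 16) ** 2) / 20.0)

flat_score = mean_abs_laplacian(plane)  # vanishes up to floating-point error
bump_score = mean_abs_laplacian(bump)   # clearly nonzero
```

Because second differences of an affine map cancel, any Laplacian response inside a region annotated as planar signals curvature that the prediction introduced.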

Paper Structure

This paper contains 27 sections, 7 equations, 26 figures, 5 tables.

Figures (26)

  • Figure 1: The 3D Mirage: Hallucinations induced by Illusive Phantom Road Patterns. (a) A driving scene featuring a deceptive phantom road pattern (3D illusion). (c) With full global context, the depth foundation model DAv2 [dav2] correctly identifies the road as planar. (d-f) However, when the view is restricted to the local region, the model fails to disambiguate texture from geometry. It hallucinates significant non-existent 3D obstacles (f) from the phantom pattern, illustrating a critical vulnerability in reliable 3D perception for autonomous driving scenarios.
  • Figure 2: Hallucinations across SOTA monocular depth models on images. Given an optical-illusion region or a view with restricted context, all tested monocular depth foundation models (DAv2 [dav2], Depth Pro [Bochkovskii_2025_ICLR], Marigold [Ke_2024_CVPR], DepthFM [Gui_2025_AAAI], ZoeDepth [zoe], MiDaS [ranftl2022tpami]) predict spurious depth variation.
  • Figure 3: Statistics of illusion regions in the 3D-Mirage dataset. Area distributions for illusion regions (left) and their corresponding random crops (right), as a percentage of the original image area. The dotted vertical line denotes the average value.
  • Figure 4: Overview of our Grounded Self-Distillation Pipeline. Our pipeline trains a Student model ($f_{\theta'}$) by injecting trainable LoRA adapters into the encoder of a frozen Teacher model ($f_\theta$). The system uses three streams to process an image containing a 3D illusion: (1) The Teacher Stream (top) processes the Full Image for a reference depth prediction; (2) The Student Full-Image Stream (middle) processes the same Full Image using student weights; and (3) The Student Crop Stream (bottom) processes an Image Crop of the illusion region, also with student weights. We optimize only the LoRA adapters with two key losses. First, a Non-Hallucination Knowledge Preservation ($\mathcal{L}_{\text{NKP}}$) loss aligns the student's background prediction (full image) with the teacher's stable prediction to prevent catastrophic forgetting. Second, a Hallucination Knowledge Re-editing ($\mathcal{L}_{\text{HKR}}$) loss uses self-distillation to force the student's full-image prediction of the illusion region to match its own, more accurate prediction from the context-free Image Crop stream. This process surgically re-edits the model's response to illusory cues while preserving its robust pre-trained knowledge.
  • Figure 5: Qualitative results of our Grounded Self-Distillation. Each row compares our model to the baseline on a 3D-Mirage sample. (1) Input RGB. (2) Error heatmap (Ours vs. Baseline), showing changes are confined to the ROI. (3) Baseline (DAv2-L) depth, which hallucinates 3D structures. (4) Our model's depth, which correctly perceives the planar surface. Our method tames the 3D mirage without distorting the background.
  • ...and 21 more figures
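
The two losses described in the Figure 4 caption can be sketched on plain arrays as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the L1 distance, the axis-aligned `box` format, the weight `lam`, and the function names are hypothetical, and the crop prediction is assumed to be already resized and scale-aligned to the ROI. In real training the crop prediction would be treated as a fixed target (no gradient), and only the LoRA adapter weights would be updated.

```python
import numpy as np

def nkp_loss(student_full, teacher_full, roi_mask):
    """Background preservation: student matches the frozen teacher outside the ROI."""
    background = ~roi_mask
    return float(np.mean(np.abs(student_full[background] - teacher_full[background])))

def hkr_loss(student_full, student_crop, box):
    """Self-distillation: student's full-image ROI matches its own crop prediction."""
    top, left, height, width = box  # hypothetical (top, left, height, width) format
    region = student_full[top:top + height, left:left + width]
    return float(np.mean(np.abs(region - student_crop)))

# Toy 8x8 example: the teacher hallucinates a raised block inside the ROI.
teacher_full = np.zeros((8, 8))
teacher_full[2:6, 2:6] = 1.0
student_full = teacher_full.copy()    # student starts out identical to the teacher
student_crop = np.zeros((4, 4))       # the crop stream sees the region as flat

roi_mask = np.zeros((8, 8), dtype=bool)
roi_mask[2:6, 2:6] = True
box = (2, 2, 4, 4)

lam = 1.0  # hypothetical weighting between the two terms
nkp = nkp_loss(student_full, teacher_full, roi_mask)
hkr = hkr_loss(student_full, student_crop, box)
total = nkp + lam * hkr
```

In this toy setup the background term is zero because the student still agrees with the teacher outside the ROI, so the entire training signal comes from the illusion region, which matches the caption's description of a confined edit.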