Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity

Xiaohao Xu, Feng Xue, Xiang Li, Haowei Li, Shusheng Yang, Tianyi Zhang, Matthew Johnson-Roberson, Xiaonan Huang

TL;DR

This work tackles depth ambiguity in real-world 3D scenes by reframing monocular depth estimation as multi-hypothesis inference. It introduces Laplacian Visual Prompting (LVP), a training-free spectral prompting method, and MD-3k, the first benchmark for explicit multi-layer spatial relationships under ambiguity. Empirical results show LVP reveals latent depth hypotheses, modulates model depth biases, and enables robust geometry-conditioned generation, video-depth consistency, and spatial reasoning when combined with RGB cues. Collectively, these contributions point toward ambiguity-aware spatial foundation models with broad implications for safe, flexible 3D perception in visual AI systems.

Abstract

Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at https://github.com/Xiaohao-Xu/Ambiguity-in-Space.
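
To make the idea of a Laplacian-transformed input concrete, the following is a minimal sketch and not the authors' implementation. It assumes a 4-neighbour discrete Laplacian applied per channel with replicated borders; the kernel, any normalization, and the names `laplacian_prompt`, `two_hypotheses`, and `depth_model` are hypothetical and do not come from the paper. `depth_model` stands in for any pre-trained depth estimator exposed as a callable on an H x W x 3 array.

```python
import numpy as np

def laplacian_prompt(rgb: np.ndarray) -> np.ndarray:
    """Apply a 4-neighbour discrete Laplacian to each channel of an H x W x 3 image.

    Assumption: the specific kernel and border handling are illustrative choices.
    """
    img = rgb.astype(np.float32)  # cast first so uint8 inputs cannot overflow
    # Replicate border pixels so the output keeps the input size.
    p = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Sum of the four neighbours minus four times the centre pixel.
    return (p[:-2, 1:-1] + p[2:, 1:-1]
            + p[1:-1, :-2] + p[1:-1, 2:]
            - 4.0 * p[1:-1, 1:-1])

def two_hypotheses(depth_model, rgb: np.ndarray):
    """Query the same frozen model twice: once on RGB, once on the Laplacian input."""
    depth_from_rgb = depth_model(rgb.astype(np.float32))
    depth_from_lvp = depth_model(laplacian_prompt(rgb))
    return depth_from_rgb, depth_from_lvp
```

No weights are updated in this sketch: the model is only queried a second time on a transformed input, which is what makes the approach training-free. A real estimator would likely need its own preprocessing (resizing, value scaling) applied to the Laplacian image as well.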

Paper Structure

This paper contains 23 sections, 10 equations, 29 figures, 3 tables.

Figures (29)

  • Figure 1: Motivation. 3D spatial understanding, powered by (a) sensors and (b) algorithms, has been confined to a biased single-layer representation of depth. (c) Existing methods collapse when faced with the true complexity of 3D, particularly in ambiguous scenes like those with transparency. (d) We propose Laplacian Visual Prompting (LVP) to transcend this limitation, granting Spatial Foundation Models the ability to derive multi-hypothesis depth, unlocking ambiguity-free spatial understanding.
  • Figure 2: Unlocking hidden depth with Laplacian Visual Prompting across diverse baselines [depth_anything, yang2024depth, marigold, dpt]. Each case includes the RGB input, estimated depth from RGB, Laplacian input, estimated hidden depth from Laplacian, and an enhanced Laplacian. Notice that depth maps from RGB and LVP both capture plausible hypotheses: one for the transparent surface (glass) and another for the opaque object behind it.
  • Figure 3: MD-3k benchmark for evaluating multi-layer spatial relationships. Example images feature annotated ambiguous region masks and sparse point pairs with multi-layer spatial labels. The first and second spatial relation columns show ground-truth near/far annotations (red and blue markers, respectively). The top three rows depict cases where the relationship reverses between layers, while the bottom row shows a case where it is the same in both layers.
  • Figure 4: Statistics of ambiguous regions in the MD-3k benchmark. Ratio of ambiguous regions to the whole image (left) and spatial distribution of ambiguous regions (right).
  • Figure 5: Multi-layer depth with Laplacian Visual Prompting (LVP). (a) Paired RGB-depth training of a domain-specific or domain-agnostic depth estimation model. (b) Standard inference via RGB input: single-layer depth on transparent glass. (c) Model inference via LVP: hidden depth revealing occluded objects, such as tables and chairs, behind the glass.
  • ...and 24 more figures
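
The point-pair annotations described for Figure 3 suggest an ordinal check: for each annotated pair, does a predicted depth map place the same point nearer as the label does? The sketch below illustrates that idea only. It is not the benchmark's actual metric, which is not specified here, and the function name `pair_agreement`, the array layout, and the `smaller_is_nearer` convention are all assumptions made for illustration.

```python
import numpy as np

def pair_agreement(depth, pairs, first_is_nearer, smaller_is_nearer=True):
    """Fraction of point pairs whose predicted near/far order matches the labels.

    depth:            H x W array of predicted depth values.
    pairs:            N pairs of (row, col) points, shape (N, 2, 2).
    first_is_nearer:  N booleans, True if the first point is labelled nearer.
    smaller_is_nearer: assumed convention; set False for inverse-depth outputs.
    """
    pairs = np.asarray(pairs)
    labels = np.asarray(first_is_nearer, dtype=bool)
    d_first = depth[pairs[:, 0, 0], pairs[:, 0, 1]]
    d_second = depth[pairs[:, 1, 0], pairs[:, 1, 1]]
    predicted = d_first < d_second if smaller_is_nearer else d_first > d_second
    return float(np.mean(predicted == labels))
```

Under these assumptions, one could score an RGB-based depth map against one layer's labels and an LVP-based depth map against the other layer's labels; pairs whose relationship reverses between layers are the ones where a single depth map cannot satisfy both.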