Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity

Xiaohao Xu; Feng Xue; Xiang Li; Haowei Li; Shusheng Yang; Tianyi Zhang; Matthew Johnson-Roberson; Xiaonan Huang

TL;DR

This work tackles depth ambiguity in real-world 3D scenes by reframing monocular depth estimation as multi-hypothesis inference. It introduces Laplacian Visual Prompting (LVP), a training-free spectral prompting method, and MD-3k, the first benchmark for explicit multi-layer spatial relationships under ambiguity. Empirical results show LVP reveals latent depth hypotheses, modulates model depth biases, and enables robust geometry-conditioned generation, video-depth consistency, and spatial reasoning when combined with RGB cues. Collectively, these contributions point toward ambiguity-aware spatial foundation models with broad implications for safe, flexible 3D perception in visual AI systems.

Abstract

Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at https://github.com/Xiaohao-Xu/Ambiguity-in-Space.
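
The two-pass idea described in the abstract (one depth estimate from the RGB image, a second from its Laplacian-transformed version, both from the same frozen model) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 4-neighbour Laplacian kernel, the edge-replicating padding, the absence of any normalization, and the names `laplacian_prompt`, `two_layer_depth`, and `depth_model` (a hypothetical callable standing in for any pre-trained monocular depth estimator).

```python
import numpy as np


def laplacian_prompt(rgb: np.ndarray) -> np.ndarray:
    """Apply a 4-neighbour discrete Laplacian to each channel of an HxWx3 image.

    Assumption: the kernel and padding are illustrative choices, not
    necessarily the ones used by the authors.
    """
    img = rgb.astype(np.float32)
    # Replicate border pixels so the output keeps the input resolution.
    p = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Sum of the four neighbours minus four times the centre pixel.
    return (
        p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]
        - 4.0 * p[1:-1, 1:-1]
    )


def two_layer_depth(rgb: np.ndarray, depth_model):
    """Query the same frozen model twice and return both depth hypotheses.

    `depth_model` is a hypothetical callable mapping an HxWx3 float array
    to an HxW depth map; no retraining or fine-tuning is involved.
    """
    depth_from_rgb = depth_model(rgb.astype(np.float32))
    depth_from_lvp = depth_model(laplacian_prompt(rgb))
    return depth_from_rgb, depth_from_lvp
```

In practice `depth_model` would wrap a real pre-trained estimator together with its own preprocessing; how the two outputs are assigned to depth layers is left open here, since the abstract does not specify it.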
