UniGround: Universal 3D Visual Grounding via Training-Free Scene Parsing

Jiaxi Zhang, Yunheng Wang, Wei Lu, Taowen Wang, Weisheng Xu, Shuning Zhang, Yixiao Feng, Yuetong Fang, Renjing Xu

TL;DR

The proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object.

Abstract

Understanding and localizing objects in complex 3D environments from natural language descriptions, known as 3D Visual Grounding (3DVG), is a foundational challenge in embodied AI, with broad implications for robotics, augmented reality, and human-machine interaction. Large-scale pre-trained foundation models have driven significant progress on this front, enabling open-vocabulary 3DVG that allows systems to locate arbitrary objects in a given scene. However, their reliance on pre-trained models constrains 3D perception and reasoning within the inherited knowledge boundaries, resulting in limited generalization to unseen spatial relationships and poor robustness to out-of-distribution scenes. In this paper, we replace this constrained perception with training-free visual and geometric reasoning, thereby unlocking open-world 3DVG that enables the localization of any object in any scene beyond the training data. Specifically, the proposed UniGround operates in two stages: a Global Candidate Filtering stage that constructs scene candidates through training-free 3D topology and multi-view semantic encoding, and a Local Precision Grounding stage that leverages multi-scale visual prompting and structured reasoning to precisely identify the target object. Experiments on ScanRefer and EmbodiedScan show that UniGround achieves 46.1%/34.1% Acc@0.25/0.5 on ScanRefer and 28.7% Acc@0.25 on EmbodiedScan, establishing a new state-of-the-art among zero-shot methods on EmbodiedScan without any 3D supervision. We further evaluate UniGround in real-world environments under uncontrolled reconstruction conditions and substantial domain shift, showing training-free reasoning generalizes robustly beyond curated benchmarks.
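
The Acc@0.25 and Acc@0.5 figures quoted above are conventionally read as the fraction of queries whose predicted 3D box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.25 or 0.5. The sketch below illustrates that metric under a simplifying assumption: boxes are axis-aligned and given as (xmin, ymin, zmin, xmax, ymax, zmax). The function names `iou_3d` and `acc_at_iou` are hypothetical and are not taken from the paper or from any benchmark toolkit; benchmarks that use oriented boxes would need a different overlap routine.

```python
from typing import Sequence, Tuple

# Assumed box layout (not from the paper): (xmin, ymin, zmin, xmax, ymax, zmax)
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    dx = max(box[3] - box[0], 0.0)
    dy = max(box[4] - box[1], 0.0)
    dz = max(box[5] - box[2], 0.0)
    return dx * dy * dz


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        if high <= low:
            return 0.0  # no overlap along this axis
        intersection *= high - low
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0.0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the ground truth is >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou_3d(p, g) >= threshold)
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 0, 2, 2, 2), (1, 0, 0, 3, 2, 2)]
    ground_truth = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 2, 2, 2)]
    print(acc_at_iou(predictions, ground_truth, 0.25))
    print(acc_at_iou(predictions, ground_truth, 0.5))
```

In the usage example the second prediction overlaps its target with an IoU of 1/3, so it counts as correct at the 0.25 threshold but not at 0.5, which is why a method's Acc@0.5 is never higher than its Acc@0.25.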

Paper Structure (14 sections, 2 equations, 6 figures, 3 tables)

Figures (6)

  • Figure 1: Conceptual comparison and overview of UniGround. Top: Existing zero-shot 3D visual grounding pipelines typically construct candidate objects using dataset-trained 3D detectors with prior information, which may degrade under domain shift and affect downstream reasoning quality. Bottom: UniGround adopts a training-free paradigm consisting of Global Candidate Filtering for 2D-3D instance construction and open-vocabulary encoding, followed by Local Precision Grounding for geometry-aware multi-view reasoning with a structured and parsable output. Right: Example of real-world deployment in an unseen office scene.
  • Figure 2: Overview of UniGround. Given raw scene observations and a language query, UniGround performs Stage 1: Global Candidate Filtering to retrieve a compact candidate set, followed by Stage 2: Local Precision Grounding, which localizes the target by jointly reasoning over global scene context for spatial verification and candidate-centric visual evidence for semantic identification.
  • Figure 3: Orbit rendering and candidate views.
  • Figure 5: Qualitative comparison on the cross-dataset ARKitScene benchmark. Purple boxes denote failures and green boxes denote correct grounding.
  • Figure 6: Real-world qualitative results in unseen indoor scenes under capture noise and layout shift.
  • ...and 1 more figure