View-on-Graph: Zero-shot 3D Visual Grounding via Vision-Language Reasoning on Scene Graphs

Yuanyuan Liu, Haiyang Mei, Dongyang Zhan, Jiayue Zhao, Dongsheng Zhou, Bo Dong, Xin Yang

TL;DR

This work introduces VoG, a zero-shot 3D visual grounding framework that externalizes 3D spatial information into a multi-modal, multi-layer scene graph (MMMG) and enables a Vision-Language Model to actively traverse the graph. By reformulating 3DVG as an interactive reasoning process over MMMG, VoG reduces entangled input complexity and provides transparent grounding traces. Empirical results on ScanRefer and Nr3D show state-of-the-art zero-shot performance, with large-model variants achieving parity with supervised methods, while ablations demonstrate the necessity of MMMG structure, graph traversal, and multi-round reasoning. The approach offers interpretable, scalable 3D grounding and highlights a principled path toward integrating structured scene representations with powerful VLMs in robotics and AR contexts.

Abstract

3D visual grounding (3DVG) identifies objects in 3D scenes from language descriptions. Existing zero-shot approaches leverage 2D vision-language models (VLMs) by converting 3D spatial information (SI) into forms amenable to VLM processing, typically as composite inputs such as specified view renderings or video sequences with overlaid object markers. However, this VLM + SI paradigm yields entangled visual representations that compel the VLM to process entire cluttered cues, making it hard to exploit spatial semantic relationships effectively. In this work, we propose a new VLM x SI paradigm that externalizes the 3D SI into a form enabling the VLM to incrementally retrieve only what it needs during reasoning. We instantiate this paradigm with a novel View-on-Graph (VoG) method, which organizes the scene into a multi-modal, multi-layer scene graph and allows the VLM to operate as an active agent that selectively accesses necessary cues as it traverses the scene. This design offers two intrinsic advantages: (i) by structuring 3D context into a spatially and semantically coherent scene graph rather than confounding the VLM with densely entangled visual inputs, it lowers the VLM's reasoning difficulty; and (ii) by actively exploring and reasoning over the scene graph, it naturally produces transparent, step-by-step traces for interpretable 3DVG. Extensive experiments show that VoG achieves state-of-the-art zero-shot performance, establishing structured scene exploration as a promising strategy for advancing zero-shot 3DVG.
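
The idea of externalizing spatial information into a graph that the VLM queries piece by piece can be illustrated with a small data structure. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the two-layer split into view nodes and object nodes, the class names `ObjectNode`, `ViewNode`, and `SceneGraph`, and the methods `views_showing` and `neighbors` are all hypothetical and chosen for illustration. The abstract does not specify the actual layers or edge types.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ObjectNode:
    object_id: int
    category: str  # text modality, e.g. "chair" (hypothetical field)


@dataclass
class ViewNode:
    view_id: int
    image_path: str  # image modality; the path is only a placeholder
    visible: Set[int] = field(default_factory=set)  # ids of objects seen in this view


class SceneGraph:
    """Toy two-layer graph (assumed layout): a view layer and an object layer."""

    def __init__(self) -> None:
        self.objects: Dict[int, ObjectNode] = {}
        self.views: Dict[int, ViewNode] = {}
        self.view_edges: Dict[int, Set[int]] = {}  # undirected view-to-view edges

    def add_object(self, obj: ObjectNode) -> None:
        self.objects[obj.object_id] = obj

    def add_view(self, view: ViewNode) -> None:
        self.views[view.view_id] = view
        self.view_edges.setdefault(view.view_id, set())

    def link_views(self, a: int, b: int) -> None:
        # Both views must already have been added with add_view.
        self.view_edges[a].add(b)
        self.view_edges[b].add(a)

    def views_showing(self, category: str) -> List[int]:
        # Retrieve only the views in which an object of this category is visible.
        return sorted(
            v.view_id
            for v in self.views.values()
            if any(self.objects[o].category == category for o in v.visible)
        )

    def neighbors(self, view_id: int) -> List[int]:
        # Retrieve only the views adjacent to the current one.
        return sorted(self.view_edges[view_id])


# Tiny made-up scene: three views in a row, one chair and one TV.
graph = SceneGraph()
graph.add_object(ObjectNode(1, "chair"))
graph.add_object(ObjectNode(2, "tv"))
graph.add_view(ViewNode(0, "view_0.png", {1}))
graph.add_view(ViewNode(1, "view_1.png", {2}))
graph.add_view(ViewNode(2, "view_2.png", {1, 2}))
graph.link_views(0, 1)
graph.link_views(1, 2)
```

In this sketch a reasoning agent never receives the whole scene at once. It asks for the views that show a category, or for the neighbors of its current view, which mirrors the "retrieve only what it needs" framing of the abstract.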

Paper Structure

This paper contains 42 sections, 9 equations, 14 figures, and 4 tables.

Figures (14)

  • Figure 1: Comparison of zero-shot 3D visual grounding paradigms. Left: Passive, fixed-view processing paradigm. When the anchored-view observation misses the target, the VLM’s reasoning over incomplete or misleading visual cues leads to failure. Right: Active, iterative exploration via our View-on-Graph (VoG) paradigm. By traversing the scene graph to navigate from misleading views toward informative observations, the VLM accurately locates the target and produces interpretable grounding traces.
  • Figure 2: Workflow of VoG. Given the query, VoG identifies the target category (chair) and anchor (TV) as the search topic, then randomly selects an initial view where a chair is visible (View 0). From this view, it traverses neighboring nodes to form candidate nodes, which are fed to the VLM to decide whether further exploration is needed. The first exploration area reveals the "brighter side of the room" in the query, where multiple chairs are observed and added to the object pool. However, the anchor TV is missing, leaving the target unclear. The VLM then explores an area where the TV becomes visible (STEP 2) and observes "the first chair counted from the TV," adding chair-1 to the pool. Ambiguity remains due to another nearby chair, prompting a final exploration to capture the full spatial layout. Once all query cues, including "faces the brighter side of the room," are confirmed, the VLM identifies chair-1 as the target and terminates the search.
  • Figure 3: Traceable reasoning paths of VoG. Starting from an initial view showing a visually similar but mismatched chair, VoG rejects it due to missing the "tucked beneath the desk" cue. It then explores related scene elements (desk) and progressively refines its hypothesis. The search converges when both appearance and spatial state match, resulting in correct grounding.
  • Figure 4: Qualitative grounding results. (Top) The initial view reveals a brown chair matching part of the query description ("copier machine to the right"), but VoG explores further to verify spatial cues ("beside another similar chair") before confirming the target. (Bottom) Starting from "at the table", VoG explores to confirm "nearest to the refrigerator" before final grounding.
  • Figure 5: Graph Structure Ablation. S1: Full structure. S2–S4: Remove one type of edge while keeping the others. S5*: Keep only image input with global object IDs. S6: Keep only text input. S7: Remove all structure. The first row shows the graph structure configurations and the corresponding accuracy.
  • ...and 9 more figures
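
The workflow in the Figure 2 caption (pick a start view that shows the target category, gather candidate neighbors, let the VLM decide whether to keep exploring, grow an object pool, stop once the cues are confirmed) can be sketched as a simple loop. This is a hedged illustration only: the function `explore`, its dictionary inputs, and the rule-based `make_stub_decider` that stands in for the VLM are hypothetical names and simplifications, not the paper's actual procedure or prompts.

```python
import random


def explore(neighbors, visible, categories, target, anchor, decide,
            max_steps=10, seed=0):
    """Iteratively walk over views until `decide` names a target object.

    neighbors:  dict view_id -> list of adjacent view ids
    visible:    dict view_id -> set of object ids seen in that view
    categories: dict object_id -> category string
    decide:     stand-in for the VLM; returns ("answer", object_id)
                or ("move", view_id)
    """
    rng = random.Random(seed)
    # Initial view: chosen at random among views where the target category is visible.
    starts = sorted(
        v for v, objs in visible.items()
        if any(categories[o] == target for o in objs)
    )
    if not starts:
        return None, []
    current = rng.choice(starts)
    pool, trace = set(), []
    for _ in range(max_steps):
        trace.append(current)
        # Add target- and anchor-category objects from this view to the object pool.
        pool |= {o for o in visible[current] if categories[o] in (target, anchor)}
        # Candidate nodes: unvisited neighbors of the current view.
        candidates = [v for v in neighbors.get(current, []) if v not in trace]
        action, value = decide(pool, candidates, trace)
        if action == "answer":
            return value, trace
        if value not in candidates:
            break  # nothing left to explore
        current = value
    return None, trace


def make_stub_decider(categories, target, anchor):
    """Rule-based placeholder for the VLM (an assumption, not the real model)."""
    def decide(pool, candidates, trace):
        seen = {categories[o] for o in pool}
        targets = sorted(o for o in pool if categories[o] == target)
        if anchor in seen and targets:
            return "answer", targets[0]
        next_view = candidates[0] if candidates else None
        return "move", next_view
    return decide


# Made-up scene: three views in a row; the TV is only visible from the last one.
categories = {1: "chair", 2: "lamp", 3: "tv"}
visible = {0: {1}, 1: {2}, 2: {3}}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
decide = make_stub_decider(categories, "chair", "tv")
result, trace = explore(neighbors, visible, categories, "chair", "tv", decide)
```

The returned `trace` lists the views visited in order, which is a simplified analogue of the step-by-step grounding traces described in the captions. A real system would replace the stub with VLM calls over the images and text attached to each node.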