Zero-Shot 3D Visual Grounding from Vision-Language Models

Rong Li, Shijie Li, Lingdong Kong, Xulei Yang, Junwei Liang

TL;DR

The paper tackles open-world 3D visual grounding without 3D-specific training by exploiting 2D Vision-Language Models. It introduces SeeGround, a training-free framework that converts 3D scenes into query-aligned renderings paired with spatial textual descriptions, enabling cross-modal reasoning. Two key modules—Perspective Adaptation and Fusion Alignment—select viewpoints and align visual features with 3D spatial cues to reduce grounding ambiguity. Evaluations on ScanRefer and Nr3D show substantial improvements over zero-shot baselines and competitiveness with fully supervised methods, highlighting strong generalization in cluttered and partially described scenes.

Abstract

3D Visual Grounding (3DVG) seeks to locate target objects in 3D scenes using natural language descriptions, enabling downstream applications such as augmented reality and robotics. Existing approaches typically rely on labeled 3D data and predefined categories, limiting scalability to open-world settings. We present SeeGround, a zero-shot 3DVG framework that leverages 2D Vision-Language Models (VLMs) to bypass the need for 3D-specific training. To bridge the modality gap, we introduce a hybrid input format that pairs query-aligned rendered views with spatially enriched textual descriptions. Our framework incorporates two core components: a Perspective Adaptation Module that dynamically selects optimal viewpoints based on the query, and a Fusion Alignment Module that integrates visual and spatial signals to enhance localization precision. Extensive evaluations on ScanRefer and Nr3D confirm that SeeGround achieves substantial improvements over existing zero-shot baselines -- outperforming them by 7.7% and 7.1%, respectively -- and even rivals fully supervised alternatives, demonstrating strong generalization under challenging conditions.

Paper Structure

This paper contains 11 sections, 7 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Effectiveness of SeeGround: Unlike previous state-of-the-art methods, our approach aligns 2D visual cues -- such as texture, shape, viewpoint, spatial position, orientation, state, and order -- with 3D spatial language to enable fine-grained scene comprehension. Specifically, our method: (a) texture: detects the floral chair by leveraging distinctive color and texture patterns; (b) shape: identifies the couch through its geometric shape; (c) viewpoint: localizes the correct window by analyzing spatial relations and camera perspective; (d) orientation: distinguishes the chair via directional alignment cues; (e) state: recognizes the closed door based on visual interpretation of object state; and (f) order: selects the bookshelf by reasoning about relative spatial placement.
  • Figure 2: Overview of the SeeGround framework. A 2D-VLM first interprets the query, identifying the target (e.g., "laptop") and an anchor (e.g., "chair with a floral pattern"). A dynamic viewpoint is selected based on the anchor’s position to render a query-aligned 2D image. Using the Object Lookup Table ($\mathcal{OLT}$), we retrieve 3D boxes, project visible ones, and apply visual prompts to reduce occlusion. The prompted image, spatial text, and query are fed into the 2D-VLM to localize the target. The predicted ID is then used to retrieve its 3D bounding box from $\mathcal{OLT}$.
  • Figure 3: Illustrative example of different perspective selection strategies. Our "Query-Aligned" method dynamically adapts the viewpoint to match the spatial context of the query, enhancing detail and relevance of visible objects compared to static methods.
  • Figure 4: Qualitative Results. Rendered scenes with model predictions: correct objects in Green, incorrect in Orange. Key visual cues (e.g., color, texture, spatial relations) are underlined to illustrate the model's reasoning.
  • Figure 5: Ablation study on (a) different projection strategies (ours vs. ZSVG3D), and (b) different language agents (GPT-4 vs. Qwen2-VL).
  • ...and 2 more figures
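
The Figure 2 caption describes selecting a viewpoint from the anchor's position and projecting the visible 3D boxes from the Object Lookup Table into the rendered image. The sketch below illustrates only that geometric step, under assumptions that are not taken from the paper: the table is a plain dictionary of object centers, the camera sits above the mean of the object centers and looks at the anchor, and a simple pinhole model stands in for the renderer. The names `look_at`, `project`, `query_aligned_view`, `eye_height`, and the example scene are all hypothetical.

```python
import numpy as np

def look_at(eye, target, up=(0.0, 0.0, 1.0)):
    """Build a 4x4 world-to-camera matrix (x right, y down, z forward).

    Assumes the viewing direction is not parallel to `up`.
    """
    eye = np.asarray(eye, dtype=float)
    forward = np.asarray(target, dtype=float) - eye
    forward /= np.linalg.norm(forward)
    right = np.cross(forward, np.asarray(up, dtype=float))
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)
    rotation = np.stack([right, down, forward])  # rows: camera axes in world frame
    world_to_cam = np.eye(4)
    world_to_cam[:3, :3] = rotation
    world_to_cam[:3, 3] = -rotation @ eye
    return world_to_cam

def project(points, world_to_cam, focal, width, height):
    """Pinhole-project 3D points; return pixel coordinates and a visibility mask."""
    pts = np.asarray(points, dtype=float)
    homo = np.hstack([pts, np.ones((len(pts), 1))])
    cam = (world_to_cam @ homo.T).T[:, :3]
    depth = cam[:, 2]
    safe = np.where(depth > 1e-6, depth, np.nan)  # points behind the camera become NaN
    u = focal * cam[:, 0] / safe + width / 2
    v = focal * cam[:, 1] / safe + height / 2
    visible = (depth > 1e-6) & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return np.stack([u, v], axis=1), visible

def query_aligned_view(olt, anchor_id, eye_height=1.5):
    """Hypothetical viewpoint rule: hover above the scene center, look at the anchor."""
    centers = np.array([obj["center"] for obj in olt.values()], dtype=float)
    eye = centers.mean(axis=0) + np.array([0.0, 0.0, eye_height])
    return look_at(eye, olt[anchor_id]["center"])

# Toy Object Lookup Table: object ID -> label and 3D box center (illustrative values).
olt = {
    0: {"label": "chair", "center": (2.0, 1.0, 0.5)},
    1: {"label": "laptop", "center": (2.2, 1.4, 0.8)},
    2: {"label": "door", "center": (-3.0, -2.0, 1.0)},
}

world_to_cam = query_aligned_view(olt, anchor_id=0)
ids = list(olt)
pixels, visible = project(
    [olt[i]["center"] for i in ids], world_to_cam, focal=500.0, width=640, height=480
)
# IDs that would receive a visual prompt (marker) in the rendered image.
visible_ids = [i for i, seen in zip(ids, visible) if seen]
```

In this toy scene the anchor lands at the image center by construction, and the door falls behind the camera, so it would not receive a marker. A full pipeline would also need occlusion handling and an actual renderer, which this sketch leaves out.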