ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

Zhenyang Liu, Yikai Wang, Sixiao Zheng, Tongying Pan, Longfei Liang, Yanwei Fu, Xiangyang Xue

TL;DR

This work tackles open-vocabulary 3D grounding and reasoning under occlusion by introducing ReasonGrounder, an LVLM-guided framework that leverages scale-hierarchical 3D Gaussian fields and 3D Gaussian Splatting. It integrates SAM-derived masks, CLIP supervision, and LVLM reasoning to select Gaussian groups and achieve accurate, amodal object localization from novel viewpoints without heavy 3D annotations. A key contribution is the ReasoningGD dataset, with over 10K scenes and ~2 million annotations, enabling robust evaluation of implicit instructions and occlusion handling. Experiments show that ReasonGrounder outperforms state-of-the-art open-vocabulary 3D grounding methods in both localization accuracy and IoU, while also supporting complex reasoning and amodal perception from novel views, which is significant for vision-language navigation and robotics.
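
Since the summary reports results in terms of IoU, a minimal sketch of that metric may help. It assumes binary 2D masks stored as nested lists; the function name `mask_iou` and the convention for two empty masks are illustrative choices, not taken from the paper.

```python
def mask_iou(pred, target):
    """Intersection over union of two equally sized binary masks (lists of rows)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1  # pixel is in both masks
            if p or t:
                union += 1  # pixel is in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


# Toy example: the masks share 2 pixels and cover 4 pixels in total.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
iou = mask_iou(pred, target)
print(iou)
```

A predicted mask rendered from the selected Gaussians would be compared against the ground-truth mask in this way; how the paper aggregates per-object scores is not stated in this summary.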

Abstract

Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle the diverse semantics and common knowledge required for effective reasoning. In this work, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLMs) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from SAM and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views. We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder significantly improves 3D grounding accuracy in real-world scenarios.
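
To make the idea of matching language features to a text query concrete, here is a rough sketch. It assumes each Gaussian carries a language feature vector living in the same space as a text embedding, and it keeps the Gaussians whose cosine similarity exceeds a threshold. The names `select_gaussians` and `threshold` are hypothetical, and the paper's actual selection also involves LVLM reasoning and scale-based grouping, which this sketch omits.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_gaussians(features, text_embedding, threshold=0.8):
    """Return indices of Gaussians whose feature is similar enough to the query."""
    return [i for i, feat in enumerate(features)
            if cosine(feat, text_embedding) >= threshold]


# Toy 2D features standing in for per-Gaussian language features.
features = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query = [1.0, 0.0]  # stand-in for a text embedding of the query
selected = select_gaussians(features, query, threshold=0.5)
print(selected)
```

In practice the features would be high-dimensional and the threshold would need tuning; the toy values above are only meant to show the mechanics.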

Paper Structure

This paper contains 18 sections, 10 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: Examples of open-vocabulary 3D visual grounding and reasoning. In a given scene, the user observes from a perspective with occlusions and asks questions such as: "Can you localize the red, round, sweet fruit on the table that is partially occluded by the toy sheep?" Open-vocabulary 3D visual grounding and reasoning seeks to interpret complex implicit queries, deduce answers, and accurately localize the target object, even when it is partially or fully occluded from the current viewpoint.
  • Figure 2: The framework of our ReasonGrounder. Our ReasonGrounder leverages 3D Gaussian Splatting (3DGS) for efficient high-resolution rendering. It extracts 2D segmentation masks from SAM (Kirillov et al., 2023) and maps them into a 3D field. Each mask is assigned a 3D scale based on depth from 3DGS. Latent feature vectors are appended to Gaussians and mapped into hierarchical language and instance features using two MLPs. CLIP embeddings ensure multi-view consistency, while contrastive loss refines the masks. A reference view is selected based on LVLM, guiding implicit instruction comprehension for accurate 3D localization and amodal perception in novel views.
  • Figure 3: The pipeline of scale-hierarchical feature Gaussian field. The method extracts 2D masks from SAM and projects them into a 3D field. ReasonGrounder adds a latent feature to each Gaussian, mapping it into hierarchical language and instance features. Language features are supervised by CLIP embeddings, while instance features refine masks using contrastive loss and 3D scale.
  • Figure 4: Qualitative comparisons of open-vocabulary 3D visual grounding. Our ReasonGrounder demonstrates superior accuracy in open-vocabulary 3D localization compared to other state-of-the-art methods.
  • Figure 5: Qualitative results of 3D localization with implicit instructions on the LERF, 3D-OVS, and ReasoningGD datasets. These results demonstrate that our ReasonGrounder can accurately interpret implicit instructions and identify the target object.
  • ...and 11 more figures
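
The Figure 2 caption notes that each mask is assigned a 3D scale based on depth from 3DGS. The exact formula is not given in this summary, so the following is only one plausible sketch: it back-projects the masked pixels with a pinhole camera model and takes the diagonal of their 3D bounding box as the scale. The intrinsics (`fx`, `fy`, `cx`, `cy`), the function names, and the bounding-box definition are all assumptions made for illustration.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth into camera space."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def mask_scale(mask, depth_map, fx, fy, cx, cy):
    """Assumed scale: diagonal of the 3D bounding box of the masked pixels."""
    points = [backproject(u, v, depth_map[v][u], fx, fy, cx, cy)
              for v, row in enumerate(mask)
              for u, inside in enumerate(row) if inside]
    if not points:
        return 0.0  # an empty mask has no extent
    extents = [max(p[k] for p in points) - min(p[k] for p in points)
               for k in range(3)]
    return math.sqrt(sum(e * e for e in extents))


# Toy example: two masked pixels at depth 2 with unit focal length.
mask = [[1, 1],
        [0, 0]]
depth_map = [[2.0, 2.0],
             [2.0, 2.0]]
scale = mask_scale(mask, depth_map, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(scale)
```

A scale of this kind could then condition the instance features so that the same Gaussian groups differently at coarse and fine scales, which is the role the captions describe for the scale-hierarchical field.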