Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases

Zhihao Yuan, Xu Yan, Zhuo Li, Xuhao Li, Yao Guo, Shuguang Cui, Zhen Li

TL;DR

This work extends 3D Visual Grounding to a phrase-aware setting (3DPAG), enabling the localization of all objects mentioned in a sentence and their contextual relationships in 3D scenes. It introduces a phrase-object alignment (POA) map derived from cross-attention and a phrase-specific pre-training (PSP) strategy to capture fine-grained, phrase-level cues, supported by new phrase-annotated datasets Nr3D++, Sr3D++, and ScanRefer++. The approach delivers substantial improvements over prior 3DVG methods across Nr3D, Sr3D, and ScanRefer, and demonstrates strong gains even when using detector-based proposals, while also providing an interpretable grounding map that links each object to sentence tokens. Together, these contributions advance explainable 3D grounding and offer practical benefits for downstream 3D scene understanding tasks. This method yields state-of-the-art performance on both 3DVG and 3DPAG benchmarks and highlights the value of fine-grained, phrase-level supervision in multimodal 3D reasoning.

Abstract

Recent progress in 3D scene understanding has explored visual grounding (3DVG) to localize a target object through a language description. However, existing methods only consider the dependency between the entire sentence and the target object, ignoring the fine-grained relationships between contextual phrases and non-target objects. In this paper, we extend 3DVG to a more fine-grained and interpretable task, called 3D Phrase Aware Grounding (3DPAG). The 3DPAG task aims to localize the target objects in a 3D scene by explicitly identifying all phrase-related objects and then reasoning over them according to the contextual phrases. To tackle this problem, we manually labeled about 227K phrase-level annotations using a self-developed platform, from 88K sentences of widely used 3DVG datasets, i.e., Nr3D, Sr3D and ScanRefer. Using our datasets, we can extend previous 3DVG methods to the fine-grained phrase-aware scenario. This is achieved through the proposed phrase-object alignment optimization and phrase-specific pre-training, which also boost conventional 3DVG performance. Extensive results confirm significant improvements, i.e., the previous state-of-the-art method achieves overall accuracy gains of 3.9%, 3.5% and 4.6% on Nr3D, Sr3D and ScanRefer, respectively.

Paper Structure

This paper contains 20 sections, 5 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: 3D Phrase-Aware Visual Grounding (3DPAG). Compared with previous 3DVG (a), which grounds only the target object, 3DPAG (b) requires the neural listener to identify all phrase-aware objects (target and non-target) in the 3D scene and then explicitly reason over all objects through the contexts. In the right part of (b), the 3D bounding boxes (BBox) are annotated in the same color as the corresponding object phrases in the sentence. In the bottom part of (b), we visualize the grounding attention score on the point cloud with respect to the given phrases, based on our proposed method.
  • Figure 2: 3D phrase aware grounding (3DPAG) architecture with Phrase-Object Alignment (POA) Optimization. Part (a) illustrates the 3DPAG architecture, and part (b) shows the process of optimizing the phrase-object alignment map.
  • Figure 3: Phrase-Specific Training. In the phrase-specific pre-training stage, we generate the phrase-specific masks according to the ground-truth phrases, e.g., setting the positions of "the office chair" to $1$ and all other positions to $0$. Then, we design the network to predict the object corresponding to the selected phrase. During the fine-tuning stage, we only predict the target object referred to in the sentence.
  • Figure 4: Visualization Results of 3DVG. We visualize the visual grounding results of SAT and our method. The four examples on the left are cases where our method predicts correctly while SAT fails. The two examples on the right show representative failures of both the baseline and our method. Green, red, and blue boxes denote correct predictions, incorrect predictions, and the ground truth (GT), respectively. The target class for each query is shown in red. We provide rendered scenes in the first row for better visualization. Best viewed in color.
  • Figure 5: Visualization Results of 3DPAG. We show examples of 3DPAG predictions from our method. Each phrase and its corresponding bounding box are drawn in the same color.
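
The phrase-specific mask described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes whitespace tokenization and an exact token match, which is a simplification; the tokenizer actually used in the paper is not stated here. The function name `phrase_mask` and the example sentence are hypothetical and chosen only for illustration.

```python
from typing import List


def phrase_mask(tokens: List[str], phrase_tokens: List[str]) -> List[int]:
    """Return a 0/1 mask over `tokens` marking the first occurrence of the phrase.

    Hypothetical helper: positions covered by the phrase are set to 1 and
    all other positions to 0, as in the Figure 3 description.
    """
    n, m = len(tokens), len(phrase_tokens)
    if m == 0:
        raise ValueError("phrase must contain at least one token")
    # Slide a window over the sentence and look for an exact token match.
    for start in range(n - m + 1):
        if tokens[start:start + m] == phrase_tokens:
            return [0] * start + [1] * m + [0] * (n - start - m)
    raise ValueError("phrase not found in sentence")


# Illustrative sentence (not taken from the datasets), whitespace-tokenized.
sentence = "the office chair next to the desk".split()
mask = phrase_mask(sentence, "the office chair".split())
print(mask)  # [1, 1, 1, 0, 0, 0, 0]
```

In a pre-training setup of this kind, one such mask would be built per annotated phrase, so that a single sentence yields several training samples, each asking the network for the object tied to the selected phrase.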