Toward Explainable and Fine-Grained 3D Grounding through Referring Textual Phrases
Zhihao Yuan, Xu Yan, Zhuo Li, Xuhao Li, Yao Guo, Shuguang Cui, Zhen Li
TL;DR
This work extends 3D visual grounding (3DVG) to a phrase-aware setting, 3D Phrase Aware Grounding (3DPAG), which localizes every object mentioned in a sentence, not only the target, along with their contextual relationships in the 3D scene. It introduces a phrase-object alignment (POA) map derived from cross-attention and a phrase-specific pre-training (PSP) strategy to capture fine-grained, phrase-level cues, supported by the new phrase-annotated datasets Nr3D++, Sr3D++, and ScanRefer++. Applied to the previous state-of-the-art 3DVG method, the approach raises overall accuracy by 3.9%, 3.5%, and 4.6% on Nr3D, Sr3D, and ScanRefer, respectively, and shows strong gains even with detector-based proposals. It also provides an interpretable grounding map that links each object to sentence tokens, which advances explainable 3D grounding and benefits downstream 3D scene understanding tasks. The method reports state-of-the-art performance on both 3DVG and 3DPAG benchmarks, highlighting the value of fine-grained, phrase-level supervision in multimodal 3D reasoning.
Abstract
Recent progress in 3D scene understanding has explored 3D visual grounding (3DVG), which localizes a target object from a language description. However, existing methods only consider the dependency between the entire sentence and the target object, ignoring the fine-grained relationships between contextual phrases and non-target objects. In this paper, we extend 3DVG to a more fine-grained and interpretable task, called 3D Phrase Aware Grounding (3DPAG). The 3DPAG task aims to localize the target objects in a 3D scene by explicitly identifying all phrase-related objects and then reasoning over the contextual phrases. To tackle this problem, we manually labeled about 227K phrase-level annotations, using a self-developed platform, on 88K sentences from widely used 3DVG datasets, i.e., Nr3D, Sr3D and ScanRefer. Building on these datasets, we extend previous 3DVG methods to the fine-grained phrase-aware scenario. This is achieved through the proposed phrase-object alignment optimization and phrase-specific pre-training, which also boost conventional 3DVG performance. Extensive results confirm significant improvements, i.e., the previous state-of-the-art method achieves overall accuracy gains of 3.9%, 3.5% and 4.6% on Nr3D, Sr3D and ScanRefer, respectively.
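To make the idea of a phrase-object alignment map more concrete, the following is a minimal sketch of how an object-to-token cross-attention matrix could be pooled into an object-to-phrase map and used to ground each phrase. It is an illustration under stated assumptions, not the authors' implementation: the function names, the mean pooling over token spans, the argmax grounding rule, and the toy attention values are all hypothetical.

```python
from typing import List, Tuple


def phrase_object_alignment(
    attn: List[List[float]],
    phrase_spans: List[Tuple[int, int]],
) -> List[List[float]]:
    """Pool an object-to-token attention matrix into an object-to-phrase map.

    attn[i][t]   : attention weight from object proposal i to token t.
    phrase_spans : half-open (start, end) token ranges, one per phrase.
    Mean pooling over each span is an illustrative assumption.
    """
    poa = []
    for row in attn:
        pooled = []
        for start, end in phrase_spans:
            if not 0 <= start < end <= len(row):
                raise ValueError(f"invalid span ({start}, {end})")
            pooled.append(sum(row[start:end]) / (end - start))
        poa.append(pooled)
    return poa


def ground_phrases(poa: List[List[float]]) -> List[int]:
    """For each phrase, return the index of the best-aligned object."""
    num_phrases = len(poa[0])
    return [
        max(range(len(poa)), key=lambda i: poa[i][p])
        for p in range(num_phrases)
    ]


# Toy input (made-up values): 3 object proposals, 6 tokens of
# "the chair next to the table".
attn = [
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],  # object 0
    [0.05, 0.05, 0.10, 0.10, 0.20, 0.50],  # object 1
    [0.20, 0.10, 0.30, 0.20, 0.10, 0.10],  # object 2
]
phrase_spans = [(0, 2), (4, 6)]  # "the chair", "the table"

poa = phrase_object_alignment(attn, phrase_spans)
grounded = ground_phrases(poa)
print(grounded)  # object index chosen for each phrase
```

In this toy setup each column of the pooled map can be read as a per-phrase score over object proposals, which is what makes such a map inspectable: one can see which object each phrase of the sentence attends to, including non-target objects such as "the table".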
