R2G: Reasoning to Ground in 3D Scenes

Yixuan Li; Zan Wang; Wei Liang

TL;DR

R2G reframes 3D referential grounding as interpretable reasoning over a semantic concept-based scene graph. It builds node states containing object category and attributes $s=(s^0,\ldots,s^L)$ and edges encoding spatial relations, all embedded in a shared concept space $C=C_O\cup C_A\cup C_R$ and mapped to $\mathbb{R}^d$ via GloVe embeddings to enable cross-modal similarity. The referential utterance is converted into a sequence of instructions by either a learning-based encoder-decoder or an LLM, and grounding proceeds through multi-round attention transfers along the graph guided by these instructions, with a grounding loss $\mathcal{L}_{ref}$ and auxiliary losses $\mathcal{L}_t,\mathcal{L}_a,\mathcal{L}_r$ to enforce correct language parsing. On Sr3D/Nr3D and NS3D benchmarks, R2G achieves competitive accuracy while delivering substantially improved interpretability through explicit reasoning steps, illustrating the potential of neural-symbolic approaches for robust, generalizable 3D grounding. The work points to a new direction where grounding in 3D scenes is achieved via explicit semantic concepts and structured reasoning, rather than opaque end-to-end feature fusion.
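The shared concept space can be illustrated with a small sketch. This is not the authors' implementation: the vocabulary entries, the dimension `DIM`, and the helper `toy_embedding` (a deterministic random vector standing in for a real GloVe lookup) are all assumptions made for illustration. It only shows how a parsed instruction and graph entries can be compared by cosine similarity once both live in the same $\mathbb{R}^d$.

```python
import math
import random

# Hypothetical toy vocabulary; the paper's concept space is C = C_O ∪ C_A ∪ C_R.
C_O = ["chair", "table", "lamp"]      # object categories (assumed examples)
C_A = ["red", "wooden", "small"]      # attributes (assumed examples)
C_R = ["left_of", "near", "above"]    # spatial relations (assumed examples)
CONCEPTS = C_O + C_A + C_R
DIM = 8  # stand-in for the embedding dimension d


def toy_embedding(word, dim=DIM):
    """Deterministic random vector standing in for a GloVe lookup (assumption)."""
    rng = random.Random(word)  # seeding with the word keeps vectors reproducible
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Every concept, regardless of type, is mapped into the same vector space.
EMBED = {c: toy_embedding(c) for c in CONCEPTS}


def best_concept(query_vec, candidates):
    """Return the candidate concept most similar to the query vector."""
    return max(candidates, key=lambda c: cosine(query_vec, EMBED[c]))


# An instruction token embedded in the same space can be matched to a category.
matched = best_concept(toy_embedding("lamp"), C_O)
```

With real GloVe vectors, semantically related words would also score highly; the random stand-ins here only guarantee that a word matches itself.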

Abstract

We propose Reasoning to Ground (R2G), a neural symbolic model that grounds target objects within 3D scenes through explicit reasoning. In contrast to prior work, R2G explicitly models the 3D scene with a semantic concept-based scene graph and recurrently simulates attention transfer across object entities, which makes the process of grounding the target object with the highest probability interpretable. Specifically, we embed multiple object properties within the graph nodes and the spatial relations among entities within the edges, using a predefined semantic vocabulary. To guide attention transfer, we employ learning-based or prompting-based methods to analyze the referential utterance and convert it into reasoning instructions within the same semantic space. In each reasoning round, R2G either (1) merges the current attention distribution with the similarity between the instruction and the embedded entity properties or (2) shifts the attention across the scene graph based on the similarity between the instruction and the embedded spatial relations. Experiments on the Sr3D/Nr3D benchmarks show that R2G achieves results comparable to prior work while offering improved interpretability, opening a new path for 3D language grounding.
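The two kinds of reasoning round described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's formulation: "merge" is assumed to be an element-wise product followed by renormalization, "shift" is assumed to be a similarity-weighted transfer along directed edges, and all similarity scores are hand-picked rather than computed from embeddings. The function names are hypothetical.

```python
def normalize(scores):
    """Rescale non-negative scores so they sum to 1 (uniform if all zero)."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def merge_attention(attention, node_similarity):
    """Round type (1): combine attention with instruction-to-node similarity.

    Assumption: merging is an element-wise product, then renormalization.
    """
    return normalize([a * s for a, s in zip(attention, node_similarity)])


def shift_attention(attention, edge_similarity):
    """Round type (2): move attention along directed edges.

    edge_similarity[i][j] is the assumed similarity between the instruction
    and the relation stored on the edge from node i to node j.
    """
    n = len(attention)
    shifted = [
        sum(attention[i] * edge_similarity[i][j] for i in range(n))
        for j in range(n)
    ]
    return normalize(shifted)


# Toy scene for "the lamp to the left of the table" (hand-picked scores).
objects = ["table", "lamp", "chair"]
attention = normalize([1.0, 1.0, 1.0])           # start from uniform attention

# Round 1: attend to the anchor object mentioned in the utterance ("table").
attention = merge_attention(attention, [0.9, 0.05, 0.05])

# Round 2: follow "left of" edges leaving the currently attended node.
left_of = [
    [0.0, 0.8, 0.1],   # edges from table
    [0.1, 0.0, 0.1],   # edges from lamp
    [0.1, 0.1, 0.0],   # edges from chair
]
attention = shift_attention(attention, left_of)

# Round 3: keep only objects matching the target category ("lamp").
attention = merge_attention(attention, [0.05, 0.9, 0.05])

target = objects[max(range(len(objects)), key=lambda i: attention[i])]
```

Because the attention vector is inspectable after every round, each intermediate step can be visualized, which is the source of the interpretability the abstract emphasizes.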

Paper Structure (29 sections, 10 equations, 7 figures, 4 tables)

Introduction
Related Work
Visual Grounding
Visual Reasoning
Method
Semantic Representation
Scene Graph Construction
Object State
Spatial Relation
Instruction Generation
Learning-based
LLM-based
Reasoning
Training Loss
Implementation Details
...and 14 more sections

Figures (7)

Figure 1: Comparison between R2G and previous models. The prior works (bottom) focus on matching the utterance feature with the object proposal's features to select the target object with the highest probability in an end-to-end manner. In contrast, R2G (top) grounds the target object step by step via human-like attention transferring across the scene graph, using the parsed language description as guidance.
Figure 2: Overview of R2G. R2G represents the 3D scene with a semantic concept-based scene graph and parses the referential utterance into instructions to guide the attention transferring across the scene graph in a reasoning manner. After several reasoning rounds, we localize the target object with the highest attention score.
Figure 3: Attention transferring. R2G transfers the attention from the source node to the target node along the directed edge, guided by the spatial-relation-related instruction.
Figure 4: Qualitative results. We visualize two examples of the attention-transferring process on Sr3D over three reasoning rounds. R2G gradually focuses more on the target object. For clarity, we show the attention scores of only a subset of the objects.
Figure 5: Qualitative analysis of the end-to-end model. The end-to-end model produces an erroneous classification (a) but successfully achieves a correct grounding result (b).
...and 2 more figures