B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding

Feng Xiao, Hongbin Xu, Hai Ci, Wenxiong Kang

TL;DR

The paper tackles 3D object grounding under complex natural language that describes multiple spatial relations. It introduces B2N3D, a progressive relational learning framework that moves from binary to n-ary relationships guided by LLM-derived soft labels, coupled with an attention-driven graph network that fuses multi-modal information on a scene graph built from the top-ranked n-ary combinations. Key contributions include the B2N-PRL module, soft relational label generation, and an attention-based graph learning pipeline that yields state-of-the-art results on Nr3D, Sr3D, and ScanRefer. The work improves global relational perception in cluttered 3D scenes, supporting more robust natural-language-guided grounding for robot vision.

Abstract

Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods only model relationships between pairs of objects, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. Given the absence of specific annotations for referred objects in the training data, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks show that our method outperforms state-of-the-art approaches and demonstrate the advantages of n-ary relational perception in 3D localization.
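The binary-to-n-ary progression described above can be illustrated with a small toy sketch. It is not the authors' implementation: the learned relation scores are replaced by a hand-written distance heuristic (`pair_score`), the object positions are invented, and the parameters `k1` and `k2` as well as the choice to prune after every added entity are assumptions made only for illustration. The object names follow the example combination in Figure 1.

```python
from itertools import product
from math import dist

# Toy scene: object id -> (class name, 2D centre). All positions are made up.
SCENE = {
    "plant_9": ("plant", (1.0, 1.0)),
    "plant_12": ("plant", (8.0, 8.0)),
    "desk_43": ("desk", (1.5, 1.0)),
    "desk_7": ("desk", (8.0, 2.0)),
    "bookcase_34": ("bookcase", (2.5, 1.0)),
}


def candidates(name):
    """All object ids in the scene whose class matches an entity name."""
    return [oid for oid, (cls, _) in SCENE.items() if cls == name]


def pair_score(a, b):
    """Hypothetical stand-in for a learned binary score: closer is better."""
    return -dist(SCENE[a][1], SCENE[b][1])


def top_k(scored, k):
    """Keep the k highest-scoring (score, combination) entries."""
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]


def progressive_grounding(entities, k1, k2):
    """Score pairs first, then grow the survivors into n-ary combinations."""
    first, second, *rest = entities
    # Binary stage: score every pair of candidates and keep the best k1.
    beam = top_k(
        [(pair_score(a, b), (a, b))
         for a, b in product(candidates(first), candidates(second))],
        k1,
    )
    # N-ary stage: extend each surviving combination by one entity at a time.
    for name in rest:
        grown = []
        for score, combo in beam:
            for c in candidates(name):
                extra = sum(pair_score(o, c) for o in combo)
                grown.append((score + extra, combo + (c,)))
        beam = top_k(grown, k2)
    return [combo for _, combo in beam]


best = progressive_grounding(["plant", "desk", "bookcase"], k1=2, k2=1)
print(best[0])  # ('plant_9', 'desk_43', 'bookcase_34')
```

Pruning at the binary stage keeps the number of n-ary candidates small: only the `k1` surviving pairs are extended, rather than every possible triple of matching objects.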

Paper Structure

This paper contains 24 sections, 9 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The target or referred objects mentioned in the description often have multiple similar instances in the scene, such as "plant" and "desk" in the figure. Accurately localizing the target requires identifying the correct combination of objects that jointly satisfies the multi-relational constraints ("plant_9 desk_43 bookcase_34"). Compared to previous methods, which only perceive the relationships of paired objects, our method employs a progressive approach to learn relational composition, achieving global perception of n-ary relationships.
  • Figure 2: The overall structure of our method B2N3D. It mainly consists of the textual and visual encoders, the B2N-PRL module, and the attention-driven graph learning module. We encapsulate textual descriptions as prompt inputs to the LLM and obtain entity combinations for training. Based on the uncertainty of referred objects (only entity names are known), we design losses $L_{br}$ and $L_{nr}$ to supervise the two relationship modeling processes in the B2N-PRL module. A scene graph is established from the predicted n-ary relationships, and multi-modal fusion is achieved through a graph-structured network. The object with the highest confidence is finally output as the predicted target.
  • Figure 3: The structure of the B2N-PRL module. The inputs are object features $O_i \in \mathbb{R}^{B \times N \times C}$ and text features $T \in \mathbb{R}^{B \times C}$, where $B$ denotes the batch size and $N$ is the number of objects in one scene. Both features share the same channel dimension $C$. B2N-PRL sequentially models binary and n-ary relationships, ultimately outputting the $K_2$ entity combinations that best match the textual content.
  • Figure 4: The graph-structured multi-modal network. The node features of the scene graph are input into the network. The graph node first fuses the binary features extracted from the B2N-PRL module and enhances the expression of spatial relationships through self-attention. Next, the text feature is fused with the node feature via cross-attention. The network contains $N$ graph attention layers to aggregate the node features for global relational perception.
  • Figure 5: The visualization examples of B2N3D on the human-annotated datasets Nr3D and ScanRefer. The first row is the result of the model without progressive relational learning. 3D bounding boxes represent the grounding results and ground truth. The entity words in the description are marked in orange for the target and in blue for the referred objects.
  • ...and 2 more figures