B2N3D: Progressive Learning from Binary to N-ary Relationships for 3D Object Grounding
Feng Xiao, Hongbin Xu, Hai Ci, Wenxiong Kang
TL;DR
The paper tackles 3D object grounding for complex natural-language descriptions that involve multiple spatial relations. It introduces B2N3D, a progressive relational learning framework that extends relation modeling from binary to n-ary relationships under the guidance of LLM-derived soft labels. An attention-driven graph network then fuses multi-modal information on a scene graph built from the top n-ary combinations. Key contributions include the B2N-PRL module, soft relational label generation, and an attention-based graph learning pipeline, which together yield state-of-the-art results on Nr3D, Sr3D, and ScanRefer. The work improves global relational perception in cluttered 3D scenes, which supports more robust language-guided grounding for robot vision.
Abstract
Localizing 3D objects using natural language is essential for robotic scene understanding. The descriptions often involve multiple spatial relationships to distinguish similar objects, making 3D-language alignment difficult. Current methods model relationships only between pairs of objects, ignoring the global perceptual significance of n-ary combinations in multi-modal relational understanding. To address this, we propose a novel progressive relational learning framework for 3D object grounding. We extend relational learning from binary to n-ary to identify visual relations that match the referential description globally. Because the training data lacks specific annotations for the referred objects, we design a grouped supervision loss to facilitate n-ary relational learning. In the scene graph created with n-ary relationships, we use a multi-modal network with hybrid attention mechanisms to further localize the target within the n-ary combinations. Experiments and ablation studies on the ReferIt3D and ScanRefer benchmarks demonstrate that our method outperforms the state of the art and confirm the advantages of n-ary relational perception in 3D localization.
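To make the binary-to-n-ary idea concrete, here is a minimal sketch of progressively growing object combinations from scored pairs. It is not the authors' implementation. The function name `top_nary_combinations`, the mean-of-pairwise-scores aggregate, and the top-k pruning at each step are all assumptions chosen for illustration; in the paper the relational scores are learned, not supplied by hand.

```python
from itertools import combinations


def top_nary_combinations(pair_scores, num_objects, n, k):
    """Grow object combinations from pairs up to size n, keeping the k best per step.

    pair_scores maps (i, j) with i < j to a relation score (hypothetical input).
    """

    def score(combo):
        # Assumed aggregate: mean of all pairwise scores inside the combination.
        pairs = list(combinations(combo, 2))
        return sum(pair_scores[p] for p in pairs) / len(pairs)

    def rank(candidates):
        # Sort by descending score; break ties by the combination itself.
        return sorted(candidates, key=lambda c: (-score(c), c))[:k]

    # Binary stage: rank every object pair.
    current = rank(combinations(range(num_objects), 2))

    # Progressive stage: extend each kept combination by one object at a time.
    for _ in range(n - 2):
        candidates = set()
        for combo in current:
            for obj in range(num_objects):
                if obj not in combo:
                    candidates.add(tuple(sorted(combo + (obj,))))
        current = rank(candidates)

    return current
```

Calling it with `n=2` returns only the top-scoring pairs, while a larger `n` extends those pairs one object at a time. Pruning to `k` candidates at each step keeps the search from enumerating every n-ary subset of the scene.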
