Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding
Yue Xu, Kaizhi Yang, Jiebo Luo, Xuejin Chen
TL;DR
The paper addresses 3D visual grounding, the task of aligning natural language with 3D point clouds. It proposes DASANet, a dual-branch transformer network that decouples both language and vision into object-attribute and spatial-relation streams and uses self- and cross-attention to fuse global context with fine-grained cues. A CLIP-style alignment with a GTAS training strategy enables explicit disentanglement of attributes and spatial relations, yielding state-of-the-art accuracy on Nr3D (65.1%), strong performance on Sr3D, and interpretable branch-wise scores. The approach improves fine-grained cross-modal alignment and offers insight into spatial reasoning in challenging 3D scenes, with potential impact on embodied perception and robotics. Future work points to data augmentation and richer cross-modal supervision to further improve robustness and generalization.
Abstract
3D visual grounding is an emerging research area dedicated to connecting the 3D physical world with natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual Attribute-Spatial relation Alignment Network that separately models and aligns object attributes and spatial relation features between the language and 3D vision modalities. We decompose both the language and the 3D point cloud inputs into two separate parts and design a dual-branch attention module that models the decomposed inputs separately while preserving global context in attribute-spatial feature fusion through cross-attention. DASANet achieves the highest grounding accuracy, 65.1%, on the Nr3D dataset, 1.3 percentage points higher than the best competitor. In addition, visualizations of the two branches show that our method is efficient and highly interpretable.
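To make the dual-branch idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each candidate object and the query have already been encoded into separate attribute and spatial-relation vectors, scores each branch by cosine similarity, and simply sums the two scores. The function names, the dictionary keys, the toy vectors, and the additive fusion are all assumptions made for illustration; the attention modules described in the abstract are omitted entirely.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ground(objects, text_attr, text_spatial):
    """Score every candidate object in two branches and pick the best one.

    objects: list of dicts with hypothetical keys "attr" and "spatial".
    Returns (index of best object, list of per-object branch scores).
    """
    scores = []
    for obj in objects:
        attr_score = cosine(obj["attr"], text_attr)
        spatial_score = cosine(obj["spatial"], text_spatial)
        # Assumption: the branches are fused by a plain sum.
        scores.append({
            "attr": attr_score,
            "spatial": spatial_score,
            "total": attr_score + spatial_score,
        })
    best = max(range(len(scores)), key=lambda i: scores[i]["total"])
    return best, scores


# Toy scene: three candidates with made-up 2-D features.
objects = [
    {"attr": [1.0, 0.0], "spatial": [0.0, 1.0]},  # matching attributes, wrong relation
    {"attr": [1.0, 0.0], "spatial": [1.0, 0.0]},  # matching attributes and relation
    {"attr": [0.0, 1.0], "spatial": [1.0, 0.0]},  # wrong attributes, matching relation
]
text_attr = [1.0, 0.0]      # stand-in for the attribute part of the query
text_spatial = [1.0, 0.0]   # stand-in for the spatial-relation part of the query

best, scores = ground(objects, text_attr, text_spatial)
for i, s in enumerate(scores):
    print(i, round(s["attr"], 3), round(s["spatial"], 3), round(s["total"], 3))
print("selected object:", best)
```

Keeping the two scores separate until the final sum is what makes such a scheme inspectable: for any candidate, one can see whether it was favored because of its attributes, its spatial relations, or both.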
