Dual Attribute-Spatial Relation Alignment for 3D Visual Grounding

Yue Xu, Kaizhi Yang, Jiebo Luo, Xuejin Chen

TL;DR

The paper addresses 3D visual grounding by aligning natural language with 3D point clouds. It proposes DASANet, a dual-branch transformer network that decouples language and vision into object-attribute and spatial-relation streams, using self- and cross-attention to fuse global context with fine-grained cues. A CLIP-style alignment with a GTAS training strategy enables explicit disentanglement of attributes and spatial relations, achieving state-of-the-art Nr3D accuracy (65.1%) and strong Sr3D performance, along with interpretable branch-wise scores. The approach improves fine-grained cross-modal alignment and provides insights into spatial reasoning under challenging 3D scenes, with potential impact on embodied perception and robotics. Future work points to data augmentation and richer cross-modal supervision to further enhance robustness and generalization.

Abstract

3D visual grounding is an emerging research area dedicated to connecting the 3D physical world with natural language, which is crucial for achieving embodied intelligence. In this paper, we propose DASANet, a Dual Attribute-Spatial relation Alignment Network that separately models and aligns object attributes and spatial relation features between the language and 3D vision modalities. We decompose both the language and the 3D point cloud input into two parts and design a dual-branch attention module that models the decomposed inputs separately while preserving global context through cross-attention in the attribute-spatial feature fusion. DASANet achieves the highest grounding accuracy on the Nr3D dataset, 65.1%, which is 1.3% higher than the best competitor. In addition, visualization of the two branches shows that our method is efficient and highly interpretable.
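
As a rough illustration of the fusion step described in the abstract, the sketch below lets attribute features attend to spatial-relation features, and vice versa, using plain scaled dot-product cross-attention. The function names (`cross_attention`, `fuse_branches`), the residual addition, and the absence of learned projections and multiple heads are all assumptions made for brevity; the paper's actual module may differ.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all key/value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(logits)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse_branches(attr_feats, spatial_feats):
    """Hypothetical fusion: each branch adds context gathered from the other branch.

    Assumption: a residual sum with no learned weights. This only illustrates
    how cross-attention can carry global context between the two streams.
    """
    attr_ctx = cross_attention(attr_feats, spatial_feats, spatial_feats)
    spatial_ctx = cross_attention(spatial_feats, attr_feats, attr_feats)
    attr_out = [[a + c for a, c in zip(row, ctx)] for row, ctx in zip(attr_feats, attr_ctx)]
    spatial_out = [[s + c for s, c in zip(row, ctx)] for row, ctx in zip(spatial_feats, spatial_ctx)]
    return attr_out, spatial_out


# Toy input: two attribute tokens and one spatial token, feature size 2.
attr_feats = [[1.0, 0.0], [0.0, 1.0]]
spatial_feats = [[0.5, 0.5]]
attr_out, spatial_out = fuse_branches(attr_feats, spatial_feats)
```

Each branch keeps its own feature sequence, so the two streams stay separate for branch-wise scoring while still seeing information from the other stream.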

Paper Structure (12 sections, 5 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Grounding network architectures that perform feature embedding and cross-modal fusion at different granularities.
  • Figure 2: Overview of our DASANet. Both the 3D point cloud and the text input are first decomposed into an object part and a spatial part. Our dual-branch network, which contains an attribute branch and a spatial relation branch, performs fusion and reasoning on these two aspects respectively. We combine the object scores of the two branches to obtain the final grounding result.
  • Figure 3: Illustration of the attribute attention module.
  • Figure 4: Qualitative comparison of the grounding results in the Nr3D dataset. Our grounding results are highlighted with yellow boxes, and the results from other methods are presented with blue boxes. In the ground truth, green boxes represent the target objects, while red boxes denote distractors (objects of the same category as the target).
  • Figure 5: Visualization of the attribute, spatial relation, and overall scores in our dual-branch network.
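
The captions for Figures 2 and 5 describe per-object scores from an attribute branch and a spatial relation branch that are combined into an overall score. A minimal sketch of that idea follows, assuming the two scores are simply summed per object and the highest-scoring object is selected. The summation rule and the name `combine_branch_scores` are illustrative assumptions, not the combination confirmed by the source.

```python
def combine_branch_scores(attr_scores, spatial_scores):
    """Combine per-object scores from the two branches and pick the best object.

    Assumption: the overall score is the element-wise sum of the branch scores.
    """
    if len(attr_scores) != len(spatial_scores):
        raise ValueError("both branches must score the same set of objects")
    overall = [a + s for a, s in zip(attr_scores, spatial_scores)]
    best = max(range(len(overall)), key=overall.__getitem__)
    return overall, best


# Toy scene with three candidate objects. Objects 0 and 1 match the attribute
# description equally well (e.g. two chairs of the same kind), so only the
# spatial relation score can separate the target from the distractor.
attr_scores = [2.0, 2.0, -1.0]
spatial_scores = [-0.5, 1.5, 1.0]
overall, best = combine_branch_scores(attr_scores, spatial_scores)
```

Keeping the two score lists separate until the final step is what makes the kind of branch-wise inspection described for Figure 5 possible: one can check whether a wrong prediction came from the attribute side or the spatial side.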