SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention

Feng Xiao, Hongbin Xu, Qiuxia Wu, Wenxiong Kang

TL;DR

SeCG is a semantic-enhanced relational learning model built on a graph network with the authors' memory graph attention layer. It replaces the original language-independent encoding with cross-modal encoding in visual analysis.

Abstract

3D visual grounding aims to automatically locate the 3D region of a specified object given the corresponding textual description. Existing works fail to distinguish similar objects, especially when multiple referred objects are involved in the description. Experiments show that directly matching the language and visual modalities has limited capacity to comprehend complex referential relationships in utterances. This is mainly due to the interference caused by redundant visual information in cross-modal alignment. To strengthen relation-oriented mapping between different modalities, we propose SeCG, a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer. Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis. More text-related feature expressions are obtained through the guidance of global semantics and implicit relationships. Experimental results on the ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms existing state-of-the-art methods, particularly improving localization performance on multi-relation challenges.

Paper Structure (22 sections, 7 equations, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Comparison of results without (a) and with (b) our multi-relation improvement; (c) shows the ground truth and related objects. In the utterances, green words are target names and blue words are references. The decomposed pairwise relationships are framed in the text, and each corresponds to the dashed line of the same color in the pictures above.
  • Figure 2: The overall architecture of our proposed model, SeCG, which consists of four modules: semantic-enhanced encoding, relation graph learning, text encoding, and Transformer decoding. The inputs are segmented instance point clouds and a referential utterance; the semantic point cloud is generated as an intermediate result. In each scene, $N$ objects are used to construct the relation graph. The text generates an $m$-dimensional memory matrix in each graph updating layer.
  • Figure 3: The implementation of our proposed memory graph attention (MGA) layer. Given a fully connected graph with object features as nodes, a multi-modal memory unit is added to the key and value of the attention operator to update the node values.
  • Figure 4: Illustration of a two-layer graph network built from MGA layers. The positional embedding of the $R$ views is applied only in the first layer.
  • Figure 5: Visualization results on visual grounding samples with multiple referred objects from Nr3D. Words that represent targets are highlighted in green, and reference words are highlighted in blue. Our localization results are compared with those of MVT [huang2022multi], which directly matches text features with multi-view visual features.
  • ...and 1 more figure
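
The Figure 3 caption describes the MGA layer only at a high level: object features form the nodes of a fully connected graph, and a memory unit is added to the key and value of the attention operator. A minimal sketch of that idea, under stated assumptions, might look like the following. The function name `memory_graph_attention`, the single attention head, the omission of learned query/key/value projections, and the toy feature values are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(row):
    # Numerically stable softmax over one list of scores.
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def memory_graph_attention(nodes, memory_keys, memory_values):
    """Update each node by attending over all nodes plus memory slots.

    nodes:         N object feature vectors (nodes of a fully connected graph)
    memory_keys:   m key vectors of a hypothetical text-derived memory unit
    memory_values: m value vectors of the same memory unit
    Assumption: learned projections are omitted (identity) for brevity.
    """
    dim = len(nodes[0])
    keys = nodes + memory_keys      # N + m keys: every node plus memory
    values = nodes + memory_values  # N + m values, in the same order
    scale = math.sqrt(dim)
    updated = []
    for query in nodes:
        # Scaled dot-product score of this node against each key.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in keys]
        weights = softmax(scores)
        # New node value: attention-weighted sum of the values.
        updated.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(dim)
        ])
    return updated


# Toy example: N = 3 objects, m = 1 memory slot, feature size 2.
nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
memory_keys = [[0.5, 0.5]]
memory_values = [[2.0, 2.0]]
updated = memory_graph_attention(nodes, memory_keys, memory_values)
```

Because the memory slots appear only among the keys and values, the output still has one vector per object, so layers of this kind can be stacked, as in the two-layer structure of Figure 4.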