RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation

Changli Wu, Qi Chen, Jiayi Ji, Haowei Wang, Yiwei Ma, You Huang, Gen Luo, Hao Fei, Xiaoshuai Sun, Rongrong Ji

TL;DR

RG-SAN tackles 3D-RES by strengthening spatial reasoning among all entities described in the text. It couples a Text-driven Localization Module (TLM), which initializes and iteratively refines noun positions, with a Rule-guided Weak Supervision (RWS) strategy based on dependency-tree rules, which supervises only the target position but influences the other nouns through spatial rules. The method uses cross-modal interactions with absolute and relative positional encodings to fuse text and 3D features, achieving state-of-the-art results on ScanRefer with a 5.1-point gain in mIoU and improved robustness to spatially ambiguous descriptions. The work demonstrates the value of explicit spatial awareness in 3D vision-language grounding and offers an open-source pipeline for end-to-end 3D-RES.

Abstract

3D Referring Expression Segmentation (3D-RES) aims to segment 3D objects by correlating referring expressions with point clouds. However, traditional approaches frequently encounter issues like over-segmentation or mis-segmentation, due to insufficient emphasis on the spatial information of instances. In this paper, we introduce a Rule-Guided Spatial Awareness Network (RG-SAN) that uses solely the spatial information of the target instance for supervision. This approach enables the network to accurately depict the spatial relationships among all entities described in the text, thus enhancing its reasoning capabilities. RG-SAN consists of the Text-driven Localization Module (TLM) and the Rule-guided Weak Supervision (RWS) strategy. The TLM initially locates all mentioned instances and iteratively refines their positional information. The RWS strategy, acknowledging that only target objects have supervised positional information, employs dependency tree rules to precisely guide the core instance's positioning. Extensive testing on the ScanRefer benchmark shows that RG-SAN not only establishes new performance benchmarks, with an mIoU increase of 5.1 points, but also exhibits significant improvements in robustness when processing descriptions with spatial ambiguity. The code is available at https://github.com/sosppxo/RG-SAN.
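To make the weak-supervision idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a dependency parse is already available as a list of tokens, picks the noun that denotes the target using a simple illustrative rule (the root noun, else the nominal subject of the root), and applies a position loss to that token only. The `Token` class, the function names, the selection rule, and the squared-error loss are all hypothetical choices made for illustration; the rules actually used by RWS are not specified in this summary.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    pos: str   # coarse part-of-speech tag, e.g. "NOUN"
    head: int  # index of the syntactic head; the root points to itself
    dep: str   # dependency label, e.g. "nsubj"


def find_target_noun(tokens):
    """Illustrative rule (an assumption): the root noun, else the root's
    nominal subject, else the first noun in the sentence."""
    root = next(i for i, t in enumerate(tokens) if t.head == i)
    if tokens[root].pos == "NOUN":
        return root
    for i, t in enumerate(tokens):
        if t.head == root and t.dep == "nsubj" and t.pos == "NOUN":
            return i
    for i, t in enumerate(tokens):
        if t.pos == "NOUN":
            return i
    return None


def target_position_loss(pred_positions, target_idx, gt_center):
    """Squared distance between the target token's predicted position and
    the ground-truth center. Other tokens receive no direct supervision."""
    pred = pred_positions[target_idx]
    return sum((p - g) ** 2 for p, g in zip(pred, gt_center))


# Hand-written parse of "the chair is next to the table" (hypothetical example).
sentence = [
    Token("the", "DET", 1, "det"),
    Token("chair", "NOUN", 2, "nsubj"),
    Token("is", "AUX", 2, "ROOT"),
    Token("next", "ADV", 2, "advmod"),
    Token("to", "ADP", 3, "prep"),
    Token("the", "DET", 6, "det"),
    Token("table", "NOUN", 4, "pobj"),
]

target_idx = find_target_noun(sentence)
# One predicted (x, y, z) position per token; values are made up.
pred_positions = [(0.0, 0.0, 0.0)] * 7
pred_positions[target_idx] = (1.0, 2.0, 0.5)
loss = target_position_loss(pred_positions, target_idx, (1.0, 2.5, 0.5))
```

In this sketch only "chair" contributes to the loss. The position of "table" would be shaped indirectly, through the shared network and the spatial relations in the text, which is the effect the abstract attributes to RWS.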

Paper Structure

This paper contains 37 sections, 16 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: Illustration with a target object and multiple auxiliary objects, associated with a referring expression. The target marked in green represents the main referred instance, while targets in other colors indicate other mentioned entities. This visual highlights the challenge of effectively completing semantic reasoning in the absence of spatial inference.
  • Figure 2: An overview of the proposed RG-SAN. This model analyzes a point cloud and a textual description with $\mathcal{N}_t$ tokens, extracting superpoints and word-level features. The TLM assigns spatial positions to tokens, facilitating multimodal fusion. The RWS strategy enables the model to learn the positions of all mentioned entities using only the supervision of the target position.
  • Figure 3: Visualization of all the nouns in the textual description. Our RG-SAN can segment instances corresponding to different nouns, while 3D-STMN indiscriminately assigns all nouns to the target instance. Zoom in for best view.
  • Figure 4: Statistics of samples in the ScanRefer dataset based on the presence of spatial relation descriptions, where "spatial" represents samples with spatially related descriptions, while "w/o spatial" denotes spatially unrelated samples.
  • Figure 5: Qualitative comparison between the proposed RG-SAN and 3D-STMN. Zoom in for best view.
  • ...and 1 more figure
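The Figure 2 caption says the TLM assigns spatial positions to tokens. One way to picture such an assignment is sketched below, under an assumption that is not taken from the paper: each token's initial position is the attention-weighted average of superpoint centers, with weights given by a softmax over token-superpoint feature similarity. The function names and the dot-product similarity are hypothetical, and the iterative refinement step is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def init_token_positions(token_feats, superpoint_feats, superpoint_centers):
    """Assumed initialization: for each token, average the superpoint centers
    using softmax-normalized feature similarity as weights."""
    positions = []
    for t in token_feats:
        weights = softmax([dot(t, s) for s in superpoint_feats])
        pos = [
            sum(w * c[d] for w, c in zip(weights, superpoint_centers))
            for d in range(3)
        ]
        positions.append(pos)
    return positions


# Two superpoints with made-up features and centers.
superpoint_feats = [[1.0, 0.0], [0.0, 1.0]]
superpoint_centers = [(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)]

# The first token matches both superpoints equally; the second strongly
# matches the first superpoint.
token_feats = [[0.0, 0.0], [10.0, 0.0]]
positions = init_token_positions(token_feats, superpoint_feats, superpoint_centers)
```

A token with no preference lands at the mean of the centers, while a token that matches one superpoint strongly lands close to that superpoint's center.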