3D-GRES: Generalized 3D Referring Expression Segmentation

Changli Wu, Yihang Liu, Jiayi Ji, Yiwei Ma, Haowei Wang, Gen Luo, Henghui Ding, Xiaoshuai Sun, Rongrong Ji

TL;DR

This work defines 3D-GRES, a generalized task that segments any number of targets in 3D point clouds from natural language. It introduces MDIN, a Multi-Query Decoupled Interaction Network, featuring Text-driven Sparse Queries (TSQ) and Multi-object Decoupling Optimization (MDO) to decouple and align multiple target queries with visual and linguistic cues. The approach achieves state-of-the-art performance on the Multi3DRes dataset and demonstrates strong improvements over traditional 3D-RES methods, particularly in zero- and multi-target scenarios. The work advances practical 3D understanding for robotics and interactive systems by enabling flexible, language-guided multi-object segmentation in complex scenes.
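To make the "any number of targets" setting concrete, here is a minimal sketch of how a prediction might be scored when an expression refers to zero, one, or several objects. The function names, the point-index set representation, and the convention that an empty prediction against an empty ground truth scores 1.0 are all assumptions made for illustration; they are not taken from the paper or the Multi3DRes benchmark.

```python
from typing import List, Set


def merge_target_masks(per_target_masks: List[Set[int]]) -> Set[int]:
    """Union the point-index sets of all referred targets.

    An empty list (a zero-target expression) yields an empty mask.
    """
    merged: Set[int] = set()
    for mask in per_target_masks:
        merged |= mask
    return merged


def mask_iou(pred: Set[int], gt: Set[int]) -> float:
    """IoU between two point-index sets.

    Assumed convention (hypothetical): if both sets are empty,
    the zero-target case was handled correctly and the score is 1.0.
    """
    union = pred | gt
    if not union:
        return 1.0
    return len(pred & gt) / len(union)


# Usage: two referred chairs versus a prediction that found only one of them.
gt_mask = merge_target_masks([{0, 1, 2}, {7, 8}])
pred_mask = {0, 1, 2}
score = mask_iou(pred_mask, gt_mask)
```

Under this convention a model that hallucinates a mask for a zero-target expression scores 0, which is why zero-target cases are a meaningful test of the generalized task.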

Abstract

3D Referring Expression Segmentation (3D-RES) is dedicated to segmenting a specific instance within a 3D space based on a natural language description. However, current approaches are limited to segmenting a single target, restricting the versatility of the task. To overcome this limitation, we introduce Generalized 3D Referring Expression Segmentation (3D-GRES), which extends the capability to segment any number of instances based on natural language instructions. In addressing this broader task, we propose the Multi-Query Decoupled Interaction Network (MDIN), designed to break down multi-object segmentation tasks into simpler, individual segmentations. MDIN comprises two fundamental components: Text-driven Sparse Queries (TSQ) and Multi-object Decoupling Optimization (MDO). TSQ generates sparse point cloud features distributed over key targets as the initialization for queries. Meanwhile, MDO is tasked with assigning each target in multi-object scenarios to different queries while maintaining their semantic consistency. To adapt to this new task, we build a new dataset, namely Multi3DRes. Our comprehensive evaluations on this dataset demonstrate substantial enhancements over existing models, thus charting a new path for intricate multi-object 3D scene comprehension. The benchmark and code are available at https://github.com/sosppxo/MDIN.
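The idea of assigning each target to a different query can be illustrated with a small sketch. The abstract does not state the actual assignment rule used by MDO, so the nearest-position greedy matching below, along with the names `assign_targets_to_queries`, `query_positions`, and `target_centers`, is purely a hypothetical stand-in for that step.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def assign_targets_to_queries(
    query_positions: List[Point], target_centers: List[Point]
) -> Dict[int, int]:
    """Greedily give each target its nearest still-unassigned query.

    Returns a mapping {target_index: query_index}. Every query is used
    at most once, so distinct targets end up on distinct queries.
    This is an illustrative assumption, not the paper's procedure.
    """
    # All (distance, target, query) pairs, closest first.
    pairs = sorted(
        (math.dist(q, t), ti, qi)
        for qi, q in enumerate(query_positions)
        for ti, t in enumerate(target_centers)
    )
    assignment: Dict[int, int] = {}
    used_queries = set()
    for _, ti, qi in pairs:
        if ti in assignment or qi in used_queries:
            continue
        assignment[ti] = qi
        used_queries.add(qi)
    return assignment


# Usage: three sparse queries, two referred targets.
queries = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
targets = [(9.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
matching = assign_targets_to_queries(queries, targets)
```

Queries left unmatched (here the middle one) would carry no target under this sketch, and a zero-target expression simply produces an empty mapping.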

Paper Structure (39 sections, 24 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Traditional 3D-RES is limited to single-target cases (1). In contrast, 3D-GRES can handle scenarios with any number of targets, including no target (2), single target, and multiple targets (3-5).
  • Figure 2: The overall framework of MDIN, comprising its core modules TSQ and MDO. The input point cloud and text undergo feature extraction before being fed into the TSQ module to extract sparse decoupled queries. Subsequently, the MDIN module performs multimodal fusion and prediction. Finally, the MDO module carries out decoupled optimization.
  • Figure 3: Qualitative comparison between the proposed MDIN and 3D-STMN. Zoom in for the best view.
  • Figure 4: The coverage rate and repetition rate of seed queries for all instances / Ground Truth instances in the scene.
  • Figure 5: Qualitative comparison between the proposed MDIN and 3D-STMN on multi-target cases. Zoom in for the best view.
  • ...and 1 more figure