IPDN: Image-enhanced Prompt Decoding Network for 3D Referring Expression Segmentation

Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, Danni Yang, Xiaoshuai Sun

TL;DR

This work tackles 3D Referring Expression Segmentation by addressing feature ambiguity and intent ambiguity in point-cloud scenes. It introduces IPDN, which combines two modules: Multi-view Semantic Embedding (MSE), which uses Spatial-aware Attention to fuse CLIP-based multi-view 2D semantics into 3D features, and a Prompt-Aware Decoder (PAD), which derives task-driven prompts from text–visual interactions. The approach achieves state-of-the-art results on ScanRefer and Multi3DRefer, improving mIoU by 1.9 points on 3D-RES and 4.2 points on 3D-GRES, while remaining robust in long-tail and distractor-heavy scenarios. By combining large-scale 2D pretraining with prompt-guided decoding, IPDN improves cross-modal 3D grounding in real-world scenes.

Abstract

3D Referring Expression Segmentation (3D-RES) aims to segment point cloud scenes based on a given expression. However, existing 3D-RES approaches face two major challenges: feature ambiguity and intent ambiguity. Feature ambiguity arises from information loss or distortion during point cloud acquisition due to limitations such as lighting and viewpoint. Intent ambiguity refers to the model's equal treatment of all queries during the decoding process, lacking top-down task-specific guidance. In this paper, we introduce an Image-enhanced Prompt Decoding Network (IPDN), which leverages multi-view images and task-driven information to enhance the model's reasoning capabilities. To address feature ambiguity, we propose the Multi-view Semantic Embedding (MSE) module, which injects multi-view 2D image information into the 3D scene and compensates for potential spatial information loss. To tackle intent ambiguity, we design a Prompt-Aware Decoder (PAD) that guides the decoding process by deriving task-driven signals from the interaction between the expression and visual features. Comprehensive experiments demonstrate that IPDN outperforms the state-of-the-art by 1.9 and 4.2 points in mIoU on the 3D-RES and 3D-GRES tasks, respectively.

Paper Structure

This paper contains 29 sections, 17 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: The pipeline of (a) the previous traditional query-based framework and (b) our method.
  • Figure 2: The overview of our framework.
  • Figure 3: Qualitative comparison between MDIN and ours.