LESS: Label-Efficient and Single-Stage Referring 3D Segmentation

Xuexun Liu, Xiaoxu Xu, Jinlong Li, Qiudan Zhang, Xu Wang, Nicu Sebe, Lin Ma

TL;DR

LESS, a novel Label-Efficient and Single-Stage pipeline for Referring 3D Segmentation, achieves state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU while using only binary labels.

Abstract

Referring 3D Segmentation is a visual-language task that segments all points of the object described by a query sentence from a 3D point cloud. Previous works follow a two-stage paradigm, first conducting language-agnostic instance segmentation and then matching the resulting proposals with the given text query. However, the semantic concepts from the text query and the visual cues interact only separately during training, and both instance and semantic labels are required for each object, which is time-consuming and labor-intensive. To mitigate these issues, we propose a novel Referring 3D Segmentation pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks. Specifically, we design a Point-Word Cross-Modal Alignment module for aligning the fine-grained features of points and textual embeddings. A Query Mask Predictor module and a Query-Sentence Alignment module are introduced for coarse-grained alignment between masks and the query. Furthermore, we propose an area regularization loss, which coarsely reduces irrelevant background predictions on a large scale. In addition, a point-to-point contrastive loss is proposed that concentrates on distinguishing points with subtly similar features. Through extensive experiments, we achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels. Code is available at https://github.com/mellody11/LESS.

Paper Structure

This paper contains 40 sections, 8 equations, 5 figures, 10 tables.

Figures (5)

  • Figure 1: Comparison between the two-stage method and our single-stage method. (a) The two-stage method first performs instance segmentation, supervised by instance labels and then semantic labels, to obtain instance proposals, and then uses the provided query to match the most relevant proposal. (b) Our single-stage method only utilizes the binary mask of the described object for training and integrates language and vision features during feature extraction.
  • Figure 2: Overview of our LESS framework. Given a point cloud scene $P$, we use a sparse 3D feature extractor to extract multi-scale features $V_{i}$. The query $T$ is sent to a text encoder to obtain the word features $W$ and sentence features $S$. Meanwhile, we introduce a PWCA module that aligns the word features $W$ with the multi-scale point cloud features $V_{i}$. After that, an $m$-layer QMP module is adopted to decode $K$ learnable queries $Q_{0}$ based on the fused feature $F$, and outputs query embeddings $Q_{m}$ and proposal masks $M_{m}$. Finally, the QSA module aligns the query embeddings $Q_{m}$ with the sentence features $S$, i.e., it computes the similarity scores $R$ that filter the proposal masks $M_{m}$ into the final mask prediction $\hat{M}$.
  • Figure 3: Final predictions using different combinations of loss functions. The queries and input scenes are shown in columns 1 and 2. Columns 3 to 5 show the effect of gradually adding loss functions.
  • Figure 4: Three different types of labels.
  • Figure 5: Qualitative results of both the success cases and failure cases of our LESS.
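
The final step in the Figure 2 caption, where similarity scores $R$ between the query embeddings $Q_{m}$ and the sentence features $S$ filter the proposal masks $M_{m}$ into $\hat{M}$, can be illustrated with a minimal pure-Python sketch. This is one plausible reading of the caption, not the authors' implementation: the dot-product similarity, the softmax over the $K$ queries, the score-weighted sum of mask logits, the 0.5 threshold, and the function name `query_sentence_alignment` are all assumptions made for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def query_sentence_alignment(query_embeddings, sentence_feature,
                             proposal_masks, threshold=0.5):
    """Hypothetical QSA-style filtering step (assumed, not from the paper).

    query_embeddings: K lists of D floats (Q_m)
    sentence_feature: list of D floats (S)
    proposal_masks:   K lists of N per-point mask logits (M_m)
    Returns the K similarity scores (R) and an N-point boolean mask.
    """
    # Assumed similarity: dot product with S, normalised over the K queries.
    dots = [sum(q * s for q, s in zip(query, sentence_feature))
            for query in query_embeddings]
    scores = softmax(dots)

    # Assumed filtering: weight each proposal mask by its score and sum.
    num_points = len(proposal_masks[0])
    fused = [sum(scores[k] * proposal_masks[k][n] for k in range(len(scores)))
             for n in range(num_points)]

    final_mask = [sigmoid(v) > threshold for v in fused]
    return scores, final_mask


# Toy example: K = 2 queries, D = 2 feature dims, N = 3 points.
Q_m = [[1.0, 0.0], [0.0, 1.0]]
S = [4.0, 0.0]
M_m = [[5.0, -5.0, 5.0], [-5.0, 5.0, -5.0]]
scores, final_mask = query_sentence_alignment(Q_m, S, M_m)
```

In this toy case the first query is much closer to the sentence feature, so its proposal mask dominates the fused prediction. A hard selection of the single best-scoring proposal would be an equally reasonable interpretation of "filter".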