Segment Any 3D Object with Language

Seungjun Lee; Yuyang Zhao; Gim Hee Lee

TL;DR

We address open-vocabulary 3D instance segmentation (OV-3DIS) with free-form language by training a 3D segmentation network that directly predicts semantic-related masks from point clouds. SOLE combines a multimodal fusion network, which projects 2D CLIP features into 3D space and uses a Cross Modality Decoder to fuse language in decoding, with three multimodal associations (Mask-Visual, Mask-Caption, Mask-Entity) that align segmentation with language cues, including DeCap-generated captions and noun-phrase extraction. The training objective combines mask and semantic losses via Hungarian matching, with a representative loss $\mathcal{L} = \frac{1}{N_m} \sum_{j}^{N_m} \left( \lambda_{MMA} L_{MMA}^j + \lambda_{dice} L_{dice}(\hat{m}_{\sigma(j)}, m_j) + \lambda_{BCE} L_{BCE}(\hat{m}_{\sigma(j)}, m_j) \right)$, and inference uses a soft geometric mean fusion $p = \max(p(\mathbf{f}^m), p(\mathbf{f}^p))^{\tau} \cdot \min(p(\mathbf{f}^m), p(\mathbf{f}^p))^{1-\tau}$ with $\tau = 0.667$. Results on ScanNetv2, ScanNet200, and Replica show SOLE achieves state-of-the-art OV-3DIS performance, closely approaching fully supervised baselines and demonstrating robust responses to diverse language instructions.
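The soft geometric mean fusion above can be illustrated with a minimal Python sketch. It assumes that $p(\mathbf{f}^m)$ and $p(\mathbf{f}^p)$ are per-class probabilities computed from the predicted mask feature and from the pooled per-point CLIP feature respectively (a reading based on the Figure 2 caption, not a confirmed detail). The function name `soft_geometric_mean` and the list-based inputs are hypothetical choices for illustration only.

```python
def soft_geometric_mean(p_mask, p_point, tau=0.667):
    """Fuse two per-class probability lists as max^tau * min^(1 - tau).

    p_mask  : probabilities assumed to come from the predicted mask feature
    p_point : probabilities assumed to come from pooled per-point CLIP features
    tau     : weight on the larger probability (0.667 in the TL;DR)
    """
    if len(p_mask) != len(p_point):
        raise ValueError("both inputs must cover the same classes")
    fused = []
    for a, b in zip(p_mask, p_point):
        hi, lo = max(a, b), min(a, b)
        # The larger probability gets exponent tau, the smaller 1 - tau.
        fused.append((hi ** tau) * (lo ** (1.0 - tau)))
    return fused


# Example: three candidate classes scored by the two branches.
scores = soft_geometric_mean([0.7, 0.2, 0.1], [0.5, 0.4, 0.1])
best_class = max(range(len(scores)), key=scores.__getitem__)
```

With $\tau > 0.5$ the fused score leans toward whichever branch is more confident, while a low score from the other branch still pulls the result down. Setting $\tau = 0.5$ recovers the ordinary geometric mean.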

Abstract

In this paper, we investigate Open-Vocabulary 3D Instance Segmentation (OV-3DIS) with free-form language instructions. Earlier works that rely on only annotated base categories for training suffer from limited generalization to unseen novel categories. Recent works mitigate poor generalizability to novel categories by generating class-agnostic masks or projecting generalized masks from 2D to 3D, but disregard semantic or geometry information, leading to sub-optimal performance. Instead, generating generalizable but semantic-related masks directly from 3D point clouds would result in superior outcomes. In this paper, we introduce Segment any 3D Object with LanguagE (SOLE), which is a semantic and geometric-aware visual-language learning framework with strong generalizability by generating semantic-related masks directly from 3D point clouds. Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder. In addition, to align the 3D segmentation model with various language instructions and enhance the mask quality, we introduce three types of multimodal associations as supervision. Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks, and the results are even close to the fully-supervised counterpart despite the absence of class annotations in the training. Furthermore, extensive qualitative results demonstrate the versatility of our SOLE to language instructions.

Paper Structure (18 sections, 7 equations, 7 figures, 8 tables)

Introduction
Related Work
Method
Backbone Feature Ensemble
Cross Modality Decoder (CMD)
Vision-Language Learning
Training and Inference
Experiments
Experimental Setting
Comparison with Previous Methods
Ablation Studies and Analysis
Conclusion
Implementation Details
3D Visual Grounding
Analysis of CLIP Visual Feature
...and 3 more sections

Figures (7)

Figure 1: Qualitative results when querying SOLE with various language instructions. SOLE is highly generalizable and can segment corresponding instances with various language instructions, including but not limited to (a) visual questions, (b) attributes description, and (c) functional description.
Figure 2: Overall framework of SOLE. SOLE is built on transformer-based instance segmentation model with multimodal adaptations. For model architecture, backbone features are integrated with per-point CLIP features and subsequently fed into the cross-modality decoder (CMD). CMD aggregates the point-wise features and textual features into the instance queries, finally segmenting the instances, which are supervised by multimodal associations. During inference, predicted mask features are combined with the per-point CLIP features, enhancing the open-vocabulary performance.
Figure 3: Three types of multimodal association instance. For each ground truth instance mask, we first pool the per-point CLIP features to obtain the Mask-Visual Association $\mathbf{f}^\text{MVA}$. Subsequently, $\mathbf{f}^\text{MVA}$ is fed into a CLIP-space captioning model to generate a caption and the corresponding textual feature $\mathbf{f}^\text{MCA}$ for each mask, termed the Mask-Caption Association. Finally, noun phrases are extracted from the mask caption and their embeddings are aggregated via multimodal attention to get the Mask-Entity Association $\mathbf{f}^\text{MEA}$. The three multimodal associations are used for supervising SOLE to acquire the ability to segment 3D objects with free-form language instructions.
Figure 4: Qualitative results from SOLE. Our SOLE demonstrates open-vocabulary capability by effectively responding to free-form language queries, including visual questions, attributes description and functional description.
Figure 5: Qualitative analysis on multimodal associations. Given the free-form language instruction, "I wanna see outside.", SOLE trained only with $\mathbf{f}^{\mathrm{MEA}}$ captures the wrong object (a), whereas it segments the related object when $\mathbf{f}^{\mathrm{MVA}}$ and $\mathbf{f}^{\mathrm{MCA}}$ are additionally given as the supervision (b, c).
...and 2 more figures
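
The first step in the Figure 3 caption, pooling per-point CLIP features inside a ground truth mask to obtain the Mask-Visual Association feature, can be sketched as follows. The caption only says the features are pooled, so average pooling is an assumption here. The function name `pool_mask_feature` and the plain-list data layout are likewise hypothetical.

```python
def pool_mask_feature(point_features, mask):
    """Average the feature vectors of the points selected by a binary mask.

    point_features : list of per-point feature vectors (lists of floats)
    mask           : list of booleans, True where the point belongs to the instance
    Average pooling is an assumption; the caption only states "pool".
    """
    if len(point_features) != len(mask):
        raise ValueError("need exactly one mask entry per point")
    selected = [f for f, inside in zip(point_features, mask) if inside]
    if not selected:
        raise ValueError("mask selects no points")
    dim = len(selected[0])
    # Mean over the selected points, one feature dimension at a time.
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]


# Example: three points with 2-D features; the mask keeps the first two.
f_mva = pool_mask_feature(
    [[1.0, 0.0], [3.0, 2.0], [10.0, 10.0]],
    [True, True, False],
)
```

In practice the per-point features would be high-dimensional CLIP embeddings, and the pooled vector would likely be normalized before being compared with text embeddings. Both details are left out of this sketch.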
