Open-Vocabulary Semantic Part Segmentation of 3D Human

Keito Suzuki; Bang Du; Girish Krishnan; Kunyao Chen; Runfa Blark Li; Truong Nguyen

TL;DR

This work tackles open-vocabulary semantic part segmentation for 3D humans, a problem hampered by limited labeled data and poor generalization to unseen shapes and categories. It introduces a pipeline that renders multi-view images, uses SAM for 2D mask proposals, and employs a new HumanCLIP model to produce discriminative, human-centric embeddings. A MaskFusion module then fuses view-consistent predictions into a set of 3D semantic masks driven by text prompts. The approach achieves state-of-the-art performance across five 3D human datasets and is compatible with meshes, point clouds, and 3D Gaussian Splatting, while supporting promptable segmentation and remaining robust to in-the-wild data. Limitations include the runtime bottleneck introduced by SAM and the need for threshold tuning, which suggests data-driven thresholding and faster mask proposal strategies as directions for future work.

Abstract

3D part segmentation is still an open problem in 3D vision and AR/VR. Because labeled 3D data is limited, traditional supervised segmentation methods generalize poorly to unseen shapes and categories. Recently, advances in the zero-shot abilities of vision-language models have led to a surge of open-world 3D segmentation methods. While these methods show promising results for 3D scenes or objects, they do not generalize well to 3D humans. In this paper, we present the first open-vocabulary segmentation method capable of handling 3D humans. Our framework can segment the human category into the desired fine-grained parts based on a textual prompt. We design a simple segmentation pipeline that leverages SAM to generate multi-view proposals in 2D, and we propose a novel HumanCLIP model to create unified embeddings for visual and textual inputs. Compared with existing pre-trained CLIP models, HumanCLIP yields more accurate embeddings for human-centric content. We also design a simple yet effective MaskFusion module, which classifies and fuses multi-view features into 3D semantic masks without complex voting and grouping mechanisms. Decoupling mask proposals from the text input also significantly boosts the efficiency of per-prompt inference. Experimental results on various 3D human datasets show that our method outperforms current state-of-the-art open-vocabulary 3D segmentation methods by a large margin. In addition, we show that our method can be directly applied to various 3D representations, including meshes, point clouds, and 3D Gaussian Splatting.
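To make the idea of decoupling mask proposals from text prompts concrete, the following is a minimal sketch and not the paper's MaskFusion module. It assumes each 2D mask proposal has already been given an embedding and lifted to the set of 3D point indices it covers, and it classifies each mask by cosine similarity to the prompt embeddings before accumulating scores per point. The function names (`cosine`, `fuse_masks`), the score-accumulation rule, the `threshold` parameter, and the toy 2D embeddings are all illustrative assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# One lifted proposal: (mask embedding, indices of the 3D points it covers).
LiftedMask = Tuple[Vector, Sequence[int]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def fuse_masks(
    num_points: int,
    masks: List[LiftedMask],
    text_embeddings: List[Vector],
    threshold: float = 0.5,
) -> List[int]:
    """Assign one prompt index per 3D point, or -1 if no mask supports it.

    Hypothetical stand-in for a fusion step: each mask is classified by its
    most similar prompt, and that similarity is added to every point the
    mask covers. Masks whose best similarity is below `threshold` are skipped.
    """
    if not text_embeddings:
        raise ValueError("at least one text embedding is required")
    num_labels = len(text_embeddings)
    scores = [[0.0] * num_labels for _ in range(num_points)]
    for embedding, point_ids in masks:
        sims = [cosine(embedding, t) for t in text_embeddings]
        best = max(range(num_labels), key=lambda k: sims[k])
        if sims[best] < threshold:
            continue
        for p in point_ids:
            scores[p][best] += sims[best]
    labels = []
    for row in scores:
        top = max(row)
        labels.append(row.index(top) if top > 0.0 else -1)
    return labels


if __name__ == "__main__":
    # Toy data: four 3D points, three lifted masks from two views.
    masks = [([0.9, 0.1], [0, 1]), ([0.1, 0.9], [2]), ([1.0, 0.0], [1])]
    # Hypothetical prompt embeddings for two part names.
    print(fuse_masks(4, masks, [[1.0, 0.0], [0.0, 1.0]]))  # [0, 0, 1, -1]
```

In this sketch the `masks` list is computed once and can be reused for any new set of prompts, since only the cheap similarity and accumulation steps depend on the text. That mirrors, in a simplified form, why separating proposals from text input can make per-prompt inference inexpensive.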
