Open-Vocabulary Semantic Part Segmentation of 3D Human

Keito Suzuki, Bang Du, Girish Krishnan, Kunyao Chen, Runfa Blark Li, Truong Nguyen

TL;DR

This work tackles open-vocabulary semantic segmentation for 3D humans, a problem hampered by limited labeled data and poor generalization to unseen models. It introduces a pipeline that renders multi-view images, uses SAM for 2D mask proposals, and employs a new HumanCLIP to produce discriminative, human-centric embeddings, followed by MaskFusion to fuse view-consistent predictions into a 3D semantic mask set driven by text prompts. The approach achieves state-of-the-art performance across five 3D human datasets and is compatible with meshes, point clouds, and 3D Gaussian Splatting, while supporting promptable segmentation and robustness to in-the-wild data. Limitations include runtime bottlenecks from SAM and the need for threshold tuning, suggesting avenues for data-driven thresholding and faster mask proposal strategies in future work.

Abstract

3D part segmentation is still an open problem in the field of 3D vision and AR/VR. Due to limited 3D labeled data, traditional supervised segmentation methods fall short in generalizing to unseen shapes and categories. Recently, advances in the zero-shot abilities of vision-language models have brought a surge in open-world 3D segmentation methods. While these methods show promising results for 3D scenes or objects, they do not generalize well to 3D humans. In this paper, we present the first open-vocabulary segmentation method capable of handling 3D humans. Our framework can segment the human category into desired fine-grained parts based on the textual prompt. We design a simple segmentation pipeline, leveraging SAM to generate multi-view proposals in 2D and proposing a novel HumanCLIP model to create unified embeddings for visual and textual inputs. Compared with existing pre-trained CLIP models, the HumanCLIP model yields more accurate embeddings for human-centric content. We also design a simple yet effective MaskFusion module, which classifies and fuses multi-view features into 3D semantic masks without complex voting and grouping mechanisms. Decoupling mask proposals from the text input also significantly boosts the efficiency of per-prompt inference. Experimental results on various 3D human datasets show that our method outperforms current state-of-the-art open-vocabulary 3D segmentation methods by a large margin. In addition, we show that our method can be directly applied to various 3D representations including meshes, point clouds, and 3D Gaussian Splatting.
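The decoupling of mask proposals from the text input can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy 3-dimensional vectors, and the threshold values are all assumptions made for illustration. The idea it shows is that mask embeddings are computed once and cached, so each new set of prompts only requires cheap cosine-similarity comparisons against the text embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_masks(mask_embeddings, text_embeddings, threshold):
    """Return, per mask, the index of the best-matching prompt,
    or None when the best similarity is below the threshold.
    (Hypothetical helper; the threshold is a tunable assumption.)"""
    labels = []
    for m in mask_embeddings:
        scores = [cosine(m, t) for t in text_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        labels.append(best if scores[best] >= threshold else None)
    return labels

# Toy stand-ins for embeddings of three mask proposals (computed once, then cached).
mask_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.1], [0.5, 0.5, 0.7]]

# First prompt set: toy stand-ins for the text embeddings of "face" and "shoe".
prompts_a = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
labels_a = classify_masks(mask_embeddings, prompts_a, threshold=0.8)

# Second prompt set reuses the cached mask embeddings; only the text side changes.
prompts_b = [[0.0, 0.0, 1.0]]
labels_b = classify_masks(mask_embeddings, prompts_b, threshold=0.6)

print(labels_a)  # [0, 1, None]
print(labels_b)  # [None, None, 0]
```

In this sketch the expensive steps (rendering, mask proposal, image encoding) would all sit behind `mask_embeddings`, which is why changing the prompt does not require rerunning them.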

Paper Structure

This paper contains 27 sections, 3 equations, 16 figures, 6 tables.

Figures (16)

  • Figure 1: We propose the first open-vocabulary method for the segmentation of 3D humans. It infers 3D segmentation by rendering multi-view images and leveraging pre-trained vision-language models. The figure displays the input text prompts and the corresponding segmentation results for 3D humans from various datasets. Our method supports arbitrary queries and generates non-overlapping masks in the 3D model. See Figure \ref{Fig.promptable_seg2} and Figure \ref{Fig.visual} for more results.
  • Figure 2: Overview of the proposed framework. A 3D human model is first rendered to obtain multi-view 2D images. The images are then fed to SAM to generate class-agnostic 2D masks, which are unprojected to obtain binary 3D masks. Additionally, each image and its 2D masks are fed to the human-centric mask-based text-aligned image encoder to obtain a CLIP embedding for each mask. Simultaneously, the input class texts are fed to the text encoder to obtain the corresponding text embeddings. The 3D mask proposals, mask embeddings, and text embeddings are fed to the mask fusion module to obtain the final segmentation result.
  • Figure 3: AlphaCLIP Image Encoder.
  • Figure 4: Comparison between (a) pre-trained AlphaCLIP and (b) the proposed HumanCLIP. The plots show the cosine similarity between the embeddings of the masked regions corresponding to the face, glasses, left shoe, and right shoe and the text embeddings of those classes.
  • Figure 5: Example of mask-caption pairs generated by utilizing KOSMOS-2 and SAM.
  • ...and 11 more figures
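
The final step in the Figure 2 caption, turning classified 3D mask proposals into one non-overlapping segmentation, can be sketched as follows. This is a hypothetical stand-in and not the paper's MaskFusion module, whose exact rule is not given in this summary. It assumes each proposal carries a set of covered point indices, a class index (or None if unclassified), and a similarity score, and it gives each point the class of the highest-scoring proposal that covers it.

```python
def fuse_masks(num_points, proposals):
    """Fuse binary 3D mask proposals into one label per point.

    proposals: iterable of (point_indices, class_index, score) tuples.
    Points covered by no classified proposal keep the label -1.
    (Illustrative rule only: highest score wins on overlap.)
    """
    best_score = [0.0] * num_points
    labels = [-1] * num_points
    for point_indices, class_index, score in proposals:
        if class_index is None:
            continue  # proposal matched no prompt; ignore it
        for p in point_indices:
            if score > best_score[p]:
                best_score[p] = score
                labels[p] = class_index
    return labels

# Toy example: 6 points, proposals coming from different rendered views.
proposals = [
    ({0, 1, 2}, 0, 0.9),   # class 0, confident
    ({2, 3}, 1, 0.7),      # class 1, overlaps the first proposal on point 2
    ({3, 4}, 1, 0.8),      # class 1 again, seen from another view
    ({5}, None, 0.0),      # unclassified proposal
]
labels = fuse_masks(6, proposals)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

Because every point ends up with exactly one label, the output is non-overlapping by construction, which matches the property stated in the Figure 1 caption.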