MeshSegmenter: Zero-Shot Mesh Semantic Segmentation via Texture Synthesis

Ziming Zhong; Yanxu Xu; Jing Li; Jiale Xu; Zhengxin Li; Chaohui Yu; Shenghua Gao

TL;DR

MeshSegmenter addresses zero-shot 3D mesh semantic segmentation by transferring 2D multimodal segmentation capabilities to 3D via texture synthesis and multi-view fusion. It generates textures for untextured meshes with Stable Diffusion, guided by text prompts, so that the renders carry color cues that 2D models such as SAM and GroundingDINO can exploit. A multi-view Face Confidence Revoting (FCR) module aggregates results across viewpoints to enforce 3D consistency and suppress view-specific errors. On ShapeNetPart, MeshSegmenter achieves higher IoU than baselines, and its segmentations support downstream mesh editing tasks.

Abstract

We present MeshSegmenter, a simple yet effective framework for zero-shot 3D semantic segmentation. It extends the capabilities of 2D segmentation models to 3D meshes, delivering accurate 3D segmentation across diverse meshes and segment descriptions. Specifically, our model leverages the Segment Anything Model (SAM) to segment the target regions from images rendered from the 3D shape. Because texture is important for segmentation, we also use a pretrained Stable Diffusion model to generate textured images from the 3D shape, and apply SAM to segment the target regions from these textured images. Textures supplement the shape for segmentation and facilitate accurate 3D segmentation even in geometrically non-prominent areas, such as segmenting a car door within a car mesh. To obtain the 3D segments, we render 2D images from different views and perform segmentation on both textured and untextured images. Lastly, we develop a multi-view revoting scheme that integrates 2D segmentation results and confidence scores from various views onto the 3D mesh, ensuring the 3D consistency of segmentation results and eliminating inaccuracies from specific perspectives. Through these innovations, MeshSegmenter offers stable and reliable 3D segmentation results both quantitatively and qualitatively, highlighting its potential as a transformative tool in the field of 3D zero-shot segmentation. The code is available at https://github.com/zimingzhong/MeshSegmenter.
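The multi-view aggregation step described in the abstract can be illustrated with a small sketch. This is not the authors' FCR implementation; it is a generic confidence-weighted vote, assuming each view supplies a per-pixel face-index map from the renderer, a binary 2D mask, and one scalar confidence. The function name `aggregate_face_votes`, its parameters, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import numpy as np

def aggregate_face_votes(face_ids_per_view, masks_per_view, conf_per_view,
                         num_faces, threshold=0.5):
    """Confidence-weighted multi-view vote per mesh face (illustrative only).

    face_ids_per_view: list of (H, W) int arrays; the face visible at each
                       pixel, or -1 for background (assumed renderer output).
    masks_per_view:    list of (H, W) bool arrays; the 2D segmentation mask.
    conf_per_view:     list of floats; one confidence score per view.
    """
    score = np.zeros(num_faces)   # sum of conf * (masked fraction of the face)
    weight = np.zeros(num_faces)  # sum of conf over views that see the face
    for face_ids, mask, conf in zip(face_ids_per_view, masks_per_view, conf_per_view):
        visible = face_ids >= 0
        ids = face_ids[visible]
        hits = mask[visible].astype(float)
        # Pixels covered by each face, and how many of them fall inside the mask.
        pixels = np.bincount(ids, minlength=num_faces)
        inside = np.bincount(ids, weights=hits, minlength=num_faces)
        seen = pixels > 0
        frac = np.zeros(num_faces)
        frac[seen] = inside[seen] / pixels[seen]
        score += conf * frac
        weight += conf * seen
    # Faces never seen from any view keep a score of 0.
    face_score = np.zeros(num_faces)
    ok = weight > 0
    face_score[ok] = score[ok] / weight[ok]
    return face_score, face_score > threshold
```

In this sketch a face selected in a high-confidence view outweighs a contradicting low-confidence view, which is the intuition behind suppressing errors that appear from only a few perspectives.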

Paper Structure (14 sections, 5 equations, 12 figures, 3 tables)

Introduction
Related Work
Method
Overview
Text-Guided Texture Synthesis
2D zero-shot semantic segmentation
Face Confidence Revoting (FCR)
Experiment
Implementation Details
Zero-shot Mesh Semantic Segmentation
Application of MeshSegmenter
Ablation Studies
Limitation
Conclusion

Figures (12)

Figure 1: MeshSegmenter performs zero-shot mesh semantic segmentation through texture synthesis. It can accurately segment the text-specified region by aggregating multi-view 2D segmentation results from single and multiple queries.
Figure 2: Overview of the proposed pipeline. The Stable Diffusion (SD) model can generate high-quality textures under the guidance of textual prompts. Textured and untextured meshes are rendered from a fixed perspective. The rendered images are processed by GroundingDINO [liu2023grounding] and SAM [kirillov2023segment], which detect bounding boxes with corresponding confidence scores and segment the specific regions guided by these boxes, respectively. Finally, we employ face confidence revoting (FCR) to aggregate the detection and segmentation results of the textured and untextured meshes from multiple viewpoints and revote 3D-aware scores for the triangles.
Figure 3: Performance of GroundingDINO [liu2023grounding] and SAM [kirillov2023segment] on $M_\text{Texture}$ and $M_\text{unTextured}$. Large-scale pre-trained models are typically trained on images rich in texture, so there is a domain gap when they are applied to untextured meshes. GroundingDINO [liu2023grounding] and SAM [kirillov2023segment] are given textual prompts and bounding boxes for the detection and segmentation tasks, respectively, and perform significantly better on textured images than on untextured images.
Figure 4: Qualitative results for a single query: MeshSegmenter performs efficient zero-shot mesh semantic segmentation across diverse meshes.
Figure 5: Qualitative results for multiple queries: MeshSegmenter performs accurate zero-shot mesh semantic segmentation without relying on competition between the segmentations of different queries. For instance, in the first column of MeshSegmenter, "chest" is not mistakenly segmented as "clothes". In the third column of MeshSegmenter, the "buttocks" are not mistakenly segmented as "leg" or "torso".
...and 7 more figures
