ZeroPS: High-quality Cross-modal Knowledge Transfer for Zero-Shot 3D Part Segmentation

Yuheng Xue; Nenglun Chen; Jun Liu; Wenyun Sun

TL;DR

ZeroPS tackles zero-shot 3D part segmentation by transferring knowledge from 2D foundation models (SAM and GLIP) to 3D point clouds through a training-free, multi-view framework. It introduces self-extension to lift 2D SAM segments into 3D, a merging step to create coherent 3D parts, and CNVP/TDCM-based multi-model labeling to assign instance labels without training. Across PartNetE and AKBSeg, ZeroPS achieves substantial improvements over state-of-the-art zero-shot methods and narrows the gap to fully supervised approaches, while maintaining robustness to domain shifts. The approach offers a practical, scalable pathway for zero-shot 3D segmentation in real-world settings with minimal model modification.

Abstract

Zero-shot 3D part segmentation is a challenging and fundamental task. In this work, we propose a novel pipeline, ZeroPS, which achieves high-quality knowledge transfer from 2D pretrained foundation models (FMs), SAM and GLIP, to 3D object point clouds. We explore the natural relationship between multi-view correspondence and the FMs' prompt mechanisms and build bridges on it. In ZeroPS, the relationship manifests as follows: 1) lifting 2D to 3D by leveraging co-viewed regions and SAM's prompt mechanism, 2) relating 1D classes to 3D parts by leveraging 2D-3D view projection and GLIP's prompt mechanism, and 3) enhancing prediction performance by leveraging multi-view observations. Extensive evaluations on the PartNetE and AKBSeg benchmarks demonstrate that ZeroPS significantly outperforms the SOTA method on both the zero-shot unlabeled and instance segmentation tasks. ZeroPS requires no additional training or fine-tuning of the FMs, applies to both simulated and real-world data, and is hardly affected by domain shift. The project page is available at https://luis2088.github.io/ZeroPS_page.
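To make points 2) and 3) of the abstract more concrete, the following is a minimal sketch of how per-view 2D detections could be aggregated into one label per 3D part by voting. It is not the paper's labeling procedure: the function `label_parts_by_vote`, the `min_overlap` threshold, and the representation of a detection as a class name plus the set of 3D point indices that project inside its 2D box are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

# Assumed representation: (class label, indices of the 3D points that
# project inside the detected 2D box in one view).
Detection = Tuple[str, Set[int]]


def label_parts_by_vote(
    parts: List[Set[int]],
    detections_per_view: List[List[Detection]],
    min_overlap: float = 0.5,
) -> Dict[int, Optional[str]]:
    """Assign each 3D part the class that collects the most vote weight."""
    labels: Dict[int, Optional[str]] = {}
    for part_id, part in enumerate(parts):
        votes: Dict[str, float] = defaultdict(float)
        for detections in detections_per_view:
            for label, points in detections:
                if not points:
                    continue
                # Fraction of the detection that falls on this part.
                overlap = len(part & points) / len(points)
                if overlap >= min_overlap:
                    votes[label] += overlap
        # A part that no view voted for stays unlabeled (None).
        labels[part_id] = max(votes, key=votes.get) if votes else None
    return labels


# Toy data: two parts and two views (all values are illustrative).
parts = [{0, 1, 2}, {3, 4, 5}]
detections_per_view = [
    [("handle", {0, 1}), ("lid", {3})],
    [("handle", {1, 2}), ("lid", {4, 5}), ("handle", {2, 3, 4})],
]
result = label_parts_by_vote(parts, detections_per_view)
print(result)  # {0: 'handle', 1: 'lid'}
```

In this toy run, the spurious "handle" detection in the second view overlaps the second part by two thirds, but the two consistent "lid" votes outweigh it, which is the kind of robustness multi-view observations are meant to provide.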

Paper Structure (27 sections, 10 equations, 9 figures, 16 tables, 1 algorithm)

Introduction
Related Work
Supervised 3D Segmentation
2D Foundation Models (FMs)
Proposed Method: ZeroPS
Overview
Multi-view Correspondence
Self-extension
Merging 3D Groups
Multi-model Labeling
Experiments
Benchmark and Metric
Implementation Details
Comparison with Existing Methods
Ablation Study
...and 12 more sections

Figures (9)

Figure 1: Overview of the proposed pipeline ZeroPS. First, in the unlabeled segmentation phase, the input 3D object is segmented into unlabeled parts. The self-extension component (see Figure 2) can extend 2D segmentation from a single viewpoint to 3D segmentation (3D groups) by using a predefined extension sequence starting from that viewpoint. For example, the red cue on the left side of the figure illustrates this process. Second, in the instance segmentation phase, given a text prompt, the multi-modal labeling component (see Figure 3) assigns an instance label to each 3D unlabeled part.
Figure 2: The overall structure of self-extension (top subfigure). Given an extension sequence $S_i = [V_i, V_{i_1}, V_{i_2}, \ldots, V_{i_j}, \ldots, V_{i_{k-1}}]$, self-extension obtains 2D groups from the starting viewpoint $V_i$ and extends each group from 2D to 3D using the remaining viewpoints. Specifically, for the starting viewpoint $V_i$, self-extension uses 3D key points to guide SAM to segment the 2D image. Because the segmented 2D groups originate from 2D segmentation results, self-extension repeatedly extends them into 3D segmentation results (3D groups) by SVE (Single Viewpoint Extension). During this extension, the remaining viewpoints in $S_i$, $[V_{i_1}, V_{i_2}, \ldots, V_{i_j}, \ldots, V_{i_{k-1}}]$, are iterated. At each iteration, SVE takes the current viewpoint and each group as input and extends that group. As an example, the bottom right subfigure details how SVE extends a single group.
Figure 3: The overall structure of multi-modal labeling.
Figure 4: Qualitative comparison on zero-shot instance segmentation (zoom in for details). Left: PartNetE's simulated data. Right: AKBSeg's real-world data. The red dashed boxes indicate that our method produces more accurate 3D segmentation boundaries compared to the SOTA method, PartSLIP.
Figure 5: Ablation study on self-extension under the 'Extending' and 'Without Extending' settings. The Average IoU is the overall result on PartNetE. See the Ablation Study section for details.
...and 4 more figures
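The iteration described in the Figure 2 caption can be sketched as a loop over the extension sequence. This is a simplified illustration under stated assumptions, not the authors' implementation: each viewpoint is reduced to the set of 3D point indices visible from it, a group is a set of point indices, and `toy_segmenter` is a hypothetical stand-in for prompting SAM with the co-viewed points. The names `single_viewpoint_extension`, `self_extension`, `VISIBLE`, and `PART_OF` are invented for this example.

```python
from collections import Counter
from typing import Callable, Dict, List, Set

# Hypothetical promptable segmenter: (view id, prompt point indices) ->
# indices of the visible 3D points covered by the predicted mask.
Segmenter = Callable[[int, Set[int]], Set[int]]

# Toy scene (illustrative): six points, two ground-truth parts, three views.
PART_OF: Dict[int, str] = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
VISIBLE: Dict[int, Set[int]] = {0: {0, 1, 3}, 1: {1, 2, 3, 4}, 2: {2, 4, 5}}


def toy_segmenter(view_id: int, prompts: Set[int]) -> Set[int]:
    """Stand-in for SAM: return the visible points of the prompted part."""
    target = Counter(PART_OF[p] for p in prompts).most_common(1)[0][0]
    return {p for p in VISIBLE[view_id] if PART_OF[p] == target}


def single_viewpoint_extension(
    group: Set[int], view_id: int, visible: Dict[int, Set[int]], segment: Segmenter
) -> Set[int]:
    """Extend one group using the points it shares with the current view."""
    co_viewed = group & visible[view_id]
    if not co_viewed:
        return group  # nothing to prompt with, leave the group unchanged
    mask = segment(view_id, co_viewed)
    return group | (mask & visible[view_id])


def self_extension(
    start_groups: List[Set[int]],
    sequence: List[int],
    visible: Dict[int, Set[int]],
    segment: Segmenter,
) -> List[Set[int]]:
    """Grow the starting groups by visiting the remaining viewpoints in order."""
    groups = [set(g) for g in start_groups]
    for view_id in sequence:
        groups = [
            single_viewpoint_extension(g, view_id, visible, segment) for g in groups
        ]
    return groups


# Groups segmented in the starting view 0, then extended through views 1 and 2.
groups = self_extension([{0, 1}, {3}], [1, 2], VISIBLE, toy_segmenter)
print(groups)  # [{0, 1, 2}, {3, 4, 5}]
```

The order of the extension sequence matters in this sketch: a group can only grow in a view that shares at least one point with it, so consecutive viewpoints need overlapping visible regions.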
