PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation

Ardian Umam; Cheng-Kun Yang; Min-Hung Chen; Jen-Hui Chuang; Yen-Yu Lin

TL;DR

PartDistill tackles zero-shot and few-shot 3D shape part segmentation by distilling 2D knowledge from vision-language models (VLMs) into a 3D learner. It introduces bi-directional distillation, where 2D predictions guide a 3D encoder (forward distillation) and the resulting 3D predictions refine the 2D cues (backward distillation), while back-projection and mask-aware losses handle incomplete 2D coverage. The framework supports both bounding-box and pixel-level VLMs and can incorporate generated shapes as additional knowledge sources. On ShapeNetPart and PartNetE, PartDistill yields substantial mIoU gains over state-of-the-art zero-shot and few-shot baselines, demonstrating strong cross-modal generalization and robustness to VLM imperfections. The approach is practical for scalable, annotation-free 3D part segmentation and can exploit synthetic data to further boost performance.
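To make the idea of a mask-aware loss concrete, here is a minimal sketch, assuming a confidence-weighted cross-entropy that is evaluated only on points covered by a back-projected 2D prediction. The function name `masked_distillation_loss` and all of its arguments are hypothetical and chosen for illustration; the paper's actual loss may differ in form.

```python
import math


def masked_distillation_loss(student_probs, teacher_labels, mask, confidence):
    """Confidence-weighted cross-entropy over covered points only (illustrative).

    student_probs:  per-point lists of part probabilities from the 3D student
    teacher_labels: per-point part index back-projected from one 2D prediction
    mask:           per-point 0/1 flag, 1 if the point is covered by that prediction
    confidence:     scalar confidence score of the 2D prediction
    """
    total, covered = 0.0, 0
    for probs, label, m in zip(student_probs, teacher_labels, mask):
        if not m:
            continue  # occluded or undetected point: it receives no supervision
        total += -math.log(max(probs[label], 1e-12))  # clamp to avoid log(0)
        covered += 1
    if covered == 0:
        return 0.0  # this prediction covers no point, so it contributes nothing
    return confidence * total / covered


# Usage: three points, two parts; the second point is not covered.
loss = masked_distillation_loss(
    student_probs=[[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]],
    teacher_labels=[0, 0, 1],
    mask=[1, 0, 1],
    confidence=0.5,
)
```

Because uncovered points are skipped rather than treated as a background class, the student is free to label them from geometry learned on other views and other shapes.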

Abstract

This paper proposes a cross-modal distillation framework, PartDistill, which transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D shape part segmentation. PartDistill addresses three major challenges in this task: the lack of 3D segmentation in regions that are invisible or undetected in the 2D projections, inconsistent 2D predictions by VLMs, and the lack of knowledge accumulation across different 3D shapes. PartDistill consists of a teacher network that uses a VLM to make 2D predictions and a student network that learns from the 2D predictions while extracting geometrical features from multiple 3D shapes to carry out 3D part segmentation. Bi-directional distillation, comprising forward and backward distillation, is carried out within the framework: the former distills the 2D predictions to the student network, and the latter improves the quality of the 2D predictions, which subsequently enhances the final 3D segmentation. Moreover, PartDistill can exploit generative models that facilitate effortless 3D shape creation for generating knowledge sources to be distilled. In extensive experiments, PartDistill outperforms existing methods by substantial margins on the widely used ShapeNetPart and PartNetE datasets, with mIoU scores more than 15% and 12% higher, respectively. The code for this work is available at https://github.com/ardianumam/PartDistill.
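The mIoU figures quoted above are averages of per-part intersection-over-union. As a minimal sketch, assuming per-point integer part labels and the convention that a part absent from both prediction and ground truth scores 1.0 (conventions vary between benchmarks), the per-shape value can be computed as follows; `shape_miou` is a hypothetical helper name, not a function from the released code.

```python
def shape_miou(pred, gt, num_parts):
    """Mean IoU over the parts of one shape, from per-point part labels."""
    ious = []
    for part in range(num_parts):
        inter = sum(1 for p, g in zip(pred, gt) if p == part and g == part)
        union = sum(1 for p, g in zip(pred, gt) if p == part or g == part)
        if union == 0:
            ious.append(1.0)  # assumed convention: absent part counts as perfect
        else:
            ious.append(inter / union)
    return sum(ious) / num_parts


# Usage: four points, two parts; one point of part 1 is mislabeled as part 0.
miou = shape_miou(pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_parts=2)
```

Dataset-level scores are then obtained by averaging these per-shape values, typically per category first.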

Paper Structure (30 sections, 8 equations, 8 figures, 7 tables)

Introduction
Related Work
Vision-language models.
3D part segmentation using vision-language models.
2D to 3D distillation.
Proposed Method
Overview
Forward distillation: 2D to 3D
Distillation loss.
Backward distillation: 3D to 2D
Test-time alignment
Implementation Details
Experiments
Dataset and evaluation metric
Zero-shot segmentation
...and 15 more sections

Figures (8)

Figure 1: We present a distillation method that carries out zero-shot 3D shape part segmentation with a 2D vision-language model. After projecting an input 3D point cloud into multi-view 2D images, the 2D teacher (2D-T) and the 3D student (3D-S) networks are applied to the 2D images and the 3D point cloud, respectively. Instead of direct transfer, our method carries out bi-directional distillation, comprising forward and backward distillation, and yields better 3D part segmentation than the existing method.
Figure 2: Overview of the proposed method. (a) The overall pipeline where the knowledge extracted from a vision-language model (VLM) is distilled to carry out 3D shape part segmentation by teaching a 3D student network. Within the pipeline, backward distillation is introduced to re-score the teacher's knowledge based on its quality and subsequently improve the final 3D part prediction. (b) & (c) Knowledge is extracted by back-projection when we adopt (b) a bounding-box VLM (B-VLM) or (c) a pixel-wise VLM (P-VLM), where $\Gamma$ and $\mathbb{C}$ denote 2D-to-3D back-projection and connected component labeling [connectedcomp_2019], respectively.
Figure 3: Given the VLM output of view $v$, $B^v$ or $S^v$, we display the confidence scores before ($C$) and after ($C_{bd}$) performing backward distillation via Eq. \ref{eq:knowledge_refinement}, with $Y$ and $M$ obtained via Eq. \ref{eq:backproject}. With backward distillation, inaccurate VLM predictions have lower scores, such as the arm box in B-VLM with the score reduced from 0.7 to 0.1, and vice versa.
Figure 4: Visualization of the zero-shot segmentation results, drawn in different colors, on the ShapeNetPart dataset. We render PartSLIP results on the ShapeNetPart data to have the same visualization of shape inputs. While occluded and undetected regions (issue $\boldsymbol{\mathcal{I}_1}$) are shown with black and gray colors, respectively, the blue and red arrows highlight several cases of issues $\boldsymbol{\mathcal{I}_2}$ and $\boldsymbol{\mathcal{I}_3}$.
Figure 5: Ablation study on the number of views and various shape types for 2D multi-view rendering on the ShapeNetPart dataset.
...and 3 more figures
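
The Figure 3 caption describes backward distillation as re-scoring each VLM prediction so that inaccurate ones receive lower confidence. One plausible way to realize this, sketched here purely as an assumption, is to replace the VLM confidence with the fraction of covered points whose student-predicted part agrees with the part named by the 2D prediction. The function `rescore_confidence` and its arguments are hypothetical; the paper's knowledge-refinement equation is not reproduced in this section and may be defined differently.

```python
def rescore_confidence(student_labels, teacher_label, mask):
    """Agreement ratio between the 3D student and one 2D prediction (illustrative).

    student_labels: per-point part index predicted by the 3D student
    teacher_label:  part index assigned by the 2D prediction (box or mask)
    mask:           per-point 0/1 flag, 1 if the point is covered by that prediction
    """
    covered = [s for s, m in zip(student_labels, mask) if m]
    if not covered:
        return 0.0  # nothing is covered, so there is no evidence for this prediction
    agree = sum(1 for s in covered if s == teacher_label)
    return agree / len(covered)


# Usage: a prediction for part 0 covers four points, but the student
# assigns only two of them to part 0, so its score is halved.
c_bd = rescore_confidence(student_labels=[0, 0, 1, 1, 1], teacher_label=0, mask=[1, 1, 1, 1, 0])
```

Under this rule, a prediction that the student largely contradicts is down-weighted in later distillation rounds, while one the student confirms keeps a high score.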
