PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation
Ardian Umam, Cheng-Kun Yang, Min-Hung Chen, Jen-Hui Chuang, Yen-Yu Lin
TL;DR
PartDistill tackles zero-shot and few-shot 3D shape part segmentation by distilling 2D knowledge from vision-language models into a 3D learner. It introduces bi-directional distillation, where 2D predictions guide a 3D encoder (forward distillation) and the resulting 3D predictions refine 2D cues (backward distillation), while back-projection and mask-aware losses handle incomplete 2D coverage. The framework supports both bounding-box and pixel-level VLMs and can incorporate generated shapes to augment knowledge sources. Across ShapeNetPart and PartNetE, PartDistill yields substantial mIoU gains over state-of-the-art zero-shot and few-shot baselines, demonstrating strong cross-modal generalization and robustness to VLM imperfections. The approach is practically useful for scalable, annotation-free 3D segmentation and can exploit synthetic data to further boost performance.
Abstract
This paper proposes a cross-modal distillation framework, PartDistill, which transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D shape part segmentation. PartDistill addresses three major challenges in this task: the lack of 3D segmentation in regions that are invisible or undetected in the 2D projections, inconsistent 2D predictions by VLMs, and the lack of knowledge accumulation across different 3D shapes. PartDistill consists of a teacher network that uses a VLM to make 2D predictions and a student network that learns from the 2D predictions while extracting geometrical features from multiple 3D shapes to carry out 3D part segmentation. A bi-directional distillation, comprising forward and backward distillation, is carried out within the framework: the former distills the 2D predictions to the student network, and the latter improves the quality of the 2D predictions, which subsequently enhances the final 3D segmentation. Moreover, PartDistill can exploit generative models that facilitate effortless 3D shape creation for generating knowledge sources to be distilled. In extensive experiments, PartDistill outperforms existing methods by substantial margins on the widely used ShapeNetPart and PartNetE datasets, with mIoU scores higher by more than 15% and 12%, respectively. The code for this work is available at https://github.com/ardianumam/PartDistill.
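
To make the forward-distillation idea more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the per-view 2D part probabilities have already been produced by the teacher and that a point-to-pixel mapping with visibility flags is available. The function names (`back_project`, `masked_distill_loss`), the array layouts, and the simple per-view averaging are all illustrative assumptions. The sketch aggregates 2D predictions onto 3D points and then computes a cross-entropy loss only over points that received 2D knowledge, so points that are invisible or undetected in every view do not contribute.

```python
import numpy as np


def back_project(view_probs, pixel_idx, visible):
    """Average per-view 2D part probabilities onto 3D points (illustrative).

    view_probs: (V, P, K) part probabilities per pixel for V views, P pixels, K parts
    pixel_idx:  (V, N) index of the pixel each of the N points projects to, per view
    visible:    (V, N) bool, True where the point is visible and covered by a 2D prediction
    Returns (teacher, mask): teacher is (N, K); mask is (N,) and marks points
    that were covered by at least one view.
    """
    num_views, num_points = pixel_idx.shape
    num_parts = view_probs.shape[-1]
    # Gather the 2D prediction at the pixel each point lands on: (V, N, K).
    gathered = view_probs[np.arange(num_views)[:, None], pixel_idx]
    weights = visible[..., None].astype(float)
    summed = (gathered * weights).sum(axis=0)   # (N, K)
    counts = visible.sum(axis=0)                # (N,)
    mask = counts > 0
    teacher = np.zeros((num_points, num_parts))
    teacher[mask] = summed[mask] / counts[mask, None]
    return teacher, mask


def masked_distill_loss(student_logits, teacher, mask):
    """Cross-entropy between teacher and student, restricted to covered points."""
    if not mask.any():
        return 0.0
    # Numerically stable log-softmax over the part dimension.
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_point = -(teacher * log_probs).sum(axis=1)
    return float(per_point[mask].mean())
```

In this sketch, a point never covered by any view has `mask` set to `False`, so the student's prediction there is shaped only by what it learns from geometrically similar covered points, which is the intuition behind handling regions with no 2D coverage. Backward distillation, in which 3D predictions are used to re-assess the 2D predictions, is not shown here.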
