FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation

Hongrui Wu, Zhicheng Gao, Jin Cao, Kelu Yao, Wen Shen, Zhihua Wei

TL;DR

FOLK tackles open-vocabulary 3D instance segmentation by distilling CLIP knowledge from 2D images into a 3D model, so that instances can be classified directly from point clouds without the noise introduced by 2D occlusions. It comprises a teacher that produces high-quality multi-view 2D CLIP embeddings with mask-guided pooling, a student built on a 3D VL-adapter that outputs 3D embeddings, and a label-guided distillation pipeline that aligns the 3D embeddings with the teacher's open-vocabulary space using a contrastive loss and CLIP-based label supervision. The approach achieves state-of-the-art results on ScanNet200 (AP$_{50}$ of $35.7$) while running roughly $6.0\times$ to $152.2\times$ faster at inference than previous methods. Because the distilled student classifies instances directly in 3D, inference needs no per-view 2D processing, which makes it both faster and robust to occlusion.
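To make "direct classification from point clouds" concrete, the following is a minimal sketch of how an instance embedding in CLIP space could be matched against text embeddings of candidate labels by cosine similarity. It is an illustration under stated assumptions, not the authors' implementation: the function names (`l2_normalize`, `classify_instance`) are hypothetical, and the 3-dimensional toy vectors merely stand in for real CLIP embeddings.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def classify_instance(instance_embedding, text_embeddings):
    """Return the best-matching label and the cosine score of every label.

    `text_embeddings` maps a label name to its embedding (assumed to live
    in the same space as `instance_embedding`).
    """
    query = l2_normalize(instance_embedding)
    scores = {}
    for label, embedding in text_embeddings.items():
        target = l2_normalize(embedding)
        scores[label] = sum(q * t for q, t in zip(query, target))
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy vectors standing in for CLIP text embeddings (hypothetical values).
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
# Toy vector standing in for a student-produced 3D instance embedding.
label, scores = classify_instance([0.9, 0.1, 0.2], text_embeddings)
print(label)
```

Because the label set is just a dictionary of text embeddings, new categories can be added at query time without retraining, which is the sense in which such a classifier is open-vocabulary.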

Abstract

Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images and then employ vision-language models (VLMs) for classification. However, this mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs, which slows inference. To address these problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and to distill its open-vocabulary knowledge into a 3D student model. During inference, the distilled 3D model can then classify instances directly from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating inference. Specifically, we first design a teacher model that generates a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity; this embedding serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm that distills open-vocabulary knowledge from label-consistent 2D embeddings into the student model. We evaluate FOLK on the ScanNet200 and Replica datasets. It achieves state-of-the-art performance on ScanNet200 with an AP$_{50}$ score of 35.7 while running approximately $6.0\times$ to $152.2\times$ faster than previous methods. The code will be released after the paper is accepted.
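The abstract does not give the distillation objective in closed form, so the sketch below only illustrates one plausible reading of "label-guided distillation from label-consistent 2D embeddings". It assumes that a teacher embedding counts as label-consistent when its nearest text embedding matches the annotated label, and that the remaining student/teacher pairs are aligned with an InfoNCE-style contrastive loss. The function names, the consistency rule, and the temperature of 0.07 are all assumptions made for illustration.

```python
import math


def normalize(vector):
    """Return the unit-length version of a non-zero vector."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def is_label_consistent(teacher_embedding, label, text_embeddings):
    """Assumed rule: keep a teacher embedding only if its nearest text label is `label`."""
    teacher = normalize(teacher_embedding)
    predicted = max(
        text_embeddings,
        key=lambda name: dot(teacher, normalize(text_embeddings[name])),
    )
    return predicted == label


def info_nce(student_embeddings, teacher_embeddings, temperature=0.07):
    """Mean InfoNCE loss; the i-th teacher embedding is the positive for the i-th student."""
    students = [normalize(v) for v in student_embeddings]
    teachers = [normalize(v) for v in teacher_embeddings]
    total = 0.0
    for i, student in enumerate(students):
        logits = [dot(student, teacher) / temperature for teacher in teachers]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(students)


def distillation_loss(student_embeddings, teacher_embeddings, labels, text_embeddings):
    """Filter to label-consistent pairs, then apply the contrastive loss to them."""
    kept = [
        i
        for i, (teacher, label) in enumerate(zip(teacher_embeddings, labels))
        if is_label_consistent(teacher, label, text_embeddings)
    ]
    if not kept:
        return 0.0, kept
    loss = info_nce(
        [student_embeddings[i] for i in kept],
        [teacher_embeddings[i] for i in kept],
    )
    return loss, kept


# Toy 2-dimensional data (hypothetical values, not from the paper).
text_embeddings = {"chair": [1.0, 0.0], "table": [0.0, 1.0]}
teacher_embeddings = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
labels = ["chair", "table", "table"]  # the third teacher embedding disagrees with its label
aligned_students = [[0.9, 0.1], [0.2, 0.8], [0.0, 1.0]]
swapped_students = [[0.2, 0.8], [0.9, 0.1], [0.0, 1.0]]

loss_aligned, kept = distillation_loss(aligned_students, teacher_embeddings, labels, text_embeddings)
loss_swapped, _ = distillation_loss(swapped_students, teacher_embeddings, labels, text_embeddings)
print(kept, loss_aligned < loss_swapped)
```

In this toy setup the third instance is excluded because its teacher embedding points toward "chair" while its annotation says "table", so a noisy 2D embedding never becomes a distillation target. The loss is lower when each student embedding matches its own teacher embedding than when the pairs are swapped.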

Paper Structure

This paper contains 13 sections, 12 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: An overview of our method. The FOLK method consists of three main components: (1) The teacher model uses the multi-view selection algorithm and the density-guided mask completion algorithm to obtain pairs of representative images and dense masks, which are then passed into the mask-guided CLIP visual encoder to get high-quality 2D CLIP embeddings. (2) The student model uses a VL-adapter to derive 3D instance embeddings from the point-wise features. (3) The label-guided distillation algorithm transfers the knowledge from 2D CLIP embeddings into the 3D instance embeddings.
  • Figure 2: Multi-view selection algorithm and density-guided mask completion algorithm. For each instance, the multi-view selection algorithm first projects the 3D points onto the 2D images to obtain sets of pixels that form sparse 2D masks, sorts the images by the number of projected pixels to keep the top-$K_\text{pre}$ images (a), and then removes images with similar poses (b). Given the sparse 2D masks of the selected images, the density-guided mask completion algorithm first performs a coarse uniform expansion (c) and then iteratively applies density-guided expansion to obtain dense 2D masks (d).
  • Figure 3: Qualitative results on the ScanNet200 dataset. We show detection results from our method alongside two representative baselines: OpenMask3D and OpenYOLO-3D. Our method detects more candidate instances with higher classification accuracy, and demonstrates superior segmentation quality in terms of both precision and recall.
  • Figure 4: Time breakdown of the average runtime for open-vocabulary 3D instance segmentation on the ScanNet200 dataset.
  • Figure 5: Additional qualitative results on ScanNet200.
  • ...and 1 more figures
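
The two-stage view selection described in the Figure 2 caption (rank views by projected pixel count, keep the top $K_\text{pre}$, then drop views with similar poses) can be sketched as follows. This is a hedged illustration only: the caption does not say how pose similarity is measured, so the sketch assumes a simple Euclidean distance threshold between camera positions. The `select_views` function, the dictionary fields, and the `min_distance` parameter are hypothetical.

```python
import math


def select_views(views, k_pre, min_distance):
    """Pick representative views for one 3D instance.

    Each view is a dict with:
      - "id": a view identifier
      - "num_pixels": how many of the instance's 3D points project into the view
      - "position": the camera position as an (x, y, z) tuple
    """
    # Stage (a): keep the k_pre views in which the instance is most visible.
    ranked = sorted(views, key=lambda v: v["num_pixels"], reverse=True)[:k_pre]

    # Stage (b): greedily drop views whose camera lies too close to a kept view.
    # Distance between camera positions is an assumed stand-in for pose similarity.
    selected = []
    for view in ranked:
        far_enough = all(
            math.dist(view["position"], kept["position"]) >= min_distance
            for kept in selected
        )
        if far_enough:
            selected.append(view)
    return selected


# Toy candidate views (hypothetical values).
views = [
    {"id": "a", "num_pixels": 500, "position": (0.0, 0.0, 0.0)},
    {"id": "b", "num_pixels": 480, "position": (0.05, 0.0, 0.0)},
    {"id": "c", "num_pixels": 300, "position": (2.0, 0.0, 0.0)},
    {"id": "d", "num_pixels": 10, "position": (5.0, 5.0, 0.0)},
]
selected_ids = [v["id"] for v in select_views(views, k_pre=3, min_distance=0.5)]
print(selected_ids)
```

Processing the views in order of decreasing visibility means that, among near-duplicate poses, the view showing more of the instance is the one that survives. View "b" is dropped as a near-duplicate of "a", and "d" never enters the candidate set because it falls outside the top three by pixel count.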