EPSegFZ: Efficient Point Cloud Semantic Segmentation for Few- and Zero-Shot Scenarios with Language Guidance

Jiahui Wang, Haiyue Zhu, Haoren Guo, Abdullah Al Mamun, Cheng Xiang, Tong Heng Lee

TL;DR

EPSegFZ tackles the challenge of few- and zero-shot 3D point-cloud semantic segmentation without relying on pre-training. It introduces ProERA to emphasize high-frequency foreground details, LGPE to fuse textual support via CLIP into prototypes, and DRPE to encode query–prototype spatial relations in latent space for precise cross-attention, all trained from scratch. The approach yields state-of-the-art mIoU on S3DIS and ScanNet and demonstrates robust zero-shot capability through language-guided prototypes, while maintaining low model complexity (~2.02M parameters) and efficient training. This work makes few-shot semantic segmentation (FS-SemSeg) more practical by reducing dependence on large pre-trained backbones and by leveraging multimodal guidance to improve cross-domain adaptability and edge-focused segmentation quality.

Abstract

Recent approaches for few-shot 3D point cloud semantic segmentation typically require a two-stage learning process, i.e., a pre-training stage followed by a few-shot training stage. While effective, these methods rely heavily on pre-training, which hinders model flexibility and adaptability. Some models have tried to avoid pre-training but fail to capture sufficient information. In addition, current approaches focus on visual information in the support set and neglect or do not fully exploit other useful data, such as textual annotations. This inadequate utilization of support information impairs the performance of the model and restricts its zero-shot ability. To address these limitations, we present a novel pre-training-free network, named Efficient Point Cloud Semantic Segmentation for Few- and Zero-shot scenarios (EPSegFZ). EPSegFZ incorporates three key components: a Prototype-Enhanced Registers Attention (ProERA) module and a Dual Relative Positional Encoding (DRPE)-based cross-attention mechanism, which together improve feature extraction and build accurate query-prototype correspondences without pre-training, and a Language-Guided Prototype Embedding (LGPE) module, which leverages textual information from the support set to improve few-shot performance and enable zero-shot inference. Extensive experiments show that our method outperforms the state-of-the-art method by 5.68% and 3.82% on the S3DIS and ScanNet benchmarks, respectively.

Paper Structure

This paper contains 34 sections, 12 equations, 17 figures, 16 tables, 1 algorithm.

Figures (17)

  • Figure 1: Visualized frequency spectrum of embedded features from Seg-PN (left) and Ours (right) (both are pre-training-free methods). Our latent features are rich and uniform across frequency bands, while Seg-PN overlooks high-frequency components.
  • Figure 2: The visualized architecture of our EPSegFZ. A ProERA module first captures high-frequency information and refines the extracted features. Then, an LGPE module dynamically updates the class prototypes with textual embeddings. After that, a DRPE-based cross-attention builds the correspondence between prototypes and query features. Finally, the prediction result is obtained by a dot product. The red block Avg. represents the average pooling operation.
  • Figure 3: Visualized t-SNE embedding of feature tokens for prediction. With our LGPE and DRPE, same-class features form a more compact distribution, enhancing the discriminative ability. Colored points represent semantic classes.
  • Figure 4: Visualized heatmaps of query-registers and query-prototypes similarities. The distinct focused region of registers helps the model differentiate between object-related and object-less areas. The updated prototypes effectively correlate with the query object, whereas the raw prototypes lack sufficient focus on the target object.
  • Figure 5: Visualized segmentation results on the S3DIS dataset. Our method achieves higher segmentation accuracy than the baseline.
  • ...and 12 more figures
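
The Figure 2 caption mentions two generic steps that are easy to illustrate in isolation: average pooling of support features into class prototypes, and prediction by a dot product between query features and prototypes. The sketch below shows only that generic pattern, on toy 2-D features, using NumPy. It is not the authors' implementation: ProERA, LGPE, and DRPE are omitted entirely, and the function names `average_pool_prototypes` and `predict_by_dot_product` as well as the toy data are assumptions made for illustration.

```python
import numpy as np


def average_pool_prototypes(support_feats, support_labels, num_classes):
    """Average the support features of each class into one prototype.

    Hypothetical helper: a plain per-class mean, standing in for the
    'Avg.' block of Figure 2. Classes with no support points keep a
    zero prototype.
    """
    dim = support_feats.shape[1]
    prototypes = np.zeros((num_classes, dim), dtype=support_feats.dtype)
    for c in range(num_classes):
        mask = support_labels == c
        if mask.any():
            prototypes[c] = support_feats[mask].mean(axis=0)
    return prototypes


def predict_by_dot_product(query_feats, prototypes):
    """Assign each query point to the prototype with the largest dot product."""
    logits = query_feats @ prototypes.T  # shape: (num_query_points, num_classes)
    return logits.argmax(axis=1)


# Toy data (assumed, not from the paper): four support points, two classes.
support_feats = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
support_labels = np.array([0, 0, 1, 1])
prototypes = average_pool_prototypes(support_feats, support_labels, num_classes=2)

query_feats = np.array([[0.8, 0.2], [0.2, 0.8]])
pred = predict_by_dot_product(query_feats, prototypes)
print(pred)
```

In the pipeline described by the caption, the prototypes would first be updated with textual embeddings and related to the query features through cross-attention before this final dot product; the sketch skips those stages.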