EPSegFZ: Efficient Point Cloud Semantic Segmentation for Few- and Zero-Shot Scenarios with Language Guidance
Jiahui Wang, Haiyue Zhu, Haoren Guo, Abdullah Al Mamun, Cheng Xiang, Tong Heng Lee
TL;DR
EPSegFZ tackles few- and zero-shot 3D point-cloud semantic segmentation without relying on pre-training. It introduces three modules, all trained from scratch: ProERA, which emphasizes high-frequency foreground details; LGPE, which fuses CLIP-encoded textual support information into prototypes; and DRPE, which encodes query–prototype spatial relations in latent space for precise cross-attention. The approach yields state-of-the-art mIoU on S3DIS and ScanNet and demonstrates robust zero-shot capability through language-guided prototypes, while maintaining low model complexity (~2.02M parameters) and efficient training. This work makes few-shot semantic segmentation more practical by reducing dependence on large pre-trained backbones and by leveraging multimodal guidance to improve cross-domain adaptability and edge-focused segmentation quality.
Abstract
Recent approaches for few-shot 3D point cloud semantic segmentation typically require a two-stage learning process, i.e., a pre-training stage followed by a few-shot training stage. While effective, these methods rely heavily on pre-training, which hinders model flexibility and adaptability. Some models have tried to avoid pre-training but fail to capture sufficient information. In addition, current approaches focus on visual information in the support set and neglect, or do not fully exploit, other useful data such as textual annotations. This inadequate use of support information impairs model performance and restricts zero-shot ability. To address these limitations, we present a novel pre-training-free network, named Efficient Point Cloud Semantic Segmentation for Few- and Zero-shot scenarios (EPSegFZ). EPSegFZ incorporates three key components: a Prototype-Enhanced Registers Attention (ProERA) module and a Dual Relative Positional Encoding (DRPE)-based cross-attention mechanism, which together improve feature extraction and build accurate query-prototype correspondences without pre-training, and a Language-Guided Prototype Embedding (LGPE) module, which leverages textual information from the support set to improve few-shot performance and enable zero-shot inference. Extensive experiments show that our method outperforms the state-of-the-art method by 5.68% and 3.82% on the S3DIS and ScanNet benchmarks, respectively.
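
To make the idea of language-guided prototypes concrete, the following is a minimal sketch of generic prototype-based segmentation, not the EPSegFZ implementation. It assumes that point features and class-name text embeddings already live in a shared space (in practice the text embedding would come from a text encoder such as CLIP; here it is a plain list). The function names, the mixing weight `alpha`, and the simple linear blend are hypothetical choices made for illustration; the paper's LGPE, ProERA, and DRPE modules are not reproduced here.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def masked_average(features, mask):
    """Visual prototype: mean feature of the support points marked as foreground."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("support mask selects no points")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def build_prototype(text_embedding, support_features=None, support_mask=None, alpha=0.5):
    """Blend a visual prototype with a text embedding.

    With no support features (zero-shot), the text embedding alone is used.
    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    text = normalize(text_embedding)
    if support_features is None:
        return text
    visual = normalize(masked_average(support_features, support_mask))
    return normalize([alpha * v + (1 - alpha) * t for v, t in zip(visual, text)])

def segment(query_features, prototypes):
    """Label each query point with the class whose prototype has the highest cosine similarity."""
    labels = []
    for feature in query_features:
        q = normalize(feature)
        scores = {name: sum(a * b for a, b in zip(q, p)) for name, p in prototypes.items()}
        labels.append(max(scores, key=scores.get))
    return labels

# Toy 2-D example: one class built from support points plus text, one from text only.
support = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
mask = [1, 1, 0]
prototypes = {
    "chair": build_prototype([1.0, 0.2], support, mask),  # few-shot: visual + text
    "background": build_prototype([0.0, 1.0]),            # zero-shot: text only
}
labels = segment([[0.8, 0.1], [0.1, 0.9]], prototypes)
```

In this sketch the few-shot and zero-shot cases share one code path: a class with support points gets a blended prototype, while a class with only a name falls back to its text embedding, which is the general reason textual support information can enable zero-shot inference.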
