3D-PointZshotS: Geometry-Aware 3D Point Cloud Zero-Shot Semantic Segmentation Narrowing the Visual-Semantic Gap

Minmin Yang, Huantao Ren, Senem Velipasalar

TL;DR

This work tackles zero-shot semantic segmentation in 3D point clouds by bridging the semantic-visual gap through Latent Geometric Prototypes (LGPs). It introduces a geometry-consistency generator that uses cross-attention with LGPs and an InfoNCE-based self-consistency loss, and re-represents both visual and semantic features in a shared geometric space for robust alignment. The method uses a three-step training pipeline and an inference scheme based on similarity in the LGP space, enabling effective transfer to unseen classes. Experiments on S3DIS, ScanNet, and SemanticKITTI demonstrate state-of-the-art harmonic mIoU, validating the importance of geometry-aware generation and cross-modal alignment for 3D zero-shot segmentation.
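To make the InfoNCE-based self-consistency idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each feature from a clean input is paired with the feature from a perturbed copy of the same point (the positive), and that features of the other points in the batch serve as negatives. The function names, the cosine similarity, the temperature of 0.1, and the toy vectors are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss.

    positives[i] is the positive for anchors[i]; every other row of
    `positives` acts as a negative for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)

# Hypothetical toy features: "clean" vs. slightly perturbed versions.
clean = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
perturbed = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# Mismatched pairing, used only to show that the loss grows.
shuffled = [perturbed[1], perturbed[2], perturbed[0]]

loss_consistent = info_nce(clean, perturbed)
loss_inconsistent = info_nce(clean, shuffled)
print(loss_consistent < loss_inconsistent)
```

The loss is small when each feature stays closest to its own perturbed counterpart and grows when the pairing is broken, which is the robustness property a self-consistency loss is meant to encourage.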

Abstract

Existing zero-shot 3D point cloud segmentation methods often struggle with limited transferability from seen classes to unseen classes and from semantic to visual space. To alleviate this, we introduce 3D-PointZshotS, a geometry-aware zero-shot segmentation framework that enhances both feature generation and alignment using latent geometric prototypes (LGPs). Specifically, we integrate LGPs into a generator via a cross-attention mechanism, enriching semantic features with fine-grained geometric details. To further enhance stability and generalization, we introduce a self-consistency loss, which enforces feature robustness against point-wise perturbations. Additionally, we re-represent visual and semantic features in a shared space, bridging the semantic-visual gap and facilitating knowledge transfer to unseen classes. Experiments on three real-world datasets, namely ScanNet, SemanticKITTI, and S3DIS, demonstrate that our method achieves superior performance over four baselines in terms of harmonic mIoU. The code is available on GitHub at https://github.com/LexieYang/3D-PointZshotS.
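The harmonic mIoU used for evaluation is conventionally the harmonic mean of the mIoU over seen classes and the mIoU over unseen classes, $\mathrm{hmIoU} = 2 \cdot \mathrm{mIoU}_s \cdot \mathrm{mIoU}_u / (\mathrm{mIoU}_s + \mathrm{mIoU}_u)$. A minimal sketch follows; the per-class IoU values are hypothetical and are not results from the paper.

```python
def mean_iou(per_class_iou):
    """Arithmetic mean of per-class IoU values."""
    return sum(per_class_iou) / len(per_class_iou)

def harmonic_miou(seen_ious, unseen_ious):
    """Harmonic mean of the seen-class mIoU and the unseen-class mIoU."""
    m_seen = mean_iou(seen_ious)
    m_unseen = mean_iou(unseen_ious)
    if m_seen + m_unseen == 0:
        return 0.0
    return 2 * m_seen * m_unseen / (m_seen + m_unseen)

# Hypothetical per-class IoUs, for illustration only.
seen = [0.60, 0.80]    # seen-class mIoU = 0.70
unseen = [0.10, 0.30]  # unseen-class mIoU = 0.20
hmiou = harmonic_miou(seen, unseen)
print(round(hmiou, 4))
```

Because the harmonic mean is dominated by the smaller term, a model that scores well on seen classes but fails on unseen ones still receives a low hmIoU, which is why the metric suits the zero-shot setting.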

Paper Structure

This paper contains 14 sections, 2 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: Three-step training procedure: (1) feature extractor pre-training on seen classes; (2) geometric consistency-aware generator training on seen-class data, where $Q$, $K$, and $V$ denote queries, keys, and values, respectively; (3) visual-semantic alignment via LGPs.
  • Figure 2: Qualitative comparison on three datasets: the first column shows the ground truth; the second, 3DGenZ predictions; the third, SV-Seg predictions; and the fourth, our predictions. Red rectangles indicate regions where our method performs better. Best viewed when zoomed in.
  • Figure 3: The distribution of LGP weights from visual and semantic features.
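The captions above mention queries, keys, and values (Figure 1) and LGP weights (Figure 3). One plausible reading, sketched below, is that a feature acts as the query, the LGPs supply keys and values, and the resulting attention distribution gives that feature's weights over the prototypes. This is an assumption made for illustration: the sketch uses single-head scaled dot-product attention with no learned projections, and the prototypes and features are hypothetical toy vectors rather than anything taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(query, prototypes):
    """Attend from one query vector over a set of prototype vectors.

    Keys and values are both the raw prototypes (an assumption; a real
    model would typically apply learned projections first).
    Returns (weights over prototypes, attended output vector).
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, proto)) / math.sqrt(d)
        for proto in prototypes
    ]
    weights = softmax(scores)
    output = [
        sum(w * proto[j] for w, proto in zip(weights, prototypes))
        for j in range(d)
    ]
    return weights, output

# Hypothetical prototypes and features (4-dimensional toy example).
lgps = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
visual_feature = [4.0, 0.5, 0.0, 0.0]
semantic_feature = [3.0, 1.0, 0.0, 0.0]

w_vis, out_vis = cross_attend(visual_feature, lgps)
w_sem, out_sem = cross_attend(semantic_feature, lgps)
print(w_vis.index(max(w_vis)) == w_sem.index(max(w_sem)))
```

Under this reading, a visual feature and a semantic feature of the same class would be expected to produce similar weight distributions over the prototypes, which is the kind of agreement a plot of LGP weights can make visible.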