SNAP: Towards Segmenting Anything in Any Point Cloud

Aniket Gupta, Hanhui Wang, Charles Saunders, Aruni RoyChowdhury, Hanumant Singh, Huaizu Jiang

TL;DR

SNAP addresses the absence of a general-purpose interactive 3D segmentation tool by unifying spatial and text-based prompts in a single architecture and training it on seven diverse datasets with domain-aware normalization. It combines a Point Transformer-based encoder, a spatial-prompted masking module, and text-prompt alignment through CLIP embeddings to deliver panoptic and open-vocabulary segmentation across indoor, outdoor, and aerial domains. The approach includes an automatic prompt-generation mechanism for text prompts and a multi-term loss that supervises masks, confidence, and semantic alignment. It achieves state-of-the-art zero-shot results on spatial-prompted tasks and competitive performance on text-prompted tasks. This cross-domain generalization and prompt flexibility make SNAP a practical, scalable tool for 3D annotation, with potential for further improvement via self-supervised learning on unlabeled data.

Abstract

Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. Project page: https://neu-vi.github.io/SNAP/
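
The abstract does not spell out how domain-adaptive normalization is built. One plausible reading, sketched below, is a normalization layer whose scale and shift parameters are kept separately for each domain while all other weights are shared. The class name `DomainNorm`, the domain labels, and the layer-norm-style statistics are assumptions made for illustration and are not taken from the paper.

```python
import math


class DomainNorm:
    """Hypothetical per-domain normalization (illustrative, not the paper's code).

    Each feature vector is standardized on its own, then scaled and shifted
    with parameters that belong to the domain the point cloud comes from.
    """

    def __init__(self, domains, num_features, eps=1e-5):
        self.eps = eps
        # One independent (gamma, beta) pair per domain.
        self.params = {
            d: {"gamma": [1.0] * num_features, "beta": [0.0] * num_features}
            for d in domains
        }

    def __call__(self, features, domain):
        p = self.params[domain]
        out = []
        for vec in features:
            mean = sum(vec) / len(vec)
            var = sum((x - mean) ** 2 for x in vec) / len(vec)
            inv_std = 1.0 / math.sqrt(var + self.eps)
            out.append([
                g * (x - mean) * inv_std + b
                for x, g, b in zip(vec, p["gamma"], p["beta"])
            ])
        return out


# Toy usage: the same features pass through different domain parameters.
norm = DomainNorm(["indoor", "outdoor", "aerial"], num_features=4)
norm.params["outdoor"]["beta"] = [1.0] * 4  # pretend this was learned
point_features = [[1.0, 2.0, 3.0, 4.0]]
indoor_out = norm(point_features, "indoor")
outdoor_out = norm(point_features, "outdoor")
```

In this sketch only the small per-domain parameter sets differ between domains, which is one way a shared backbone could be kept from averaging over incompatible feature statistics.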

Paper Structure

This paper contains 34 sections, 7 equations, 20 figures, 17 tables, 1 algorithm.

Figures (20)

  • Figure 1: Comparison of models on IoU@1 Click across multiple domains. SNAP is a unified interactive point cloud segmentation model trained on multiple datasets spanning multiple domains. It generalizes robustly across a wide array of benchmarks.
  • Figure 2: Overview of SNAP. SNAP encodes point clouds and prompts separately, then uses a Mask Decoder to generate segmentation masks. Text prompts are handled by matching CLIP embeddings with predicted mask embeddings for semantic classification.
  • Figure 3: Point clouds from different domains vary significantly in their properties. (a) STPLS3D provides dense point clouds from an aerial view using RGB photogrammetry in the 50 m range, (b) KITTI provides LiDAR data in the 150 m range, (c) ScanNet provides point clouds in the 10 m range.
  • Figure 4: Dataset Norm vs Domain Norm. Domain-norm simplifies the overall architecture and improves zero-shot generalization.
  • Figure 5: Qualitative segmentation results of open-set scene understanding on the ScanNet++ dataset. Given a text prompt in the format “Segment {open-set vocabulary}”, our SNAP model finds the corresponding masks in the scenes.
  • ...and 15 more figures
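
The Figure 2 caption describes text prompts as being handled by matching CLIP embeddings with predicted mask embeddings. A minimal sketch of that matching step follows, assuming cosine similarity and a best-match rule. The function names and the toy 3-dimensional vectors are hypothetical stand-ins; real CLIP text embeddings and SNAP mask embeddings would come from the respective models.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def classify_masks(mask_embeddings, text_embeddings, labels):
    """Assign each mask the label whose text embedding is most similar.

    Returns one (label, cosine_similarity) pair per mask.
    """
    text_unit = [l2_normalize(t) for t in text_embeddings]
    results = []
    for m in mask_embeddings:
        m_unit = l2_normalize(m)
        sims = [sum(a * b for a, b in zip(m_unit, t)) for t in text_unit]
        best = max(range(len(sims)), key=sims.__getitem__)
        results.append((labels[best], sims[best]))
    return results


# Toy stand-ins for text embeddings of two queries.
labels = ["chair", "table"]
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Toy stand-ins for the embeddings of two predicted masks.
mask_embeddings = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

predictions = classify_masks(mask_embeddings, text_embeddings, labels)
```

Because the label set is just a list of text embeddings, the same routine covers a fixed class list (panoptic-style labeling) and arbitrary user queries (open-vocabulary labeling), under the assumptions stated above.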