SNAP: Towards Segmenting Anything in Any Point Cloud
Aniket Gupta, Hanhui Wang, Charles Saunders, Aruni RoyChowdhury, Hanumant Singh, Huaizu Jiang
TL;DR
SNAP addresses the absence of a general-purpose interactive 3D segmentation tool by unifying spatial and text-based prompts in a single architecture and training it on seven diverse datasets with domain-adaptive normalization. It combines a Point Transformer-based encoder, a spatial-prompted masking module, and text-prompt alignment through CLIP embeddings to deliver panoptic and open-vocabulary segmentation across indoor, outdoor, and aerial domains. The approach includes an automatic prompt-generation mechanism for text-prompted segmentation and a multi-term loss that supervises masks, confidence, and semantic alignment. It achieves state-of-the-art results on 8 of 9 zero-shot benchmarks for spatial-prompted segmentation and competitive results on all 5 text-prompted benchmarks. This cross-domain generalization and prompt flexibility make SNAP a practical, scalable tool for 3D annotation, with potential for further improvement via self-supervised learning on unlabeled data.
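To make the idea of per-domain normalization concrete, here is a minimal sketch in plain Python. It assumes a layer-norm-style operation in which the statistics are computed per feature vector while each domain keeps its own scale and shift parameters. The class name `DomainNorm`, the domain labels, and the parameter values are hypothetical and chosen for illustration; the summary above does not specify the exact form SNAP uses.

```python
import math


class DomainNorm:
    """Illustrative normalization with a separate (scale, shift) pair per domain.

    Assumption: layer-norm-style statistics with domain-specific affine
    parameters. This is a sketch, not the paper's implementation.
    """

    def __init__(self, dim, domains, eps=1e-5):
        self.eps = eps
        # Hypothetical parameters; in a trained model these would be learned.
        self.scale = {d: [1.0] * dim for d in domains}
        self.shift = {d: [0.0] * dim for d in domains}

    def __call__(self, feature, domain):
        # Normalize the feature vector to zero mean and unit variance.
        mean = sum(feature) / len(feature)
        var = sum((x - mean) ** 2 for x in feature) / len(feature)
        inv_std = 1.0 / math.sqrt(var + self.eps)
        # Apply the affine parameters that belong to the given domain.
        g, b = self.scale[domain], self.shift[domain]
        return [g[i] * (x - mean) * inv_std + b[i] for i, x in enumerate(feature)]


norm = DomainNorm(dim=4, domains=["indoor", "outdoor", "aerial"])
norm.scale["outdoor"] = [2.0] * 4  # pretend the outdoor domain learned a larger scale

feat = [1.0, 2.0, 3.0, 4.0]
indoor_out = norm(feat, "indoor")
outdoor_out = norm(feat, "outdoor")
```

The same input feature is mapped differently depending on its domain label, which is the property that lets a shared backbone absorb dataset-specific statistics without forcing all domains through one set of normalization parameters.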
Abstract
Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. The project page is at https://neu-vi.github.io/SNAP/
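The step of matching mask proposals against text embeddings can be sketched as a nearest-neighbor lookup under cosine similarity. The example below is an illustration only: the function names `cosine` and `label_masks` are hypothetical, and the 3-dimensional vectors are toy stand-ins for real CLIP embeddings (which are much higher dimensional). It assumes each mask proposal already has an embedding in the same space as the text queries.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def label_masks(mask_embeddings, text_embeddings):
    """Assign each mask proposal the text query with the highest similarity.

    mask_embeddings: list of vectors, one per mask proposal.
    text_embeddings: dict mapping a query string to its embedding.
    Returns a list of (best_query, similarity) pairs.
    """
    labels = []
    for emb in mask_embeddings:
        scores = {name: cosine(emb, t) for name, t in text_embeddings.items()}
        best = max(scores, key=scores.get)
        labels.append((best, scores[best]))
    return labels


# Toy vectors standing in for CLIP text and mask embeddings (assumption).
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
    "tree": [0.0, 0.0, 1.0],
}
mask_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

labels = label_masks(mask_embeddings, text_embeddings)
```

Because the label set is just a dictionary of query embeddings, swapping in a fixed class list gives closed-vocabulary labeling, while arbitrary user-supplied strings give open-vocabulary behavior under the same matching rule.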
