Back to Point: Exploring Point-Language Models for Zero-Shot 3D Anomaly Detection

Kaiqiang Li, Gang Li, Mingle Zhou, Min Li, Delong Han, Jin Wan

Abstract

Zero-shot (ZS) 3D anomaly detection is crucial for reliable industrial inspection, as it enables detecting and localizing defects without requiring any target-category training data. Existing approaches render 3D point clouds into 2D images and leverage pre-trained Vision-Language Models (VLMs) for anomaly detection. However, such strategies inevitably discard geometric details and exhibit limited sensitivity to local anomalies. In this paper, we revisit intrinsic 3D representations and explore the potential of pre-trained Point-Language Models (PLMs) for ZS 3D anomaly detection. We propose BTP (Back To Point), a novel framework that effectively aligns 3D point cloud and textual embeddings. Specifically, BTP aligns multi-granularity patch features with textual representations for localized anomaly detection, while incorporating geometric descriptors to enhance sensitivity to structural anomalies. Furthermore, we introduce a joint representation learning strategy that leverages auxiliary point cloud data to improve robustness and enrich anomaly semantics. Extensive experiments on Real3D-AD and Anomaly-ShapeNet demonstrate that BTP achieves superior performance in ZS 3D anomaly detection. Code will be available at https://github.com/wistful-8029/BTP-3DAD.

Paper Structure

This paper contains 17 sections, 8 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Comparison between VLM-based and PLM-based (ours) zero-shot 3D anomaly detection. (a): VLM-based approaches rely on multi-view rendering and back-projection, making their anomaly localization performance sensitive to the number and angles of rendered views. (b): Our proposed PLM-based approach directly processes point clouds, avoiding such view-dependent limitations and achieving more accurate 3D anomaly localization.
  • Figure 2: Overview of BTP. The input point cloud is first processed by the 3D encoder and the Geometric Feature Creation Module (GFCM) to extract implicit semantic features and explicit geometric descriptors, respectively. The semantic features include patch features, global embeddings, and the CLS token. To enable anomaly localization, patch features, the CLS token, and geometric descriptors are fed into the Multi-Granularity Feature Embedding Module (MGFEM) to generate patch-level embeddings, which are compared with text embeddings for point-level anomaly detection. Global embeddings are aligned with text embeddings for object-level anomaly detection. A joint loss integrates geometric, local, and global supervision signals to jointly optimize GFCM, MGFEM, and the learnable text prompts.
  • Figure 3: Visualization of anomaly localization on Real3D-AD.
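
The Figure 2 caption describes comparing patch-level and global embeddings with text embeddings. The sketch below illustrates one common way such a comparison can be done: cosine similarity against a "normal" and an "anomalous" text embedding, followed by a temperature-scaled softmax. It is not the authors' implementation. The function name `anomaly_probability`, the two-prompt setup, and the temperature value 0.07 are assumptions made for illustration only.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def anomaly_probability(embeddings, text_normal, text_anomalous, temperature=0.07):
    """Return the softmax probability of the 'anomalous' prompt per embedding.

    embeddings: (N, D) patch embeddings, or a single (D,) global embedding.
    text_normal, text_anomalous: (D,) text embeddings (hypothetical prompts).
    temperature: assumed scaling factor; not taken from the paper.
    """
    emb = l2_normalize(np.atleast_2d(np.asarray(embeddings, dtype=float)))
    text = l2_normalize(np.stack([text_normal, text_anomalous]).astype(float))
    logits = emb @ text.T / temperature          # shape (N, 2)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[:, 1]                           # column 1 = 'anomalous'


# Toy 2-D example with hand-picked embeddings (not real model outputs).
text_normal = np.array([1.0, 0.0])
text_anomalous = np.array([0.0, 1.0])
patches = np.array([
    [1.0, 0.0],   # aligned with the 'normal' prompt: score close to 0
    [0.0, 1.0],   # aligned with the 'anomalous' prompt: score close to 1
    [1.0, 1.0],   # equally similar to both prompts: score of 0.5
])
patch_scores = anomaly_probability(patches, text_normal, text_anomalous)
object_score = anomaly_probability(patches.mean(axis=0), text_normal, text_anomalous)[0]
print(patch_scores, object_score)
```

In this sketch the same scoring function serves both granularities: applied to patch embeddings it gives a localization map, and applied to a single global embedding it gives an object-level score. Averaging the patches to form the global embedding is only a stand-in here; in the framework described above, the global embedding comes from the 3D encoder.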