
Towards Zero-shot Point Cloud Anomaly Detection: A Multi-View Projection Framework

Yuqi Cheng, Yunkang Cao, Guoyang Xie, Zhichao Lu, Weiming Shen

TL;DR

This paper introduces the Multi-View Projection (MVP) framework, which uses pre-trained Vision-Language Models (VLMs) to detect anomalies in point clouds, and integrates learnable visual and adaptive text prompting to fine-tune these VLMs and improve their detection performance.

Abstract

Detecting anomalies within point clouds is crucial for various industrial applications, but traditional unsupervised methods face challenges due to data acquisition costs, early-stage production constraints, and limited generalization across product categories. To overcome these challenges, we introduce the Multi-View Projection (MVP) framework, leveraging pre-trained Vision-Language Models (VLMs) to detect anomalies. Specifically, MVP projects point cloud data into multi-view depth images, thereby translating point cloud anomaly detection into image anomaly detection. Following zero-shot image anomaly detection methods, pre-trained VLMs are utilized to detect anomalies on these depth images. Given that pre-trained VLMs are not inherently tailored for zero-shot point cloud anomaly detection and may lack specificity, we propose the integration of learnable visual and adaptive text prompting techniques to fine-tune these VLMs, thereby enhancing their detection performance. Extensive experiments on the MVTec 3D-AD and Real3D-AD datasets demonstrate the superior zero-shot anomaly detection performance of the proposed MVP framework and the effectiveness of the prompting techniques. Real-world evaluations on automotive plastic part inspection further show that the proposed method also generalizes to practical unseen scenarios. The code is available at https://github.com/hustCYQ/MVP-PCLIP.
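
To make the projection step concrete, the following is a minimal sketch of how a point cloud could be rendered into several depth images. It is not the authors' implementation: the orthographic camera, the rotation angles, the image size, the z-buffer rule, and the function names `rotation_matrix`, `project_to_depth`, and `multi_view_depth` are all assumptions made for illustration.

```python
import numpy as np

def rotation_matrix(rx, ry):
    """Rotation about the x-axis by rx, then about the y-axis by ry (radians)."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    rot_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    return rot_y @ rot_x

def project_to_depth(points, rot, size=64):
    """Orthographically project an (M, 3) point cloud to a size x size depth image.

    Returns the depth image and, for every point, the (row, col) pixel it lands on.
    Assumption: an orthographic camera looking along the z-axis of the rotated cloud.
    """
    p = (points - points.mean(axis=0)) @ rot.T
    xy = p[:, :2]
    # Shared scale for x and y so the aspect ratio of the object is preserved.
    span = (xy.max(axis=0) - xy.min(axis=0)).max()
    span = span if span > 0 else 1.0
    uv = (xy - xy.min(axis=0)) / span
    cols = np.clip(np.round(uv[:, 0] * (size - 1)).astype(int), 0, size - 1)
    rows = np.clip(np.round(uv[:, 1] * (size - 1)).astype(int), 0, size - 1)
    # Shift depth so every value is >= 1; the value 0 is reserved for empty pixels.
    z = p[:, 2] - p[:, 2].min() + 1.0
    depth = np.full((size, size), np.inf)
    # Simple z-buffer: keep the smallest depth that falls on each pixel.
    np.minimum.at(depth, (rows, cols), z)
    depth[np.isinf(depth)] = 0.0
    return depth, np.stack([rows, cols], axis=1)

def multi_view_depth(points, angles, size=64):
    """Render one depth image per (rx, ry) view angle (angles are hypothetical)."""
    return [project_to_depth(points, rotation_matrix(rx, ry), size) for rx, ry in angles]
```

Each view also returns the pixel index of every point, which is the kind of point-to-pixel correspondence needed to carry image-level anomaly scores back to the point cloud.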

Paper Structure (33 sections, 17 equations, 12 figures, 6 tables)


Figures (12)

  • Figure 1: Training schemes of different point cloud anomaly detection settings: (a) Vanilla point cloud anomaly detection and (b) Zero-shot point cloud anomaly detection. In contrast to vanilla point cloud anomaly detection, zero-shot point cloud anomaly detection uses no samples from the testing categories and instead trains on annotated auxiliary data, aiming to build a unified model.
  • Figure 2: Framework of Multi-View Projection (MVP). The original point cloud is projected to multi-view depth images $\boldsymbol{V}_i, 1 \le i \le N$, and each depth image is passed to a zero-shot image anomaly detection model with shared weights. The anomalies are detected by integrating the outputs of the vision-language models.
  • Figure 3: Comparison between simple integrations of the proposed MVP with existing zero-shot image anomaly detection methods and the proposed MVP-PCLIP. (a) MVP-WinCLIP, (b) MVP-SAA, (c) MVP-APRIL-GAN, (d) MVP-PCLIP.
  • Figure 4: Projection of a point cloud from multiple views.
  • Figure 5: The framework of the proposed MVP-PCLIP. The depth images projected from multiple perspectives are fed into the image encoder with visual prompts to extract image features. The text prompts are constructed by combining learnable text prompts with predefined text prompts, and text features are extracted by the text encoder. Next, the point features are aggregated through the projection relationship between the point cloud and the images. Finally, the similarity between point features and text features is used for object-wise and point-wise anomaly detection.
  • ...and 7 more figures
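
The aggregation and scoring steps described in the Figure 5 caption can be sketched as follows. This is a hedged illustration, not the MVP-PCLIP code: averaging features over views, the two-row "normal"/"abnormal" text feature layout, the softmax temperature of 0.07, taking the maximum point score as the object score, and all function names are assumptions.

```python
import numpy as np

def aggregate_point_features(feature_maps, pixel_indices):
    """Average, over views, the pixel feature that each point projects onto.

    feature_maps:  list of (H, W, C) arrays, one per view.
    pixel_indices: list of (M, 2) integer arrays holding (row, col) per point.
    Assumption: a plain mean over views; other pooling rules are possible.
    """
    gathered = [fmap[pix[:, 0], pix[:, 1]] for fmap, pix in zip(feature_maps, pixel_indices)]
    return np.mean(gathered, axis=0)  # shape (M, C)

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, 1e-12)

def point_anomaly_scores(point_feats, text_feats, temperature=0.07):
    """Softmax over cosine similarity to the text features.

    text_feats: (2, C) array; row 0 is a 'normal' prompt, row 1 an 'abnormal' prompt
    (this layout is an assumption). Returns the 'abnormal' probability per point.
    """
    logits = l2_normalize(point_feats) @ l2_normalize(text_feats).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, 1]

def object_anomaly_score(point_scores):
    """Assumed object-level rule: the most anomalous point decides."""
    return float(point_scores.max())
```

With a 'normal' and an 'abnormal' text feature, a point whose aggregated feature aligns with the 'abnormal' direction receives a score close to 1, and the object-level score follows the highest-scoring point.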