CLIP3D-AD: Extending CLIP for 3D Few-Shot Anomaly Detection with Multi-View Images Generation

Zuo Zuo, Jiahao Dong, Yao Wu, Yanyun Qu, Zongze Wu

TL;DR

This paper proposes CLIP3D-AD, an efficient 3D-FSAD method built on CLIP. It transfers CLIP's strong generalization ability to 3D-FSAD and designs a multi-view fusion module that fuses features of multi-view images extracted by CLIP, enriching the visual representations and further strengthening vision-language correlation.

Abstract

Few-shot anomaly detection methods can effectively address the difficulty of collecting data in industrial scenarios. Compared to 2D few-shot anomaly detection (2D-FSAD), 3D few-shot anomaly detection (3D-FSAD) remains an essential but largely unexplored task. In this paper, we propose CLIP3D-AD, an efficient 3D-FSAD method built on CLIP, which transfers the strong generalization ability of CLIP to 3D-FSAD. Specifically, we synthesize anomalous images from the given normal images to form sample pairs that adapt CLIP for 3D anomaly classification and segmentation. For classification, we introduce an image adapter and a text adapter to fine-tune the global visual features and text features. Meanwhile, we propose a coarse-to-fine decoder to fuse and enrich the intermediate multi-layer visual representations of CLIP. To benefit from the geometric information of the point cloud and eliminate the modality and data discrepancy when it is processed by CLIP, we project and render the point cloud into multi-view normal and anomalous images. We then design a multi-view fusion module that fuses the features of the multi-view images extracted by CLIP and uses them to enrich the visual representations, further enhancing vision-language correlation. Extensive experiments demonstrate that our method achieves competitive performance on 3D few-shot anomaly classification and segmentation on the MVTec-3D AD dataset.
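To make the adapter idea above concrete, here is a minimal sketch of adapter-style fine-tuning over frozen CLIP features. The residual bottleneck MLP layout and the cosine-similarity anomaly score are illustrative assumptions; the paper's exact image/text adapter designs are not described in this summary.

```python
# Sketch: residual adapters on frozen CLIP embeddings (illustrative, not the paper's exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Residual bottleneck MLP applied to a frozen CLIP embedding."""

    def __init__(self, dim: int = 512, bottleneck: int = 128, residual: float = 0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.ReLU(inplace=True),
            nn.Linear(bottleneck, dim),
        )
        self.residual = residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Blend the adapted feature with the original frozen feature.
        return self.residual * self.net(x) + (1.0 - self.residual) * x


def anomaly_logits(image_feat: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between an adapted image feature and the
    [normal, anomalous] text features, scaled like CLIP's logits."""
    image_feat = F.normalize(image_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return 100.0 * image_feat @ text_feats.t()


if __name__ == "__main__":
    image_adapter = Adapter()
    text_adapter = Adapter()
    # Stand-ins for frozen CLIP outputs: one image embedding and two
    # text embeddings (e.g. "normal object" / "damaged object").
    image_feat = image_adapter(torch.randn(1, 512))
    text_feats = text_adapter(torch.randn(2, 512))
    probs = anomaly_logits(image_feat, text_feats).softmax(dim=-1)
    print(probs)  # [p(normal), p(anomalous)]
```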

Paper Structure

This paper contains 28 sections, 18 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Examples of 3D anomaly detection on MVTec-3D AD. The second and third rows are point clouds and RGB images. The fourth row shows our predicted anomaly areas, and the last row shows the ground-truth masks.
  • Figure 2: The framework of CLIP3D-AD. In the training phase, an anomalous image is synthesized as the negative sample for each given normal image, which serves as the positive sample. We use the frozen CLIP image encoder $f(\cdot)$ to extract global and local visual features and the frozen CLIP text encoder $g(\cdot)$ to extract normal and anomaly text features. We then introduce an image adapter $A_f(\cdot)$ and two text adapters $A_{cg}(\cdot)$ and $A_{sg}(\cdot)$ to adapt the original CLIP representations. Meanwhile, we project and render the point cloud into multi-view images and use a multi-view fusion module to fuse the multi-view visual features extracted by CLIP. The fused multi-view features are used to enhance the visual representations.
  • Figure 3: Pipeline of anomalies generation.
  • Figure 4: Example of generated multi-view images.
  • Figure 5: Architecture of the multi-view fusion module, consisting of multi-view global fusion and multi-view local fusion (a simplified sketch follows this list).
  • ...and 2 more figures
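
Below is a minimal sketch of multi-view feature fusion, assuming a simple cross-attention layout in which RGB-image patch tokens (queries) attend to tokens gathered from the rendered multi-view images (keys/values). This is an illustrative stand-in for the module in Figure 5, not the paper's exact global/local fusion design.

```python
# Sketch: cross-attention fusion of multi-view CLIP tokens into RGB tokens (illustrative).
import torch
import torch.nn as nn


class MultiViewFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, view_tokens: torch.Tensor) -> torch.Tensor:
        # rgb_tokens:  (B, N, C)   patch tokens of the RGB image
        # view_tokens: (B, V*N, C) patch tokens of V rendered views, flattened
        fused, _ = self.attn(rgb_tokens, view_tokens, view_tokens)
        return self.norm(rgb_tokens + fused)  # residual enhancement of the RGB tokens


if __name__ == "__main__":
    fusion = MultiViewFusion()
    rgb = torch.randn(2, 196, 768)          # 14x14 patch grid
    views = torch.randn(2, 4 * 196, 768)    # 4 rendered views
    print(fusion(rgb, views).shape)         # torch.Size([2, 196, 768])
```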