GS-CLIP: Zero-shot 3D Anomaly Detection by Geometry-Aware Prompt and Synergistic View Representation Learning

Zehao Deng, An Liu, Yan Wang

TL;DR

This work proposes the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process, and reports that GS-CLIP achieves superior detection performance on four large-scale public datasets.

Abstract

Zero-shot 3D Anomaly Detection is an emerging task that aims to detect anomalies in a target dataset without any target training data, which is particularly important in scenarios constrained by sample scarcity and data privacy concerns. While current methods adapt CLIP by projecting 3D point clouds into 2D representations, they face two challenges: the projection inherently loses some geometric details, and the reliance on a single 2D modality provides an incomplete visual understanding, limiting their ability to detect diverse anomaly types. To address these limitations, we propose the Geometry-Aware Prompt and Synergistic View Representation Learning (GS-CLIP) framework, which enables the model to identify geometric anomalies through a two-stage learning process. In stage 1, we dynamically generate text prompts embedded with 3D geometric priors. These prompts contain global shape context and local defect information distilled by our Geometric Defect Distillation Module (GDDM). In stage 2, we introduce a Synergistic View Representation Learning architecture that processes rendered and depth images in parallel. A Synergistic Refinement Module (SRM) subsequently fuses the features of both streams, capitalizing on their complementary strengths. Comprehensive experimental results on four large-scale public datasets show that GS-CLIP achieves superior performance in detection. Code is available at https://github.com/zhushengxinyue/GS-CLIP.

Paper Structure (31 sections, 16 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: Comparison of task settings between traditional Unsupervised 3D Anomaly Detection (U3DAD) and Zero-shot 3D Anomaly Detection (ZS3DAD). U3DAD is trained on positive (normal) samples and tested on samples of the same categories; ZS3DAD is trained on auxiliary, annotated data and tested on unseen target categories.
  • Figure 2: Example of the complementarity of rendered and depth images in anomaly detection. In the top row, the depth map effectively ignores surface texture interference to clearly show the dent anomaly; in the bottom row, the rendered image better captures the slight protrusion with insignificant depth change through lighting and shadow variations.
  • Figure 3: The overall architecture of GS-CLIP. The framework is optimized through a two-stage learning strategy. In stage 1, we generate text prompts embedded with geometric priors using a 3D feature extractor and a Geometric Defect Distillation Module. In stage 2, we design a synergistic architecture that processes rendered images and a LoRA-optimized depth image branch in parallel. The features from both branches are deeply fused by the Synergistic Refinement Module and finally compared with the text prompts to compute similarity for classification and segmentation.
  • Figure 4: Qualitative comparison of anomaly score maps between PointAD and our method. (M) denotes the multimodal setting, in which RGB images are additionally integrated.
  • Figure 5: Parameters in GDDM.
  • ...and 1 more figures
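
The last step in the Figure 3 caption, comparing fused visual features with text prompts to compute similarity for classification and segmentation, can be illustrated with a minimal CLIP-style sketch. This is not the authors' implementation. The function names, the temperature of 0.07, and the element-wise mean in `fuse_mean` (a placeholder for the Synergistic Refinement Module, whose internals are not described here) are all assumptions made for illustration.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a feature vector to (approximately) unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def abnormal_probability(feature, text_normal, text_abnormal, temperature=0.07):
    """Softmax over cosine similarities to a [normal, abnormal] prompt pair.

    Returns the probability assigned to the abnormal prompt.
    The temperature value is an assumption, not taken from the paper.
    """
    f = l2_normalize(feature)
    logit_n = dot(f, l2_normalize(text_normal)) / temperature
    logit_a = dot(f, l2_normalize(text_abnormal)) / temperature
    m = max(logit_n, logit_a)  # subtract the max for numerical stability
    exp_n = math.exp(logit_n - m)
    exp_a = math.exp(logit_a - m)
    return exp_a / (exp_n + exp_a)


def fuse_mean(rendered, depth):
    """Hypothetical stand-in for the SRM: element-wise mean of two streams."""
    return [(r + d) / 2.0 for r, d in zip(rendered, depth)]


def score_object(global_feature, patch_features, text_normal, text_abnormal):
    """Object-level score (classification) and per-patch scores (segmentation)."""
    image_score = abnormal_probability(global_feature, text_normal, text_abnormal)
    patch_map = [
        abnormal_probability(p, text_normal, text_abnormal) for p in patch_features
    ]
    return image_score, patch_map


# Toy 2-D features: axis 0 aligns with "normal", axis 1 with "abnormal".
text_normal, text_abnormal = [1.0, 0.0], [0.0, 1.0]
global_feature = fuse_mean([0.0, 1.0], [0.0, 3.0])
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_score, patch_map = score_object(
    global_feature, patches, text_normal, text_abnormal
)
```

In this toy setup, a patch aligned with the abnormal prompt receives a score near 1, a patch aligned with the normal prompt a score near 0, and a patch equally similar to both a score of 0.5. In a real pipeline the per-patch scores would be reshaped and upsampled into a score map like those compared in Figure 4.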