PointAD+: Learning Hierarchical Representations for Zero-shot 3D Anomaly Detection

Qihang Zhou, Shibo He, Jiangtao Yan, Wenchao Meng, Jiming Chen

TL;DR

This work tackles zero-shot 3D anomaly detection for unseen objects by transferring CLIP's 2D generalization to 3D. It introduces PointAD, which uses implicit point representations derived from renderings, and PointAD+, which adds explicit geometry-aware representations via G-aggregation and hierarchical representation learning. A cross-hierarchy contrastive alignment unifies rendering-based and geometry-based anomaly semantics, enabling robust 3D and multimodal (RGB-inclusive) detection without retraining CLIP. Extensive experiments across three datasets show state-of-the-art performance in ZS 3D and multimodal 3D anomaly detection, with thorough ablations validating module contributions and robustness.

Abstract

In this paper, we aim to transfer CLIP's robust 2D generalization capabilities to identify 3D anomalies across unseen objects of highly diverse class semantics. To this end, we propose a unified framework to comprehensively detect and segment 3D anomalies by leveraging both point- and pixel-level information. We first design PointAD, which leverages point-pixel correspondence to represent 3D anomalies through their associated rendering pixel representations. This approach is referred to as implicit 3D representation, as it focuses solely on rendering pixel anomalies and neglects the inherent spatial relationships within point clouds. We then propose PointAD+ to broaden the interpretation of 3D anomalies by introducing explicit 3D representation, which emphasizes spatial abnormality to uncover abnormal spatial relationships. Specifically, we propose G-aggregation, which incorporates geometry information so that the aggregated point representations become spatially aware. To capture rendering and spatial abnormality simultaneously, PointAD+ uses hierarchical representation learning, incorporating implicit and explicit anomaly semantics into hierarchical text prompts: rendering prompts for the rendering layer and geometry prompts for the geometry layer. A cross-hierarchy contrastive alignment is further introduced to promote interaction between the rendering and geometry layers, facilitating mutual anomaly learning. Finally, PointAD+ integrates anomaly semantics from both layers to capture generalized anomaly semantics. At test time, PointAD+ can integrate RGB information in a plug-and-play manner to further improve its detection performance. Extensive experiments demonstrate the superiority of PointAD+ in zero-shot (ZS) 3D anomaly detection across unseen objects with highly diverse class semantics, achieving a holistic understanding of abnormality.
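
The implicit 3D representation described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the visibility mask, the simple averaging over views, and the softmax temperature are all assumptions made for illustration. Each point gathers the pixel feature it projects to in every rendered view, averages over the views where it is visible, and is then scored against a "normal" and an "abnormal" text embedding in the style of CLIP.

```python
import numpy as np

def implicit_point_features(pixel_feats, point_to_pixel, visible):
    """Average per-view pixel features onto points (hypothetical helper).

    pixel_feats:    (V, P, D) features for P pixels in each of V renderings
    point_to_pixel: (V, N) index of the pixel each point projects to per view
    visible:        (V, N) boolean mask, True where the point is seen in the view
    returns:        (N, D) one feature per point
    """
    num_views = point_to_pixel.shape[0]
    # Gather the pixel feature that each point maps to in every view: (V, N, D).
    gathered = pixel_feats[np.arange(num_views)[:, None], point_to_pixel]
    weights = visible[..., None].astype(pixel_feats.dtype)
    summed = (gathered * weights).sum(axis=0)
    # Points that are never visible keep a zero feature instead of dividing by 0.
    counts = np.maximum(weights.sum(axis=0), 1.0)
    return summed / counts

def anomaly_scores(point_feats, text_normal, text_abnormal, temperature=0.07):
    """Softmax over cosine similarities to two text embeddings (assumed scoring)."""
    def l2_normalize(x):
        norm = np.linalg.norm(x, axis=-1, keepdims=True)
        return x / np.maximum(norm, 1e-12)

    points = l2_normalize(point_feats)
    texts = l2_normalize(np.stack([text_normal, text_abnormal]))  # (2, D)
    logits = points @ texts.T / temperature                        # (N, 2)
    logits = logits - logits.max(axis=1, keepdims=True)            # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs[:, 1]  # probability assigned to the "abnormal" prompt
```

In this sketch the point features only reflect what the renderings show, which is why the abstract calls the representation implicit: no 3D neighborhood information enters the computation.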

Paper Structure

This paper contains 33 sections, 18 equations, 20 figures, 15 tables.

Figures (20)

  • Figure 1: Motivation of zero-shot 3D anomaly detection. Top: The bend in a dowel can be detected using both RGB information and point relations. Middle and Bottom: Challenges arise when RGB information alone misinterprets similar appearances, such as chocolates on cookies resembling hole anomalies or surface damage on a potato blending with the foreground’s color patterns. However, effective detection can be achieved by modeling point relations within point clouds.
  • Figure 2: Schematic for PointAD and PointAD+.
  • Figure 3: Visualization of explicit and implicit point score maps. Top row: point segmentation using explicit point learning; Middle row: point segmentation using implicit point learning; Bottom row: ground truths.
  • Figure 3: Performance comparison on ZS 3D anomaly detection in the cross-dataset setting. Note that AUPRO could not be computed on RealAD-3D.
  • Figure 4: Framework of PointAD+. PointAD+ interprets 3D abnormality from two perspectives: implicit and explicit. For implicit 3D abnormality, CLIP's vision encoder extracts 2D global and local representations from the renderings, and the resulting 2D representations are projected into implicit point representations to capture point anomaly semantics (rendering layer). For explicit 3D abnormality, we propose G-aggregation, which obtains the explicit point representation by aggregating the implicit point representations and then incorporating geometric information (geometry layer). With both layers in place, cross-hierarchy alignment is further introduced to facilitate mutual learning across layers. Finally, hierarchical representation learning jointly optimizes the text embeddings to align explicit and implicit point representations with learnable rendering and geometry prompts, capturing generic anomaly patterns comprehensively.
  • ...and 15 more figures
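
The geometry layer in the Figure 4 caption aggregates implicit point representations and then incorporates geometric information. This excerpt does not give the formula for G-aggregation, so the sketch below substitutes a generic technique: a Gaussian-weighted average over each point's k nearest neighbors in 3D. The function name `knn_aggregate` and the parameters `k` and `sigma` are hypothetical and should not be read as the paper's definition.

```python
import numpy as np

def knn_aggregate(points, feats, k=8, sigma=0.05):
    """Distance-weighted kNN feature smoothing (stand-in for G-aggregation).

    points: (N, 3) point coordinates
    feats:  (N, D) implicit point features
    returns (N, D) features that mix each point with its 3D neighbors
    """
    num_points = points.shape[0]
    k = min(k, num_points)
    # Pairwise squared distances, O(N^2) memory: fine for a sketch, not for large clouds.
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)                              # (N, N)
    # Indices of the k closest points; each point is its own nearest neighbor.
    neighbor_idx = np.argsort(sq_dist, axis=1)[:, :k]               # (N, k)
    neighbor_sq_dist = np.take_along_axis(sq_dist, neighbor_idx, axis=1)
    # Gaussian weights: closer neighbors contribute more; each row sums to 1.
    weights = np.exp(-neighbor_sq_dist / (2.0 * sigma ** 2))
    weights = weights / weights.sum(axis=1, keepdims=True)
    return (weights[..., None] * feats[neighbor_idx]).sum(axis=1)   # (N, D)
```

Unlike the rendering layer, the output here depends on the spatial layout of the point cloud: two points with identical rendered features can end up with different aggregated features if their neighborhoods differ.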