DMP-3DAD: Cross-Category 3D Anomaly Detection via Realistic Depth Map Projection with Few Normal Samples

Zi Wang, Katsuya Hotta, Koichiro Kamide, Yawen Zou, Jianjian Qin, Chao Zhang, Jun Yu

TL;DR

This work tackles cross-category 3D anomaly detection under few-shot constraints, where only a handful of normal samples are available. It introduces DMP-3DAD, a training-free pipeline that converts point clouds into a fixed set of realistic depth maps across multiple views and extracts features with a frozen CLIP visual encoder. Anomaly scores are computed by view-weighted feature similarity between test samples and normal references, enabling category-agnostic detection without any training or prompts. On ShapeNetPart, DMP-3DAD achieves state-of-the-art mean AUROC across 1-, 3-, and 5-shot settings, demonstrating strong generalization and practical applicability for cross-category 3D anomaly detection.

Abstract

Cross-category anomaly detection for 3D point clouds aims to determine whether an unseen object belongs to a target category using only a few normal examples. Most existing methods rely on category-specific training, which limits their flexibility in few-shot scenarios. In this paper, we propose DMP-3DAD, a training-free framework for cross-category 3D anomaly detection based on multi-view realistic depth map projection. Specifically, our method converts each point cloud into a fixed set of realistic depth images, extracts multi-view representations with a frozen CLIP visual encoder, and performs anomaly detection via weighted feature similarity, requiring no fine-tuning or category-dependent adaptation. Extensive experiments on the ShapeNetPart dataset demonstrate that DMP-3DAD achieves state-of-the-art performance in few-shot settings. The results show that the proposed approach provides a simple yet effective solution for practical cross-category 3D anomaly detection.
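
The scoring step described above (per-view features compared against a few normal references, then combined with view weights) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the choice of cosine similarity, the maximum over references, and the default uniform weights are all assumptions made for illustration, and the feature arrays stand in for the outputs of a frozen CLIP visual encoder.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def anomaly_score(ref_feats, test_feats, view_weights=None):
    """Hypothetical view-weighted similarity score (an assumption, not the paper's formula).

    ref_feats:    (K, V, D) features of K normal references over V views.
    test_feats:   (V, D) features of the test sample over the same V views.
    view_weights: optional (V,) non-negative weights; uniform if omitted.
    Returns a float; larger means more anomalous.
    """
    ref = l2_normalize(np.asarray(ref_feats, dtype=np.float64))
    test = l2_normalize(np.asarray(test_feats, dtype=np.float64))
    n_views = ref.shape[1]

    # Cosine similarity between each test view and the same view of each reference.
    sims = np.einsum("kvd,vd->kv", ref, test)  # shape (K, V)

    # Assumed aggregation over references: keep the best-matching reference per view.
    best_per_view = sims.max(axis=0)  # shape (V,)

    if view_weights is None:
        weights = np.full(n_views, 1.0 / n_views)
    else:
        weights = np.asarray(view_weights, dtype=np.float64)
        weights = weights / weights.sum()

    # Weighted similarity is turned into a distance-like anomaly score.
    return float(1.0 - np.dot(weights, best_per_view))
```

With this definition, a test sample whose per-view features match one of the references scores near 0, and the score grows as the views become less similar to every reference. Because nothing is fitted, changing the target category only means swapping the reference features.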

Paper Structure

This paper contains 17 sections, 8 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Overview of the proposed DMP-3DAD. Given reference point clouds and a test point cloud, we generate realistic multi-view projections using a fixed 3D view grid. The projected images are encoded by the CLIP visual encoder only. View-wise weighted similarities between reference and test features are computed and aggregated to produce the final anomaly score.
  • Figure 2: Multi-view camera configurations for point cloud projection with different numbers of views (5, 10, 20, and 30). Camera frustums indicate both viewpoint locations and viewing directions toward the object center.
  • Figure 3: Category-wise AUC-ROC (%) with varying numbers of reference samples. Thin dashed curves correspond to individual object categories, while the bold solid curve shows the mean performance averaged over all categories.
  • Figure 4: Category-wise standard deviation with varying numbers of reference samples. Thin dashed curves correspond to individual object categories, while the bold solid curve shows the mean over all categories.
  • Figure 5: Failure cases of the proposed anomaly detection method. Green boxes indicate reference samples from the normal class, while red boxes denote anomalous samples from different classes that are incorrectly classified as normal.
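
The multi-view projection shown in Figures 1 and 2 can be approximated with a short sketch. The captions do not specify how the fixed view grid is built or how the realistic depth maps are rendered, so everything here is an assumption: viewpoints are placed with a Fibonacci lattice on a sphere, cameras look at the object center, and each depth image comes from a plain orthographic z-buffer. The names `fibonacci_views`, `look_at_rotation`, and `depth_map` are hypothetical.

```python
import numpy as np


def fibonacci_views(n_views):
    """Return (n_views, 3) unit vectors spread roughly evenly over a sphere."""
    i = np.arange(n_views) + 0.5
    z = 1.0 - 2.0 * i / n_views
    r = np.sqrt(1.0 - z * z)
    phi = i * np.pi * (3.0 - np.sqrt(5.0))  # golden-angle increment
    return np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)


def look_at_rotation(cam_dir):
    """Rows are the camera's right, up, and forward axes (forward points at the origin)."""
    forward = -cam_dir / np.linalg.norm(cam_dir)
    helper = np.array([0.0, 0.0, 1.0])
    if abs(forward @ helper) > 0.99:  # avoid a degenerate cross product near the poles
        helper = np.array([0.0, 1.0, 0.0])
    right = np.cross(helper, forward)
    right /= np.linalg.norm(right)
    up = np.cross(forward, right)
    return np.stack([right, up, forward])


def depth_map(points, cam_dir, size=64, cam_dist=2.0):
    """Orthographic depth image in [0, 1]: 0 is background, larger values are closer."""
    pts = points - points.mean(axis=0)
    pts = pts / np.linalg.norm(pts, axis=1).max()  # fit the object into the unit sphere
    cam = pts @ look_at_rotation(cam_dir).T        # (N, 3) in camera axes
    depth = cam[:, 2] + cam_dist                   # distance along the viewing axis

    # Map x/y in [-1, 1] to pixel indices.
    u = np.clip(np.round((cam[:, 0] + 1.0) / 2.0 * (size - 1)).astype(int), 0, size - 1)
    v = np.clip(np.round((cam[:, 1] + 1.0) / 2.0 * (size - 1)).astype(int), 0, size - 1)

    # Z-buffer: keep the nearest point that lands on each pixel.
    zbuf = np.full((size, size), np.inf)
    np.minimum.at(zbuf, (v, u), depth)

    near, far = cam_dist - 1.0, cam_dist + 1.0
    hit = np.isfinite(zbuf)
    closeness = np.zeros((size, size))
    closeness[hit] = 1.0 - (zbuf[hit] - near) / (far - near)
    return closeness


# Usage: render a random point cloud from 10 viewpoints.
cloud = np.random.default_rng(0).normal(size=(2000, 3))
views = fibonacci_views(10)
images = np.stack([depth_map(cloud, d, size=32) for d in views])  # (10, 32, 32)
```

Each image would then be resized and replicated to three channels before being passed to an image encoder. A single-pixel splat like this leaves holes between points, which is one reason a more realistic rendering step, as the paper's title suggests, may matter for feature quality.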