PatchEAD: Unifying Industrial Visual Prompting Frameworks for Patch-Exclusive Anomaly Detection

Po-Han Huang, Jeng-Lin Li, Po-Hsuan Huang, Ming-Ching Chang, Wei-Chao Chen

TL;DR

This work tackles industrial anomaly detection by removing reliance on textual prompts and multi-modal cues. It introduces PatchEAD, a patch-centric, training-free framework that leverages a frozen vision encoder to produce patch-level embeddings and a cross-patch scoring scheme that detects anomalies through within-batch or few-shot normal references. The approach demonstrates strong performance across seven industrial datasets in both few-shot and batch zero-shot settings, and gains further improvements with PatchEAD$^+$ through alignment and attention-based masking. The results indicate that a unified, vision-only pipeline with backbone-agnostic patch similarity enables rapid, calibration-light deployment in dynamic industrial environments, with practical implications for robust visual inspection.

Abstract

Industrial anomaly detection is increasingly relying on foundation models, aiming for strong out-of-distribution generalization and rapid adaptation in real-world deployments. Notably, past studies have primarily focused on textual prompt tuning, leaving the intrinsic visual counterpart fragmented into processing steps specific to each foundation model. We aim to address this limitation by proposing a unified patch-focused framework, Patch-Exclusive Anomaly Detection (PatchEAD), enabling training-free anomaly detection that is compatible with diverse foundation models. The framework constructs visual prompting techniques, including an alignment module and foreground masking. Our experiments show superior few-shot and batch zero-shot performance compared to prior work, despite the absence of textual features. Our study further examines how backbone structure and pretrained characteristics affect patch-similarity robustness, providing actionable guidance for selecting and configuring foundation models for real-world visual inspection. These results confirm that a well-unified patch-only framework can enable quick, calibration-light deployment without the need for carefully engineered textual prompts.

Paper Structure

This paper contains 15 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Comparison of patch-only (PatchEAD) and multi-modality industrial anomaly detection methods. We investigate whether a purely visual approach can achieve performance competitive with multi-modality methods, and whether further optimization of vision prompts can enhance effectiveness.
  • Figure 2: The PatchEAD framework in few-shot and batch zero-shot settings. In few-shot, normal prompt and query images are passed through the backbone to extract patch embeddings, which are flattened to compute patch-wise cosine similarities. The highest patch anomaly score is used as the image-level score. In batch zero-shot, each image is compared to the rest of the batch using leave-one-out similarity, with its image-level anomaly score computed as the average of these scores. The version with optional alignment and masking modules is denoted as PatchEAD$^+$.
  • Figure 3: Visualization of attention masks from different backbones across normal and abnormal images, demonstrating the ability to down-weight noisy regions.
  • Figure 4: Ablation study on the effect of the number of few-shot images in PatchEAD on image-level AUC for MVTec and VisA datasets.
  • Figure 5: Visualization of anomaly heatmaps generated by PatchEAD and PatchEAD$^+$ across seven industrial datasets, showing clearer and more precise defect outlines with reduced false positives near normal regions.
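
The scoring described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions: patch embeddings are taken as already extracted by a frozen backbone, a patch's anomaly score is assumed to be one minus its highest cosine similarity to any reference patch, and the batch zero-shot average is assumed to run over per-reference image scores. The function names (`patch_anomaly_scores`, `few_shot_image_score`, `batch_zero_shot_image_scores`) are hypothetical, and the optional alignment and masking modules of PatchEAD$^+$ are omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def patch_anomaly_scores(query, reference):
    """Per-patch anomaly scores for one image.

    query:     (Nq, D) patch embeddings of the image under test
    reference: (Nr, D) patch embeddings treated as normal
    Assumption: score = 1 - max cosine similarity to any reference patch.
    """
    sim = l2_normalize(query) @ l2_normalize(reference).T  # (Nq, Nr)
    return 1.0 - sim.max(axis=1)


def few_shot_image_score(query, reference_images):
    """Few-shot: flatten K normal images into one patch pool, then take the
    highest patch anomaly score as the image-level score."""
    pool = reference_images.reshape(-1, reference_images.shape[-1])  # (K*N, D)
    return float(patch_anomaly_scores(query, pool).max())


def batch_zero_shot_image_scores(batch):
    """Batch zero-shot: leave-one-out comparison inside a batch of shape (B, N, D).

    Assumption: each image is scored against every other image separately,
    and those image-level scores are averaged.
    """
    num_images = batch.shape[0]
    scores = np.zeros(num_images)
    for i in range(num_images):
        per_reference = [
            patch_anomaly_scores(batch[i], batch[j]).max()
            for j in range(num_images)
            if j != i
        ]
        scores[i] = np.mean(per_reference)
    return scores


# Toy data: 8 orthonormal "patches"; the anomalous image flips patch 3.
normal = np.eye(8)
anomalous = normal.copy()
anomalous[3] = -anomalous[3]

references = np.stack([normal, normal])                    # K = 2 normal prompts
batch = np.stack([normal, normal, normal, anomalous])      # B = 4 unlabeled images

few_shot_normal = few_shot_image_score(normal, references)
few_shot_anomalous = few_shot_image_score(anomalous, references)
batch_scores = batch_zero_shot_image_scores(batch)
```

In this toy batch the flipped image receives the highest leave-one-out score, but the normal images also score above zero because the anomalous image serves as one of their references. This illustrates why the batch zero-shot setting depends on most images in the batch being normal.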