Anomagic: Crossmodal Prompt-driven Zero-shot Anomaly Generation

Yuxin Jiang, Wei Luo, Hui Zhang, Qiyu Chen, Haiming Yao, Weiming Shen, Yunkang Cao

TL;DR

Anomagic addresses zero-shot anomaly generation by unifying visual and textual cues through crossmodal prompts and region-aware CLIP guidance, enabling targeted inpainting-based synthesis of realistic anomalies. A contrastive anomaly mask refinement step improves alignment between synthesized regions and their masks. The AnomVerse dataset of 12,987 anomaly–mask–caption triplets supports training and zero-shot generalization; reported results show more realistic anomalies and improved downstream anomaly detection on VisA and MVTec AD. The framework supports user-defined prompts and crossmodal control, positioning Anomagic as a versatile foundation model for anomaly generation.

Abstract

We propose Anomagic, a zero-shot anomaly generation method that produces semantically coherent anomalies without requiring any exemplar anomalies. By unifying both visual and textual cues through a crossmodal prompt encoding scheme, Anomagic leverages rich contextual information to steer an inpainting-based generation pipeline. A subsequent contrastive refinement strategy enforces precise alignment between synthesized anomalies and their masks, thereby bolstering downstream anomaly detection accuracy. To facilitate training, we introduce AnomVerse, a collection of 12,987 anomaly-mask-caption triplets assembled from 13 publicly available datasets, where captions are automatically generated by multimodal large language models using structured visual prompts and template-based textual hints. Extensive experiments demonstrate that Anomagic trained on AnomVerse can synthesize more realistic and varied anomalies than prior methods, yielding superior improvements in downstream anomaly detection. Furthermore, Anomagic can generate anomalies for any normal-category image using user-defined prompts, establishing a versatile foundation model for anomaly generation.

Paper Structure

This paper contains 20 sections, 5 equations, 6 figures, 3 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overview of AnomVerse. (a) The construction pipeline for generating anomaly–mask–caption triplets. (b) The distribution across domains and the top 15 most frequent anomaly categories.
  • Figure 2: Overall framework of Anomagic. Our method employs a crossmodal prompt encoding to extract conditions from selected anomaly–mask–caption triplets in AnomVerse, which are then used to guide the inpainting process. During testing, a contrastive anomaly mask refinement module is introduced to further enhance the accuracy of the predicted anomaly masks.
  • Figure 3: Qualitative comparison of anomaly generation performance. Anomalies are highlighted with red circles. Unlike existing zero-shot methods (DRAEM and RealNet) and few-shot methods (AnoGen), Anomagic uniquely achieves both visually realistic anomaly synthesis and accurate anomaly mask generation.
  • Figure 4: t-SNE visualization of marginal group distributions for "candle" and "pcb3" objects from VisA.
  • Figure 5: Visualization of anomaly generation results under unimodal and crossmodal prompts. Our method effectively synthesizes realistic anomalies in both settings, with Anomagic-Cross producing notably superior results.
  • ...and 1 more figure