AnoStyler: Text-Driven Localized Anomaly Generation via Lightweight Style Transfer

Yulim So, Seokho Kang

TL;DR

This work tackles the scarcity and realism gap of anomaly data with AnoStyler, a zero-shot approach that reframes anomaly generation as text-guided style transfer on a single normal image. The lightweight pipeline combines shape-guided mask generation (Meta-Shape Priors), two-class prompt design, and text-driven stylization by a compact U-Net that is guided by frozen CLIP encoders and optimized with mask-aware losses. On MVTec-AD and VisA, the method reports state-of-the-art zero-shot anomaly generation and improved downstream anomaly detection, at a significantly lower computational cost than diffusion-based baselines. In practice, AnoStyler offers a scalable, data-efficient way to synthesize realistic, semantically grounded anomalies for industrial defect detection without large labeled anomaly sets.

Abstract

Anomaly generation has been widely explored to address the scarcity of anomaly images in real-world data. However, existing methods typically suffer from at least one of the following limitations, hindering their practical deployment: (1) lack of visual realism in generated anomalies; (2) dependence on large amounts of real images; and (3) use of memory-intensive, heavyweight model architectures. To overcome these limitations, we propose AnoStyler, a lightweight yet effective method that frames zero-shot anomaly generation as text-guided style transfer. Given a single normal image along with its category label and expected defect type, an anomaly mask indicating the localized anomaly regions and two-class text prompts representing the normal and anomaly states are generated using generalizable category-agnostic procedures. A lightweight U-Net model trained with CLIP-based loss functions is used to stylize the normal image into a visually realistic anomaly image, where anomalies are localized by the anomaly mask and semantically aligned with the text prompts. Extensive experiments on the MVTec-AD and VisA datasets show that AnoStyler outperforms existing anomaly generation methods in generating high-quality and diverse anomaly images. Furthermore, using these generated anomalies helps enhance anomaly detection performance.
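
The two-class prompt step described in the abstract can be illustrated with a minimal sketch. The template strings and the function name `build_prompt_sets` below are hypothetical placeholders chosen for illustration; the paper's actual templates are not reproduced here. The sketch only shows the general idea of filling structured templates with a category label and a defect type to obtain a normal and an anomaly prompt set.

```python
# Hypothetical templates (assumptions for illustration only);
# the paper's real prompt templates may differ.
NORMAL_TEMPLATES = [
    "a photo of a flawless {c}",
    "a photo of a {c} without defects",
]
ANOMALY_TEMPLATES = [
    "a photo of a {c} with a {d}",
    "a photo of a damaged {c} with a {d}",
]


def build_prompt_sets(category, defect):
    """Fill the templates with a category-defect pair.

    Returns (normal_prompts, anomaly_prompts), two lists of strings.
    """
    normal = [t.format(c=category) for t in NORMAL_TEMPLATES]
    anomaly = [t.format(c=category, d=defect) for t in ANOMALY_TEMPLATES]
    return normal, anomaly


normal_prompts, anomaly_prompts = build_prompt_sets("wood", "scratch")
```

Keeping the templates category-agnostic, as in this sketch, means the same two lists can be reused for any category-defect pair without per-category tuning.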

Paper Structure

This paper contains 31 sections, 12 equations, 10 figures, 10 tables, 3 algorithms.

Figures (10)

  • Figure 1: Examples of anomaly images generated by AnoStyler. Given a normal image and a category-defect pair $(\texttt{[c]},\texttt{[d]})$, our method generates visually realistic and semantically aligned anomalies.
  • Figure 2: Overall framework of AnoStyler. (1) Shape-Guided Mask Generation: A union of primitive masks $\mathbf{M}_1,\ldots,\mathbf{M}_m$ from Meta-Shape Priors is intersected with the foreground mask $\mathbf{M}^{fg}$ to obtain the anomaly mask $\mathbf{M}^a$. (2) Two-Class Prompt Generation: Structured text prompt templates are filled with the category-defect pair $(\texttt{[c]},\texttt{[d]})$ to form normal and anomaly prompt sets $\mathcal{T}^n$ and $\mathcal{T}^a$. (3) Text-Driven Localized Anomaly Generation: Guided by $\mathbf{M}^a$, $\mathcal{T}^n$, and $\mathcal{T}^a$, the stylization network $\mathcal{F}$ is trained to stylize the masked regions of the input image $\mathbf{I}^n$ as anomalies, resulting in the synthetic anomaly image $\mathbf{I}^a$.
  • Figure 3: Comparison of generated anomaly images and their corresponding anomaly masks on MVTec-AD and VisA. AnoStyler generates visually realistic anomalies that align well with the masks.
  • Figure 4: Qualitative results corresponding to the loss configurations presented in the loss ablation table. The first and second rows use category labels $\texttt{[c]} = \texttt{"wood"}$ and $\texttt{"chewinggum"}$, respectively, with defect type $\texttt{[d]} = \texttt{"scratch"}$.
  • Figure 5: Example masks generated by the proposed Meta-Shape Priors. Rows 1–3, 4–6, and 7–9 correspond to Line, Dot, and Freeform types, respectively.
  • ...and 5 more figures
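
The mask composition stated in the Figure 2 caption, a union of primitive masks intersected with the foreground mask, can be sketched in a few lines. This is a toy illustration, not the authors' implementation: masks are represented as sets of `(row, col)` pixel coordinates, and the helper `rect` and the function `compose_anomaly_mask` are hypothetical names. How the primitives are sampled from Meta-Shape Priors and how the foreground mask is obtained are not covered here.

```python
def rect(top, left, height, width):
    """Toy primitive: the set of (row, col) pixels in an axis-aligned rectangle."""
    return {
        (r, c)
        for r in range(top, top + height)
        for c in range(left, left + width)
    }


def compose_anomaly_mask(primitives, foreground):
    """Union the primitive masks M_1..M_m, then intersect with the foreground M^fg."""
    union = set().union(*primitives)  # empty set if no primitives are given
    return union & foreground         # keep only pixels that lie on the object


# 4x4 foreground; the second primitive hangs partly outside it.
foreground = rect(0, 0, 4, 4)
primitives = [rect(1, 1, 2, 2), rect(3, 3, 2, 2)]
anomaly_mask = compose_anomaly_mask(primitives, foreground)
```

The intersection step is what keeps generated anomalies on the object: any part of a primitive that falls on the background is discarded.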