Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation

Mingyu Lee, Jongwon Choi

TL;DR

The paper tackles data scarcity in industrial anomaly detection by introducing a text-guided variational image generation framework that synthesizes non-defective images aligned with textual and visual priors. It combines a keyword-to-prompt generator, a variance-aware extension of VQGAN, and a text-guided knowledge integrator to produce diverse, status-consistent non-defective images that preserve the variance of the original data. Across MVTecAD, BTAD, and MVTec LOCO AD, the approach yields substantial improvements in detection and segmentation, especially in one-shot and few-shot settings, and generalizes across multiple baselines. The work highlights the importance of modeling latent variance and semantic alignment in data augmentation for robust anomaly detection with limited real non-defective data, offering a practical path for industrial deployments.

Abstract

We propose a text-guided variational image generation method to address the challenge of getting clean data for anomaly detection in industrial manufacturing. Our method utilizes text information about the target object, learned from extensive text library documents, to generate non-defective data images resembling the input image. The proposed framework ensures that the generated non-defective images align with anticipated distributions derived from textual and image-based knowledge, ensuring stability and generality. Experimental results demonstrate the effectiveness of our approach, surpassing previous methods even with limited non-defective data. Our approach is validated through generalization tests across four baseline models and three distinct datasets. We present an additional analysis to enhance the effectiveness of anomaly detection models by utilizing the generated images.

Paper Structure

This paper contains 32 sections, 7 equations, 18 figures, and 18 tables.

Figures (18)

  • Figure 1: Comparison with state-of-the-art baselines. Our method generates non-defective images using a text-guided variational image generation method and uses the generated images as additional training data for anomaly detection. It outperforms state-of-the-art methods in the one-shot, few-shot (5 images), and full-shot training settings. The comparison uses the metal-nut class of the MVTecAD dataset (ref:mvtec).
  • Figure 2: Correlation of hypothesis. We repeat the tests with different images to confirm our hypothesis. Performance improvement is strongly tied to the similarity between the generated and original images, and greater visual variance among the generated images also improves performance.
  • Figure 3: Generated images in the preliminary experiment. (a) The original image of a hazelnut in the MVTecAD dataset. (b) The image retrieved from the web with the keyword 'hazelnut'. (c),(d) The images generated using Midjourney and DALL-E, respectively, with a caption of the original image as the prompt. (e),(f),(g) The images generated using the VQGAN-CLIP model based on 'hazelnut', 'A photo of a hazelnut', and the caption of the original image, respectively. (h) The image generated by our method.
  • Figure 4: Overview of our framework. Our framework comprises a keyword-to-prompt generator, a variance-aware image generator, and a text-guided knowledge integrator. The keyword-to-prompt generator creates prompts from the input keywords and selects the one that best matches the input image. The variance-aware image generator creates non-defective images, encoding their visual features into a normal distribution to maintain variance. The process is updated iteratively, and the text-guided knowledge integrator selects the optimal images by comparing the similarity of their latent distributions to the text prompts.
  • Figure 5: Generalization test for anomaly detection on the MVTecAD dataset. The first row shows the average score improvement across different baselines and varying numbers of non-defective images. The second row presents the average score for the five classes with the largest improvement.
  • ...and 13 more figures
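
The two mechanisms named in the Figure 4 caption (choosing the prompt that best matches the input image, and encoding features into a normal distribution) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes cosine similarity as the matching score and the standard reparameterization trick for sampling from a Gaussian latent. The function names, the log-variance parameterization, and the toy embedding values are all hypothetical; a real system would obtain embeddings from a text-image encoder.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def select_best_prompt(image_embedding, prompt_embeddings):
    """Return the prompt whose embedding is most similar to the image embedding.

    Assumption: similarity is measured by cosine similarity.
    """
    return max(
        prompt_embeddings,
        key=lambda prompt: cosine_similarity(image_embedding, prompt_embeddings[prompt]),
    )

def sample_latent(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1) (reparameterization trick).

    Assumption: the latent is a diagonal Gaussian parameterized by mean and log-variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]

# Toy, made-up embeddings used only to exercise the functions.
image_embedding = [0.9, 0.1, 0.2]
prompt_embeddings = {
    "a photo of a hazelnut": [0.8, 0.2, 0.1],
    "a photo of a metal nut": [0.1, 0.9, 0.3],
}
best_prompt = select_best_prompt(image_embedding, prompt_embeddings)

rng = random.Random(0)  # fixed seed so repeated runs draw the same sample
latent = sample_latent(mean=[0.0, 1.0], log_var=[0.0, -2.0], rng=rng)
```

Sampling from a distribution rather than using a single fixed code is what lets repeated draws differ from one another, which is the sense in which the caption speaks of maintaining variance.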