VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion

Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada

TL;DR

VLMDiff tackles unsupervised multi-class visual anomaly detection by conditioning a latent diffusion model on detailed captions produced by a Vision-Language Model. It avoids per-class training and synthetic anomaly generation by using VLM-derived descriptions as conditioning signals during training, guiding the reconstruction of normal images across diverse categories. The approach improves pixel-level localization on Real-IAD and COCO-AD, outperforming diffusion-based baselines by up to 25 and 8 PRO points, respectively, and it handles real-world industrial data with a single model per dataset. Its practical value lies in scalable anomaly detection for complex scenes without manual annotations.

Abstract

Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, which serve as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, which limits their generalization, and they require per-class model training, which hinders scalability. VLMDiff, in contrast, leverages VLMs to obtain captions of normal images without manual annotations or additional training. These descriptions condition the diffusion model, which learns a robust feature representation of normal images for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.
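
Since the headline numbers are reported in PRO points, a small illustration of what the metric measures may help. The sketch below computes Per-Region-Overlap at one fixed threshold: each connected ground-truth anomaly region contributes the fraction of its pixels flagged by the prediction, and the fractions are averaged so that small and large defects count equally. The function names, the use of 4-connectivity, and the single threshold are assumptions made for illustration; benchmarks commonly report an area under the PRO curve over many thresholds, and this is not the paper's evaluation code.

```python
from collections import deque


def connected_regions(mask):
    """Return the 4-connected regions of truthy cells as lists of (row, col)."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited anomaly pixel.
            queue, region = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def pro_at_threshold(score_map, gt_mask, threshold):
    """Mean per-region overlap at one threshold (hypothetical helper).

    Returns None when the ground-truth mask contains no anomaly region.
    """
    regions = connected_regions(gt_mask)
    if not regions:
        return None
    overlaps = [
        sum(score_map[y][x] >= threshold for y, x in region) / len(region)
        for region in regions
    ]
    return sum(overlaps) / len(overlaps)


# Toy data: one two-pixel defect (half detected) and one one-pixel defect (detected).
gt = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
scores = [[0.9, 0.2, 0.1, 0.1], [0.1, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.8]]
print(pro_at_threshold(scores, gt, 0.5))
```

Averaging per region rather than per pixel is the design choice that distinguishes PRO from a plain pixel-level recall: a method that finds only the largest defect cannot score well.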

Paper Structure

This paper contains 17 sections, 2 equations, 13 figures, 16 tables.

Figures (13)

  • Figure 1: Comparison of (a) current diffusion-based approaches and (b) our approach. Our method extracts textual descriptions from VLMs to guide the training of the underlying diffusion model.
  • Figure 2: The processing pipeline of VLMDiff. During training (top), a normal (i.e. anomaly-free) image is fed to both 1) an off-the-shelf VLM (on the right), which extracts a detailed description of the object using the query ($\mathcal{P}_D$); the description is then encoded into a condition vector $c$ by a text encoder, and 2) a pretrained image encoder (note that we finetune the image autoencoder using only normal images beforehand), which produces the latent vector. Then, a diffusion process adds noise to the latent vector and learns to denoise it under the guidance of the condition vector. During inference (bottom), the same process as in training is followed, except that no text description is used to condition the diffusion model. The denoised latent code is fed to the pretrained image decoder to get the reconstructed image. Anomaly segmentation is done by finding the dissimilar locations on the feature maps of the input and reconstructed normal images.
  • Figure 3: Example anomalous images from the Real-IAD dataset, and predicted anomaly segmentation maps using VLMDiff.
  • Figure 4: Visual comparison of diffusion-based methods on the Real-IAD dataset. DiAD detects the anomaly locations at the expense of many false positives.
  • Figure 5: Visual comparison of diffusion-based methods on the COCO-AD dataset.
  • ...and 8 more figures
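
The final step described in the Figure 2 caption, locating positions where the feature maps of the input and its reconstruction disagree, can be sketched as follows. This is a minimal illustration under stated assumptions: the caption does not name the dissimilarity measure or the feature extractor, so cosine distance, the function names, and the fixed threshold are all hypothetical choices, and the feature maps are taken as given H x W x C nested lists.

```python
import math


def cosine_distance(u, v, eps=1e-12):
    """1 - cosine similarity of two feature vectors.

    eps avoids division by zero; an all-zero vector then yields distance 1.0.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def anomaly_map(feats_input, feats_recon):
    """Per-location dissimilarity between two H x W x C feature maps.

    Assumption: cosine distance stands in for the unspecified measure.
    """
    return [
        [cosine_distance(f_in, f_rec) for f_in, f_rec in zip(row_in, row_rec)]
        for row_in, row_rec in zip(feats_input, feats_recon)
    ]


def segment(score_map, threshold):
    """Binarize the anomaly map; the threshold is a free parameter here."""
    return [[1 if s >= threshold else 0 for s in row] for row in score_map]


# Toy 1 x 2 feature maps with 2 channels: the second location differs.
feats_input = [[[1.0, 0.0], [0.0, 1.0]]]
feats_recon = [[[1.0, 0.0], [1.0, 0.0]]]
scores = anomaly_map(feats_input, feats_recon)
print(segment(scores, 0.5))
```

Because the diffusion model is trained only on normal images, an anomalous region tends to be reconstructed as its normal counterpart, so its features drift away from those of the input and the distance at that location grows. A practical pipeline would typically also upsample the map to image resolution and may combine several feature scales; those steps are omitted here.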