AnomalyGPT: Detecting Industrial Anomalies Using Large Vision-Language Models

Zhaopeng Gu; Bingke Zhu; Guibo Zhu; Yingying Chen; Ming Tang; Jinqiao Wang

TL;DR

AnomalyGPT is a threshold-free, conversational LVLM-based framework for industrial anomaly detection (IAD) that detects whether an anomaly is present, localizes it at the pixel level, and supports multi-turn dialogue. It combines a lightweight image decoder with a prompt learner that feeds anomaly-aware prompt embeddings into an LVLM, and it is trained on simulated anomalous images synthesized with Poisson image editing. The method reports strong detection and localization results in unsupervised and few-shot settings on MVTec-AD and VisA, and it adapts to unseen categories from a handful of normal samples. For real-world IAD, the practical gain is that no manual threshold is needed and inspection becomes interactive.

Abstract

Large Vision-Language Models (LVLMs) such as MiniGPT-4 and LLaVA have demonstrated the capability of understanding images and achieved remarkable performance in various visual tasks. Despite their strong abilities in recognizing common objects due to extensive training datasets, they lack specific domain knowledge and have a weaker understanding of localized details within objects, which hinders their effectiveness in the Industrial Anomaly Detection (IAD) task. On the other hand, most existing IAD methods only provide anomaly scores and necessitate the manual setting of thresholds to distinguish between normal and abnormal samples, which restricts their practical implementation. In this paper, we explore the utilization of LVLMs to address the IAD problem and propose AnomalyGPT, a novel IAD approach based on LVLMs. We generate training data by simulating anomalous images and producing corresponding textual descriptions for each image. We also employ an image decoder to provide fine-grained semantics and design a prompt learner to fine-tune the LVLM using prompt embeddings. AnomalyGPT eliminates the need for manual threshold adjustments and thus directly assesses the presence and locations of anomalies. Additionally, AnomalyGPT supports multi-turn dialogues and exhibits impressive few-shot in-context learning capabilities. With only one normal shot, AnomalyGPT achieves state-of-the-art performance with an accuracy of 86.1%, an image-level AUC of 94.1%, and a pixel-level AUC of 95.3% on the MVTec-AD dataset. Code is available at https://github.com/CASIA-IVA-Lab/AnomalyGPT.

Paper Structure (16 sections, 6 equations, 14 figures, 7 tables)

This paper contains 16 sections, 6 equations, 14 figures, 7 tables.

Introduction
Related Work
Method
Model Architecture
Decoder and Prompt Learner
Data for Image-Text Alignment
Loss Functions
Experiments
Quantitative Results
Qualitative Examples
Ablation Studies
Conclusion
More Experimental Results of Existing IAD Methods
Normal and Abnormal Texts
Detailed Image Description
...and 1 more section

Figures (14)

Figure 1: Comparison between our AnomalyGPT, existing IAD methods, and existing LVLMs. Existing IAD methods provide only anomaly scores and require manual threshold setting, while existing LVLMs cannot detect anomalies in the image. AnomalyGPT not only provides information about the image but also indicates the presence and location of anomalies.
Figure 2: The architecture of AnomalyGPT. The query image is passed to the frozen image encoder, and the patch-level features extracted from intermediate layers are fed into the image decoder, which computes their similarity with normal and abnormal texts to obtain the localization result. The final features extracted by the image encoder are fed to a linear layer and then passed to the prompt learner along with the localization result. The prompt learner converts them into prompt embeddings suitable for input into the LLM together with the user's text input. In the few-shot setting, the patch-level features from normal samples are stored in memory banks, and the localization result is obtained by calculating the distance between query patches and their most similar counterparts in the memory bank.
Figure 3: Comparison between cut-paste and Poisson image editing. Cut-paste results exhibit evident discontinuities, whereas Poisson image editing results look more natural.
Figure 4: Illustration of the $3\times 3$ image grid, which lets the LLM verbally indicate the position of the anomaly.
Figure 5: Qualitative example of AnomalyGPT in the unsupervised setting. AnomalyGPT is capable of detecting the anomaly, pinpointing its location, providing pixel-level localization results, and answering questions about the image.
...and 9 more figures
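
The few-shot localization described in the Figure 2 caption and the grid described in the Figure 4 caption can be illustrated with a small sketch. It is not the authors' implementation. It assumes cosine distance as the patch distance (the caption only says "distance"), and the function names and the nine position labels are hypothetical choices made for illustration. Each query patch is scored by its distance to the most similar normal patch in the memory bank, and the highest-scoring patch is then mapped to a cell of a 3x3 grid.

```python
import math

# Hypothetical labels for the 3x3 grid; the captions do not list the exact wording.
GRID_NAMES = [
    ["top left", "top", "top right"],
    ["left", "center", "right"],
    ["bottom left", "bottom", "bottom right"],
]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def memory_bank_scores(query_patches, memory_bank):
    """Score each query patch as 1 minus its highest cosine similarity
    to any normal patch stored in the memory bank (assumed distance)."""
    return [
        1.0 - max(cosine_similarity(q, m) for m in memory_bank)
        for q in query_patches
    ]


def grid_cell(row, col, height, width):
    """Map a patch position in a height x width patch grid to a 3x3 cell label."""
    r = min(row * 3 // height, 2)
    c = min(col * 3 // width, 2)
    return GRID_NAMES[r][c]


def locate_anomaly(scores, height, width):
    """Return the 3x3 cell label of the highest-scoring patch.
    `scores` is a flat, row-major list of height * width patch scores."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    row, col = divmod(best, width)
    return grid_cell(row, col, height, width)


if __name__ == "__main__":
    # Toy data: two normal patches pointing along the first axis.
    bank = [[1.0, 0.0], [2.0, 0.0]]
    # A 3x3 patch grid where only the last patch differs from the normal ones.
    query = [[1.0, 0.0]] * 8 + [[0.0, 1.0]]
    scores = memory_bank_scores(query, bank)
    print(locate_anomaly(scores, 3, 3))
```

In this toy setup only the last patch is orthogonal to everything in the memory bank, so it receives the highest score and is reported in the bottom-right cell. A real system would compute features with the frozen image encoder and upsample the patch scores to a pixel-level map.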
