GPT-4V-AD: Exploring Grounding Potential of VQA-oriented GPT-4V for Zero-shot Anomaly Detection

Jiangning Zhang, Haoyang He, Xuhai Chen, Zhucun Xue, Yabiao Wang, Chengjie Wang, Lei Xie, Yong Liu

TL;DR

The paper explores zero-shot visual anomaly detection using GPT-4V by formulating anomaly localization as a VQA-grounded task. It introduces the GPT-4V-AD framework with three components (Granular Region Division, Prompt Designing, and Text2Segmentation) to translate image regions into anomaly scores and segmentation maps. On MVTec AD and VisA, it reaches image-/pixel-level AU-ROCs of 77.1/68.0 and 88.0/76.6, respectively, scoring higher on VisA, though gaps remain compared to state-of-the-art CLIP-based methods such as WinCLIP and CLIP-AD. The work establishes a baseline for VQA-oriented LMMs in zero-shot AD and outlines directions for improving grounding accuracy, stability, and efficiency in industrial anomaly detection tasks.

Abstract

The Large Multimodal Model (LMM) GPT-4V(ision) endows GPT-4 with visual grounding capabilities, making it possible to handle certain tasks through the Visual Question Answering (VQA) paradigm. This paper explores the potential of VQA-oriented GPT-4V for visual Anomaly Detection (AD), a recently popular task, and is the first to conduct qualitative and quantitative evaluations on the popular MVTec AD and VisA datasets. Because the task requires both image-level and pixel-level evaluation, the proposed GPT-4V-AD framework contains three components: 1) Granular Region Division, 2) Prompt Designing, and 3) Text2Segmentation for easy quantitative evaluation; several variants are also tried for comparative analysis. The results show that GPT-4V can achieve non-trivial results on the zero-shot AD task through the VQA paradigm, reaching image-level AU-ROCs of 77.1/88.0 and pixel-level AU-ROCs of 68.0/76.6 on the MVTec AD and VisA datasets, respectively. However, its performance still lags behind state-of-the-art zero-shot methods, e.g., WinCLIP and CLIP-AD, and further research is needed. This study provides a baseline reference for research on VQA-oriented LMMs in the zero-shot AD task, and we also list several possible directions for future work. Code is available at https://github.com/zhangzjn/GPT-4V-AD.
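The image-level and pixel-level AU-ROC figures quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names are hypothetical, and taking the maximum of the anomaly map as the image-level score is an assumed convention, not one confirmed by the text.

```python
import numpy as np

def au_roc(labels, scores):
    """AU-ROC via the Mann-Whitney U statistic (tied scores get average ranks)."""
    labels = np.asarray(labels).ravel().astype(bool)
    scores = np.asarray(scores, dtype=float).ravel()
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AU-ROC needs both normal and anomalous samples")
    # 1-based average rank of every score, so ties count as half a win.
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    end_rank = np.cumsum(counts)              # last rank inside each tie group
    avg_rank = end_rank - (counts - 1) / 2.0  # mean rank of the tie group
    ranks = avg_rank[inverse]
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

def pixel_level_au_roc(gt_masks, anomaly_maps):
    """Pool every pixel of every image into one ranking problem."""
    y = np.concatenate([np.asarray(m).ravel() for m in gt_masks])
    s = np.concatenate([np.asarray(a).ravel() for a in anomaly_maps])
    return au_roc(y, s)

def image_level_au_roc(image_labels, anomaly_maps):
    """Assumption: an image's score is the maximum of its anomaly map."""
    s = [float(np.max(a)) for a in anomaly_maps]
    return au_roc(image_labels, s)
```

The rank-based form avoids sweeping thresholds explicitly. A value of 0.5 corresponds to chance, so a reported AU-ROC of 77.1 is 0.771 on this scale.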

Paper Structure (13 sections, 8 figures, 2 tables)

Figures (8)

  • Figure 1: Overview of the proposed GPT-4V-AD framework, which consists of three procedures in tandem: 1) Granular Region Division preprocesses the input image $\bm{I}_{i}$, treating pixels that are similar at the structural or semantic level as a common region, resulting in $\bm{M}_{i}$. This is then combined with $\bm{I}_{i}$ through pixel-wise fusion to obtain the region-divided $\bm{I}_{f}$. 2) Prompt Designing designs a suitable prompt $\bm{T}_{i}$ for the AD task in conjunction with $\bm{I}_{f}$, which is then input into GPT-4V to obtain a formatted output $\bm{T}_{o}$. 3) Text2Segmentation combines $\bm{T}_{o}$ with the regions $\bm{M}_{i}$ to parse out the pixel-level anomaly segmentation result $\bm{A}_{o}$.
  • Figure 2: A toy experiment with the raw image as input and an anomaly bounding box (expressed as percentage coordinates) as output for GPT-4V. This approach leads to uncontrollable and imprecise outputs, and it is challenging to obtain pixel-level segmentation results.
  • Figure 3: Ablation study on region division methods, i.e., naive grid, semantic SAM, and structural super-pixel.
  • Figure 4: Non-cherry-picked qualitative results for each category on the MVTec AD (left) and VisA (right) datasets.
  • Figure 5: Qualitative comparison of results for different defect categories of the hazelnut object on the MVTec AD dataset.
  • ...and 3 more figures
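
The region-division and Text2Segmentation steps described in the Figure 1 caption can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it uses the naive grid division named in Figure 3, and the "id: score" text format assumed for the GPT-4V output T_o is hypothetical, as are all function names.

```python
import re
import numpy as np

def grid_region_map(height, width, rows, cols):
    """Label map M_i: each pixel holds the 1-based id of its grid cell."""
    row_idx = np.minimum(np.arange(height) * rows // height, rows - 1)
    col_idx = np.minimum(np.arange(width) * cols // width, cols - 1)
    return row_idx[:, None] * cols + col_idx[None, :] + 1

def parse_region_scores(text):
    """Parse hypothetical 'id: score' pairs, e.g. '2: 0.9, 3: 0.3'."""
    pairs = re.findall(r"(\d+)\s*:\s*([01](?:\.\d+)?)", text)
    return {int(region_id): float(score) for region_id, score in pairs}

def text_to_segmentation(region_map, region_scores):
    """Anomaly map A_o: paint each listed region with its score, others stay 0."""
    anomaly_map = np.zeros(region_map.shape, dtype=float)
    for region_id, score in region_scores.items():
        anomaly_map[region_map == region_id] = score
    return anomaly_map

# Usage: a 4x4 image split into a 2x2 grid, with an assumed model reply.
M_i = grid_region_map(4, 4, rows=2, cols=2)
T_o = "2: 0.9, 3: 0.3"  # hypothetical formatted output
A_o = text_to_segmentation(M_i, parse_region_scores(T_o))
```

Because the model only names regions, the resolution of A_o is bounded by the granularity of M_i, which is presumably why the choice of division method is the subject of an ablation.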