LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions

Yejin Kwon, Daeun Moon, Youngje Oh, Hyunsoo Yoon

TL;DR

This work tackles logical anomaly detection in industrial settings by introducing LogicQA, a training-free, few-shot framework that uses a pre-trained Vision-Language Model to automatically generate anomaly-relevant questions and provide natural-language explanations. By describing normal images, summarizing normal context, generating main questions, and testing with semantically varied sub-questions, LogicQA enables interpretable anomaly detection without task-specific training or annotations. It achieves state-of-the-art performance on MVTec LOCO AD and strong results on a real-world Semiconductor SEM dataset, while demonstrating robustness across different VLM backbones. The practical significance lies in scalable, explainable industrial AD with minimal data requirements, broad applicability across classes, and compatibility with multiple VLMs.

Abstract

Anomaly Detection (AD) focuses on detecting samples that differ from the standard pattern, making it a vital tool in process control. Logical anomalies may appear visually normal yet violate predefined constraints on object presence, arrangement, or quantity, so detecting them depends on reasoning and explainability. We introduce LogicQA, a framework that enhances AD by providing industrial operators with explanations for logical anomalies. LogicQA compiles automatically generated questions into a checklist and collects responses to identify violations of logical constraints. LogicQA is training-free, annotation-free, and operates in a few-shot setting. We achieve state-of-the-art (SOTA) Logical AD performance on the public benchmark MVTec LOCO AD, with an AUROC of 87.6% and an F1-max of 87.0%, along with explanations of the anomalies. Our approach has also shown outstanding performance on corporate semiconductor SEM data, further validating its effectiveness in industrial applications.

Paper Structure

This paper contains 41 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of Logical AD: (A) Models trained from scratch (e.g., AutoEncoder) perform logical AD but require a large number of images. (B) Models leveraging memory-based AD methods (e.g., PatchCore) use pre-trained vision models to extract visual features from normal images, enabling few-shot AD. (C) Our method, LogicQA, utilizes a pre-trained VLM to generate anomaly-relevant questions and analyze test images, using the answers to identify and explain abnormalities.
  • Figure 2: Pipeline of LogicQA. (1) Describing the Normal Images – The VLM generates textual descriptions of three normal images based on a predefined normality definition. (2) Summarizing the Normal Image Context – Shared features are extracted to define the core traits of normality. (3) Generating Main Questions – The VLM formulates key questions to assess whether an image is normal or anomalous. (4) Testing – The VLM generates sub-questions as variations of the main questions. Using a voting mechanism on the VLM’s responses, we determine whether the image satisfies the main questions. If it fails to satisfy even one, it is classified as anomalous.
  • Figure 3: Input Image Pre-Processing: BPM applies an attention mask to the original image, masking the background while preserving objects. Lang-SAM identifies objects relevant to the given prompt and returns them as bounding boxes.
  • Figure 4: Normal sample images from the MVTec LOCO AD dataset
  • Figure 5: Log-Probability Distribution of VLM answers
  • ...and 2 more figures
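
The testing step described in the Figure 2 caption (sub-questions, voting, and "anomalous if any main question fails") can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the majority-vote rule with ties counted as failure, the function names, the `fake_vlm` stand-in for a real VLM call, and the example questions are all assumptions introduced here.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: takes (image, question) and returns "yes" or "no".
AskFn = Callable[[dict, str], str]


def main_question_satisfied(image: dict, sub_questions: List[str], ask: AskFn) -> bool:
    """Vote over the sub-question answers (assumed rule: strict majority of "yes")."""
    votes = Counter(ask(image, q).strip().lower() for q in sub_questions)
    return votes["yes"] > votes["no"]


def classify(image: dict, checklist: Dict[str, List[str]], ask: AskFn) -> Tuple[str, List[str]]:
    """Label the image anomalous if even one main question is not satisfied."""
    failed = [
        main_q
        for main_q, subs in checklist.items()
        if not main_question_satisfied(image, subs, ask)
    ]
    return ("anomalous" if failed else "normal"), failed


def fake_vlm(image: dict, question: str) -> str:
    """Stand-in for a VLM: the 'image' is just a dict of canned answers."""
    return image.get(question, "no")


# Hypothetical checklist: main question -> semantically varied sub-questions.
CHECKLIST = {
    "Are all expected objects present?": [
        "Is every required object visible?",
        "Does the image contain all expected objects?",
        "Is no expected object missing?",
    ],
    "Is the object count correct?": [
        "Is the number of objects as expected?",
        "Does the image show the correct quantity of objects?",
        "Is the object quantity right?",
    ],
}

all_subs = [q for subs in CHECKLIST.values() for q in subs]

# A normal image where one inconsistent answer is outvoted by the other two.
normal_image = {q: "yes" for q in all_subs}
normal_image["Is no expected object missing?"] = "no"

# An image that violates the quantity constraint.
bad_image = dict(normal_image)
for q in CHECKLIST["Is the object count correct?"]:
    bad_image[q] = "no"

print(classify(normal_image, CHECKLIST, fake_vlm))
print(classify(bad_image, CHECKLIST, fake_vlm))
```

The list of failed main questions returned by `classify` is what would serve as the explanation shown to an operator; asking several rephrased sub-questions is meant to reduce the effect of a single inconsistent answer.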