LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction

Er Jin, Qihui Feng, Yongli Mou, Stefan Decker, Gerhard Lakemeyer, Oliver Simons, Johannes Stegmaier

TL;DR

This work tackles the challenge of detecting logical anomalies in industrial images, where non-local inconsistencies demand reasoning beyond local visual cues. It introduces LogicAD, a training-free, one-shot framework that leverages text features extracted from autoregressive vision-language models, followed by format-based scoring and a formal logic reasoner to detect and explain anomalies. Key contributions include a text-feature extraction pipeline with memory-bank-style representations, a format-embedding module for anomaly scoring, and a Prover9-based reasoning component that yields human-readable explanations and rigorous normality criteria. Empirically, LogicAD achieves state-of-the-art one-shot performance on MVTec LOCO AD (AUROC 86.0%, F1-max 83.7%) and competitive results on MVTec AD, while providing explanations that enhance interpretability and potential practical deployment in industrial QA settings.

Abstract

Logical image understanding involves interpreting and reasoning about the relationships and consistency within an image's visual content. This capability is essential in applications such as industrial inspection, where logical anomaly detection is critical for maintaining high-quality standards and minimizing costly recalls. Previous research in anomaly detection (AD) has relied on prior knowledge for designing algorithms, which often requires extensive manual annotations, significant computing power, and large amounts of training data. Autoregressive, multimodal Vision Language Models (AVLMs) offer a promising alternative due to their strong performance in visual reasoning across various domains. Despite this, their application to logical AD remains unexplored. In this work, we investigate using AVLMs for logical AD and demonstrate that they are well-suited to the task. Combining AVLMs with format embedding and a logic reasoner, we achieve SOTA performance on the public benchmark MVTec LOCO AD, with an AUROC of 86.0% and an F1-max of 83.7%, along with explanations of anomalies. This outperforms the existing SOTA method by a large margin.

Paper Structure (12 sections, 3 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of anomaly detection approaches. (A) Models trained from scratch require large-scale datasets and are capable of [...] but lack reasoning capabilities. (B) Memory-based methods leverage a pre-trained vision model to extract features from normal images, enabling [...]. However, they often require additional visual annotations and lack reasoning. (C) Our method uses pre-trained AVLMs as text feature extractors for detection and reasoning with only text prompts, eliminating the need for visual annotations.
  • Figure 2: Pipeline overview of LogicAD. The green box represents text feature extraction, $f_{i2t}$, which extracts features from the input image via a pre-trained AVLM; the detailed process is depicted in Figure 3. These features are then processed by two separate modules: format embedding (orange box) and logic reasoner (blue box). The format embedding module computes an anomaly score based on the similarity between embeddings of formatted normal and query features. The logic reasoner module uses logical rules derived from normal data to classify inputs as normal or abnormal while providing reasoning.
  • Figure 3: Text feature extraction $f_{i2t}$ involves extraction (blue box) and text embedding filtering (green box). Patches and the original image are processed by an AVLM to generate $K$ text descriptions. The green box uses text-embedding-3-large (Achiam et al., 2023) for output stabilization.
  • Figure 4: Illustration of a standard prompt versus a [...] prompt. We use image $X_q$ from Figure 3 as the input. The ground truth description specifies that two cables are not connected to the same slot position. Using prompts based on [...], AVLMs can generate more accurate descriptions of the input image.
  • Figure 5: Comparison of GPT-4o counting accuracy and additional visual examples. The accuracy of GPT-4o drops significantly as the number of homogeneous objects increases. The visual examples show two samples, one from CountBench (top) and one from UniformBench (bottom).
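
The Figure 2 caption says the format embedding module scores a query by the similarity between embeddings of formatted normal and query features. The exact formula is not given in this section, so the following is only a minimal sketch under an assumed rule: the score is one minus the highest cosine similarity between the query embedding and any normal embedding. The function names and the toy vectors are hypothetical; in practice the vectors would come from a text embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def anomaly_score(
    query_embedding: Sequence[float],
    normal_embeddings: Sequence[Sequence[float]],
) -> float:
    """Assumed scoring rule: 1 - max cosine similarity to the normal set.

    A query whose text embedding closely matches some normal reference
    gets a score near 0; a dissimilar query gets a higher score.
    """
    if not normal_embeddings:
        raise ValueError("at least one normal embedding is required")
    best = max(cosine_similarity(query_embedding, n) for n in normal_embeddings)
    return 1.0 - best


if __name__ == "__main__":
    # Toy 2-D vectors standing in for text embeddings (hypothetical data).
    normal = [[1.0, 0.0], [0.9, 0.1]]
    print(anomaly_score([1.0, 0.05], normal))  # close to a normal reference
    print(anomaly_score([0.0, 1.0], normal))   # far from every normal reference
```

In a one-shot setting the normal set would hold the embedding of a single reference description, and a threshold on the score would separate normal from anomalous queries. Both choices are assumptions here rather than details taken from the paper.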