Towards Training-free Anomaly Detection with Vision and Language Foundation Models

Jinjin Zhang, Guodong Wang, Yizhou Jin, Di Huang

TL;DR

This work introduces LogSAD, a training-free, multi-modal framework that unifies structural and logical anomaly detection. It combines a match-of-thought (MoT) procedure with detectors at three granularities (patch, interests, and composition) and a calibration and fusion stage. Using proposals generated offline by GPT-4V, open-vocabulary segmentation built on CLIP, DINOv2, and SAM, and CLIP-aligned composition checks, it achieves state-of-the-art performance among training-free methods on MVTec LOCO and strong results on standard AD datasets. The intermediate MoT prompts make the pipeline interpretable, and fusing the detectors makes it robust, so anomalies can be detected without labeled training data. Empirical results show robustness in extreme few-shot regimes and practical applicability to industrial inspection, with detailed ablations validating each component. The work also discusses limitations in open-vocabulary segmentation and compositional generalization, and outlines how more capable LMMs could address them.

Abstract

Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies that involve logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework for both Logical and Structural Anomaly Detection that requires no training. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e., GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
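
The calibration and integration step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes, purely for illustration, that each detector's raw scores are standardized with the mean and standard deviation of its scores on a few anomaly-free reference images, and that the calibrated scores are then fused by a maximum or a mean. The function names `calibrate` and `fuse` and the detector keys are hypothetical.

```python
import numpy as np


def calibrate(scores, normal_scores, eps=1e-8):
    """Standardize raw detector scores (assumed scheme, not from the paper).

    `normal_scores` are the same detector's scores on anomaly-free
    reference images; they define the detector's "normal" scale.
    """
    mu = np.mean(normal_scores)
    sigma = np.std(normal_scores)
    # eps guards against zero variance when very few references exist.
    return (np.asarray(scores, dtype=float) - mu) / (sigma + eps)


def fuse(calibrated, mode="max"):
    """Combine calibrated per-detector scores into one score per image."""
    stacked = np.stack([np.asarray(v, dtype=float) for v in calibrated.values()], axis=0)
    if mode == "max":
        return stacked.max(axis=0)
    if mode == "mean":
        return stacked.mean(axis=0)
    raise ValueError(f"unknown fusion mode: {mode}")


# Hypothetical raw scores on very different scales for two test images.
reference = {
    "patch": [0.10, 0.12, 0.11, 0.09],
    "interests": [5.0, 6.0, 5.0, 4.0],
    "composition": [0.5, 0.5, 0.6, 0.4],
}
test_scores = {
    "patch": [0.10, 0.30],
    "interests": [5.0, 5.0],
    "composition": [0.5, 0.9],
}

calibrated = {name: calibrate(test_scores[name], reference[name]) for name in reference}
image_scores = fuse(calibrated, mode="max")
```

Standardizing first means a detector with a large raw range (here the hypothetical `interests` scores) does not dominate the fused result merely because of its scale. Maximum fusion flags an image when any single detector deviates strongly, whereas mean fusion requires broader agreement.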

Paper Structure

This paper contains 12 sections, 5 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Examples of structural and logical anomalies in the MVTec LOCO dataset (Bergmann et al., 2022). Compositional multi-modal feature matching plays a crucial role in unified anomaly detection, particularly in identifying and categorizing logical anomalies effectively.
  • Figure 2: The framework of LogSAD. We use match-of-thought to generate matching proposals, deriving text prompts of interests and compositional rules for anomaly detection. Based on these text prompts, our method leverages vision and language foundation models to achieve multi-granularity anomaly detection, followed by calibration and fusion modules that make the final decision. Importantly, our algorithm detects both structural and logical anomalies within a unified framework, eliminating the need for training.
  • Figure 3: Match-of-thought for prompt and match engineering. The vision and language instructions consist of a few anomaly-free images and compositional logical constraints in MVTec LOCO.
  • Figure 4: Qualitative visualization results of open-vocabulary semantic segmentation on MVTec LOCO.
  • Figure 5: Comparison with LMMs in compositionality. Correct answers are marked in green, while incorrect ones are marked in yellow. Our proposed composition-granularity detector performs effectively, whereas LMMs struggle with issues of factuality and hallucination in logical anomaly understanding and detection.
  • ...and 1 more figures