Towards Training-free Anomaly Detection with Vision and Language Foundation Models

Jinjin Zhang; Guodong Wang; Yizhou Jin; Di Huang

TL;DR

This work introduces LogSAD, a training-free, multi-modal framework that unifies structural and logical anomaly detection. It combines a match-of-thought (MoT) procedure with detectors at three granularities (patch tokens, sets of interests, and composition matching), followed by score calibration and fusion. Using proposals generated offline by GPT-4V, open-vocabulary segmentation based on CLIP, DINOv2, and SAM, and CLIP-aligned composition checks, it achieves state-of-the-art performance among training-free methods on MVTec LOCO and strong results on standard anomaly detection datasets. The intermediate MoT prompts make the approach interpretable, and fusing the calibrated detector scores supports anomaly detection without labeled training data. Empirical results show robustness in extreme few-shot regimes and applicability to industrial inspection, with ablations validating each component. The work also discusses limitations in open-vocabulary segmentation and compositional generalization, and outlines future improvements with more capable large multi-modal models.

Abstract

Anomaly detection is valuable for real-world applications, such as industrial quality inspection. However, most approaches focus on detecting local structural anomalies while neglecting compositional anomalies incorporating logical constraints. In this paper, we introduce LogSAD, a novel multi-modal framework that requires no training for both Logical and Structural Anomaly Detection. First, we propose a match-of-thought architecture that employs advanced large multi-modal models (i.e., GPT-4V) to generate matching proposals, formulating interests and compositional rules of thought for anomaly detection. Second, we elaborate on multi-granularity anomaly detection, consisting of patch tokens, sets of interests, and composition matching with vision and language foundation models. Subsequently, we present a calibration module to align anomaly scores from different detectors, followed by integration strategies for the final decision. Consequently, our approach addresses both logical and structural anomaly detection within a unified framework and achieves state-of-the-art results without the need for training, even when compared to supervised approaches, highlighting its robustness and effectiveness. Code is available at https://github.com/zhang0jhon/LogSAD.
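
The abstract mentions a calibration module that aligns anomaly scores from different detectors before they are integrated. The abstract does not specify how this is done, so the following is only an illustrative sketch under stated assumptions, not the authors' implementation: each detector's raw score is standardized against the scores that detector assigns to a few normal reference images, and the calibrated scores are fused by taking the maximum. The function names `calibrate` and `fuse`, the detector names, and the choice of z-score calibration with max fusion are all hypothetical.

```python
from statistics import mean, pstdev


def calibrate(score, normal_scores, eps=1e-8):
    """Standardize a raw score against the scores the same detector
    produced on normal reference images (assumed z-score calibration)."""
    mu = mean(normal_scores)
    sigma = pstdev(normal_scores)
    return (score - mu) / (sigma + eps)


def fuse(raw_scores, normal_scores_by_detector):
    """Calibrate each detector's score, then fuse by taking the maximum.

    raw_scores: dict mapping detector name -> raw anomaly score for one image.
    normal_scores_by_detector: dict mapping detector name -> list of raw
        scores on normal reference images.
    Returns (fused_score, per_detector_calibrated_scores).
    """
    calibrated = {
        name: calibrate(score, normal_scores_by_detector[name])
        for name, score in raw_scores.items()
    }
    return max(calibrated.values()), calibrated


# Hypothetical numbers: the detectors report scores on very different scales.
normal_refs = {
    "patch": [0.10, 0.12, 0.14],
    "composition": [0.5, 0.5, 0.8],
}
test_image = {"patch": 0.12, "composition": 0.9}

fused, per_detector = fuse(test_image, normal_refs)
print(per_detector)
print("fused anomaly score:", fused)
```

With these made-up numbers, the patch score sits at the mean of its normal references while the composition score lies well above its own, so the fused score is driven by the composition detector even though the raw scales differ.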
