VMAD: Visual-enhanced Multimodal Large Language Model for Zero-Shot Anomaly Detection

Huilin Deng; Hongchen Luo; Wei Zhai; Yang Cao; Yu Kang

TL;DR

VMAD (Visual-enhanced MLLM Anomaly Detection) is a framework that enriches an MLLM with visual IAD knowledge through two key components: a Defect-Sensitive Structure Learning scheme that transfers patch-similarity cues for improved anomaly discrimination, and Locality-enhanced Token Compression, a visual projector that leverages multi-level local features for fine-grained detection.

Abstract

Zero-shot anomaly detection (ZSAD) recognizes and localizes anomalies in previously unseen objects by establishing a feature mapping between textual prompts and inspection images, which makes it valuable for flexible industrial manufacturing. However, existing ZSAD methods are limited to closed-world settings and struggle to handle unseen defects with predefined prompts. Recently, adapting Multimodal Large Language Models (MLLMs) to Industrial Anomaly Detection (IAD) has emerged as a viable solution. Unlike fixed-prompt methods, MLLMs follow a generative paradigm with open-ended text interpretation, enabling more adaptive anomaly analysis. This adaptation nevertheless faces inherent challenges, because anomalies often appear in fine-grained regions and differ only minimally from normal samples. To address these challenges, we propose VMAD (Visual-enhanced MLLM Anomaly Detection), a novel framework that enhances an MLLM with visual-based IAD knowledge and fine-grained perception, providing both precise detection and comprehensive analysis of anomalies. Specifically, we design a Defect-Sensitive Structure Learning scheme that transfers patch-similarity cues from the visual branch to our MLLM for improved anomaly discrimination. In addition, we introduce a novel visual projector, Locality-enhanced Token Compression, which mines multi-level features in local contexts to enhance fine-grained detection. Furthermore, we introduce Real Industrial Anomaly Detection (RIAD), a comprehensive IAD dataset with detailed anomaly descriptions and analyses, offering a valuable resource for MLLM-based IAD development. Extensive experiments on zero-shot benchmarks, including the MVTec-AD, VisA, WFDD, and RIAD datasets, demonstrate superior performance over state-of-the-art methods. The code and dataset will be available soon.
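To make the "feature mapping between textual prompts and inspection images" concrete, here is a minimal sketch of the generic fixed-prompt ZSAD idea that the abstract contrasts with MLLM-based approaches. It is not VMAD's method. The function names, the toy 2-D embeddings, the temperature of 0.07, the two-prompt softmax, and the max-over-patches image score are all illustrative assumptions; a real system would take patch and text embeddings from a pretrained vision-language encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def patch_anomaly_scores(patch_feats, normal_text, abnormal_text, temperature=0.07):
    """Score each patch by a softmax over its similarity to two text prompts.

    Returns, per patch, the probability mass assigned to the "abnormal" prompt.
    The temperature value is an assumption chosen for illustration.
    """
    scores = []
    for patch in patch_feats:
        logit_n = cosine(patch, normal_text) / temperature
        logit_a = cosine(patch, abnormal_text) / temperature
        # Subtract the max logit for numerical stability.
        m = max(logit_n, logit_a)
        e_n = math.exp(logit_n - m)
        e_a = math.exp(logit_a - m)
        scores.append(e_a / (e_n + e_a))
    return scores


# Toy embeddings (hypothetical): axis 0 ~ "normal", axis 1 ~ "abnormal".
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
normal_text = [1.0, 0.0]
abnormal_text = [0.0, 1.0]

scores = patch_anomaly_scores(patches, normal_text, abnormal_text)
# Image-level score as the maximum patch score (one common, assumed choice).
image_score = max(scores)
```

Because the prompts are fixed in advance, this kind of scoring can only express what the two text embeddings encode, which is the closed-world limitation the abstract describes.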

Paper Structure (28 sections, 23 equations, 10 figures, 4 tables)

Introduction
Related Work
Industrial Anomaly Detection
Zero-shot Anomaly Detection
Multimodal Large Language Models (MLLMs)
Method
Problem Setup
Overall Architecture
MLLM Framework
Visual Branch
Training Objectives
Defect-Sensitive Structure Learning
Visual Patch Similarity Distribution
Text-Visual Patch Similarity Distribution
Locality-enhanced Token Compression
...and 13 more sections

Figures (10)

Figure 1: Comparison between previous ZSAD methods and MLLM-based ZSAD methods. (a) Previous ZSAD methods use fixed templates and generic descriptions, and are confined to closed-world anomaly detection. (b) MLLM-based methods leverage open-ended text interpretation and generation for IAD, providing additional comprehensive analysis and adapting flexibly to diverse criteria across multiple scenarios.
Figure 2: Various visual projectors. Abstractors compress limited information, while LTC mines multi-level local cues.
Figure 3: (Left) Overview of VMAD. VMAD incorporates a visual branch for anomaly localization (see Overall Architecture), with Locality-enhanced Token Compression serving as a visual projector (see Locality-enhanced Token Compression). (Right) Defect-Sensitive Structure Learning. It aligns visual and text-visual patch similarity distributions using PBSD loss, enhancing the MLLM's sensitivity to anomalous structures (see Defect-Sensitive Structure Learning). GP: Global Pooling, Sim: Similarity computation.
Figure 4: Overview of LTC mechanism. It incorporates multi-level visual cues through a coarse-to-fine scheme, providing comprehensive image information to the LLM.
Figure 5: Properties of the RIAD dataset. (a) RIAD data pairs: images, semantic segmentation masks encoding defect types, and GPT-generated text. Masks and text are stored in JSON format. (b) The horizontal axis represents the categories of objects, the vertical axis represents quantity, and different colors represent different defect types. The top part shows sample RIAD images with their source datasets indicated. (c) The ratio of normal images and abnormal images in each object class.
...and 5 more figures
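
The Figure 3 caption describes aligning a visual patch similarity distribution with a text-visual one. The sketch below illustrates one way such an alignment could be computed, under stated assumptions: each distribution is a row-wise softmax over pairwise cosine similarities, and the mismatch is measured with a KL divergence. The exact form of the PBSD loss is not given in this summary, so the KL choice, the temperature of 0.1, the function names, and the toy features are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_distribution(feats, temperature=0.1):
    """For each patch, a softmax over its cosine similarity to every patch."""
    dists = []
    for f in feats:
        logits = [cosine(f, g) / temperature for g in feats]
        m = max(logits)  # stabilize the exponentials
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        dists.append([e / z for e in exps])
    return dists


def kl_alignment_loss(target, pred, eps=1e-12):
    """Mean KL(target || pred) over patches (assumed stand-in for the PBSD loss)."""
    total = 0.0
    for p_row, q_row in zip(target, pred):
        total += sum(p * math.log((p + eps) / (q + eps))
                     for p, q in zip(p_row, q_row))
    return total / len(target)


# Toy features (hypothetical): the visual branch separates the third patch
# clearly, while the MLLM-side features barely distinguish the patches.
visual_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
mllm_feats = [[0.5, 0.5], [0.6, 0.4], [0.4, 0.6]]

p = similarity_distribution(visual_feats)
q = similarity_distribution(mllm_feats)

loss = kl_alignment_loss(p, q)       # positive: structures disagree
self_loss = kl_alignment_loss(p, p)  # zero: identical structures
```

Minimizing such a loss with respect to the MLLM-side features would push their patch-to-patch structure toward that of the visual branch, which is the intuition behind "transferring patch-similarity cues" in the abstract.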
