EAGLE: Expert-Augmented Attention Guidance for Tuning-Free Industrial Anomaly Detection in Multimodal Large Language Models

Xiaomeng Peng, Xilang Huang, Seon Han Choi

TL;DR

This work tackles industrial anomaly detection without the expensive re-training of multimodal large language models (MLLMs). It introduces EAGLE, a tuning-free framework that couples a PatchCore-based expert with frozen MLLMs. Distribution-Based Thresholding (DBT) decides when to inject expert prompts, and Confidence-Aware Attention Scaling (CAAS) reduces reliance on textual priors that may be wrong. On MVTec-AD and VisA, EAGLE consistently improves detection accuracy and recall across multiple backbones, with results competitive with fine-tuned baselines. The study also reports a strong link between correct predictions and attention focused on true defect regions, suggesting that expert-guided prompting can improve both performance and interpretability in industrial anomaly detection.

Abstract

Industrial anomaly detection is important for smart manufacturing, but many deep learning approaches produce only binary decisions and provide limited semantic explanations. Multimodal large language models (MLLMs) can potentially generate fine-grained, language-based analyses, yet existing methods often require costly fine-tuning and do not consistently improve anomaly detection accuracy compared to lightweight specialist detectors. We propose expert-augmented attention guidance for industrial anomaly detection in MLLMs (EAGLE), a tuning-free framework that integrates outputs from an expert model to guide MLLMs toward both accurate detection and interpretable anomaly descriptions. We further study how EAGLE affects MLLM internals by examining how attention in the intermediate layers is distributed over anomalous image regions. We observe that successful anomaly detection is associated with increased attention concentration on anomalous regions, and EAGLE tends to encourage this alignment. Experiments on MVTec-AD and VisA show that EAGLE improves anomaly detection performance across multiple MLLMs without any parameter updates, achieving results comparable to fine-tuning-based methods. Code is available at https://github.com/shengtun/Eagle

Paper Structure (27 sections, 11 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Attention map visualizations from Qwen2.5-VL-7B on the MVTec-AD and VisA datasets. Compared with the original model, our framework guides the model to consistently focus on anomalous regions.
  • Figure 2: Pipeline of EAGLE. Given a query image, the expert model outputs an image-level anomaly score and a pixel-level anomaly map. The anomaly score is compared with a threshold $\tau$ estimated by the proposed DBT mechanism (see §\ref{DBT}), which determines whether the image is predicted as normal or abnormal and controls both the selective injection of expert-generated visual prompts and the selection of the corresponding textual prompts. When the anomaly score falls into the low-confidence region $[\tau, s_{\max}]$, the CAAS mechanism (see §\ref{CAAM}) conditionally enhances attention to visual tokens. The combined textual priors and visual prompts guide the MLLM to produce the final anomaly prediction. The attention logits are defined as $A^{l,h} = QK^\top / \sqrt{d_k}$, and $X^{l-1}$ and $X^l$ denote the input and output of the multi-head attention (MHA) at layer $l$, respectively.
  • Figure 3: Average number of sampled and unsampled patches per training image for different classes during memory bank construction.
  • Figure 4: Anomaly score distributions across datasets. Normal training (green), normal test (blue), and abnormal test (purple) samples are shown. Additional results are included in Appendix \ref{app:sampled}.
  • Figure 5: Illustration of the DBT mechanism. Normal training images are converted into patch-level features $\mathcal{F}$, which are stored in a memory bank constructed via greedy coreset sampling $\mathcal{G}$. For each training image $x_i$, its unsampled patch set $P_i^{(un)}$ is used to compute the image-level anomaly score through nearest-neighbor search (Eqs. \ref{max_distance}, \ref{max_distance_2}).
  • ...and 8 more figures
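
The DBT steps described in the captions of Figures 2 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The captions do not give the exact scoring equations or the rule that turns the training-score distribution into $\tau$, so the sketch assumes (a) the image score is the largest nearest-neighbor distance over the unsampled patches, (b) $\tau$ is a high quantile of the training scores, and (c) $s_{\max}$ is the largest training score. All function names and the `coreset_ratio` and `quantile` parameters are hypothetical.

```python
import numpy as np


def greedy_coreset(features, n_select, seed=0):
    """Farthest-point (k-center greedy) sampling; returns selected row indices."""
    rng = np.random.default_rng(seed)
    first = int(rng.integers(len(features)))
    selected = [first]
    # Distance from every feature to its closest already-selected feature.
    min_dist = np.linalg.norm(features - features[first], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return np.array(selected)


def image_score(patches, memory_bank):
    """ASSUMPTION: score = largest nearest-neighbor distance over the patches."""
    dists = np.linalg.norm(patches[:, None, :] - memory_bank[None, :, :], axis=2)
    return float(dists.min(axis=1).max())


def estimate_threshold(train_feats, coreset_ratio=0.1, quantile=0.95, seed=0):
    """train_feats: (n_images, n_patches, dim) features of normal training images."""
    n_images, n_patches, dim = train_feats.shape
    flat = train_feats.reshape(-1, dim)
    n_select = max(1, int(coreset_ratio * len(flat)))
    sampled = greedy_coreset(flat, n_select, seed)
    memory_bank = flat[sampled]

    # Mark which patches of each image ended up in the memory bank.
    is_sampled = np.zeros(len(flat), dtype=bool)
    is_sampled[sampled] = True
    is_sampled = is_sampled.reshape(n_images, n_patches)

    # Score each training image using only its unsampled patches.
    scores = []
    for i in range(n_images):
        unsampled = train_feats[i][~is_sampled[i]]
        if len(unsampled):
            scores.append(image_score(unsampled, memory_bank))
    scores = np.array(scores)

    tau = float(np.quantile(scores, quantile))  # ASSUMPTION: quantile rule
    s_max = float(scores.max())                 # ASSUMPTION: max training score
    return memory_bank, tau, s_max


def route(score, tau, s_max):
    """Map an anomaly score to the three regions implied by the Figure 2 caption."""
    if score < tau:
        return "normal"
    if score <= s_max:
        return "abnormal_low_confidence"  # region where CAAS would be applied
    return "abnormal"


# Toy usage with random stand-in features (not real image embeddings).
rng = np.random.default_rng(0)
train = rng.normal(size=(8, 16, 4))
bank, tau, s_max = estimate_threshold(train)
query = rng.normal(size=(16, 4)) + 3.0
decision = route(image_score(query, bank), tau, s_max)
print(decision)
```

Scoring training images only on their unsampled patches avoids the trivial zero distance that a sampled patch has to its own copy in the memory bank, which is presumably why the caption singles out $P_i^{(un)}$.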