Myriad: Large Multimodal Model by Applying Vision Experts for Industrial Anomaly Detection

Yuanze Li, Haolin Wang, Shihao Yuan, Ming Liu, Debin Zhao, Yiwen Guo, Chen Xu, Guangming Shi, Wangmeng Zuo

TL;DR

Myriad addresses industrial anomaly detection's need for flexible, deployment-agnostic solutions by integrating existing IAD vision experts as guidance for a large multimodal backbone. A VE-guided vision encoder with LoRRA and a visual prompt generator, complemented by a textual prompt generator, injects IAD-domain knowledge into the LLM, enabling accurate anomaly localization and rich, instruction-following descriptions. Across MVTec-AD, VisA, and PCB Bank, Myriad achieves state-of-the-art or competitive results in one-class and few-shot settings, while also supporting zero-shot operation via different vision experts. The approach demonstrates strong generalization, interpretability, and practical applicability in dynamic manufacturing contexts, with code and models publicly available.

Abstract

Due to their training configuration, traditional industrial anomaly detection (IAD) methods have to train a specific model for each deployment scenario, which is insufficient to meet the requirements of modern design and manufacturing. In contrast, large multimodal models (LMMs) have shown strong generalization ability on various vision tasks, and their perception and comprehension capabilities suggest the potential of applying LMMs to IAD tasks. However, we observe that even though LMMs have abundant knowledge about industrial anomaly detection in the textual domain, they are unable to leverage this knowledge due to the modality gap between the textual and visual domains. To stimulate the relevant knowledge in LMMs and adapt them to anomaly detection tasks, we introduce existing IAD methods as vision experts and present a novel large multimodal model applying vision experts for industrial anomaly detection (abbreviated to Myriad). Specifically, we utilize the anomaly map generated by the vision experts as guidance for LMMs, such that the vision model is guided to pay more attention to anomalous regions. Then, the visual features are modulated via an adapter to fit the anomaly detection tasks and are fed into the language model together with the vision expert guidance and human instructions to generate the final outputs. Extensive experiments on the MVTec-AD, VisA, and PCB Bank benchmarks demonstrate that our proposed method not only performs favorably against state-of-the-art methods, but also inherits the flexibility and instruction-following ability of LMMs in the field of IAD. Source code and pre-trained models are publicly available at https://github.com/tzjtatata/Myriad.
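
As a rough illustration of the idea of using an anomaly map as guidance for visual features, the sketch below average-pools a pixel-level anomaly map down to a patch grid and scales each patch feature by its pooled score. This is not the authors' implementation: the function names `pool_anomaly_map` and `reweight_patches`, the `strength` parameter, and the simple multiplicative weighting are all assumptions made for illustration. The abstract only states that the anomaly map guides the vision model toward anomalous regions.

```python
from typing import List

Matrix = List[List[float]]


def pool_anomaly_map(anomaly_map: Matrix, grid: int) -> Matrix:
    """Average-pool a square anomaly map down to a grid x grid patch map."""
    size = len(anomaly_map)
    if size % grid != 0:
        raise ValueError("map size must be divisible by the patch grid")
    step = size // grid
    pooled: Matrix = []
    for gy in range(grid):
        row = []
        for gx in range(grid):
            # Collect all pixel scores that fall inside this patch.
            block = [
                anomaly_map[y][x]
                for y in range(gy * step, (gy + 1) * step)
                for x in range(gx * step, (gx + 1) * step)
            ]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled


def reweight_patches(
    features: List[List[float]], pooled: Matrix, strength: float = 1.0
) -> List[List[float]]:
    """Scale each patch feature vector by (1 + strength * anomaly score).

    Hypothetical weighting rule: patches are assumed to be in row-major order.
    """
    weights = [w for row in pooled for w in row]
    if len(weights) != len(features):
        raise ValueError("one feature vector is required per patch")
    return [
        [value * (1.0 + strength * w) for value in feat]
        for feat, w in zip(features, weights)
    ]


if __name__ == "__main__":
    # Toy 4x4 anomaly map with a defect in the bottom-right corner.
    amap = [
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 1.0],
        [0.0, 0.0, 1.0, 1.0],
    ]
    feats = [[1.0, 2.0] for _ in range(4)]  # four patches, two channels each
    pooled = pool_anomaly_map(amap, grid=2)
    print(pooled)
    print(reweight_patches(feats, pooled))
```

In this toy setup only the bottom-right patch overlaps the defect, so only its feature vector is amplified. A learned adapter would replace the fixed scaling rule used here.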

Paper Structure (16 sections, 8 equations, 6 figures, 6 tables)

This paper contains 16 sections, 8 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Existing IAD methods are limited to predicting anomaly maps and anomaly scores without comprehensive descriptions (a), while LMMs like MiniGPT-4 cannot generate IAD-related descriptions well. (b) By incorporating pre-trained IAD models as vision experts, our Myriad can perceive IAD domain knowledge via the introduced vision expert-guided vision encoder and vision expert guidance adapter. Our Myriad provides not only favorable anomaly detection accuracy but also instruction-following capability.
  • Figure 2: The architecture of the proposed Myriad. Given an input industrial image, the vision expert estimates an anomaly map $\mathbf{M}$ containing prior knowledge. To adapt visual features for the IAD task, we propose a VE-guided vision encoder, which enhances vision features for better alignment with industrial images and focuses more on anomalous regions via expert prompts generated by the visual prompt generator (VPG). Furthermore, the textual prompt generator (TPG) embeds the anomaly map into vision expert tokens, enhancing the LLM's ability to utilize additional information.
  • Figure 3: MiniGPT-4 fails to utilize the IAD knowledge in Vicuna to recognize missing wax on the candles.
  • Figure 4: Illustration of flexibility. Myriad inherits its flexibility from LMMs: it receives human expertise and then generates a correct text sequence. The ground truth defects are highlighted with red bounding boxes in the image.
  • Figure 5: Qualitative comparison between Myriad and state-of-the-art LMMs. Correct details are highlighted in green, while incorrect details are marked in red. The ground truth defects are highlighted with red bounding boxes in the image. Best viewed in color.
  • ...and 1 more figures