OmniAD: Detect and Understand Industrial Anomaly via Multimodal Reasoning

Shifang Zhao, Yiheng Lin, Lu Han, Yao Zhao, Yunchao Wei

TL;DR

OmniAD tackles industrial anomaly understanding by unifying detection and reasoning in a single multimodal framework. Text-as-Mask Encoding recasts anomaly segmentation as text generation, and Visual Guided Textual Reasoning then builds on that visual result to produce a detailed analysis. The model is trained with a strategy that combines supervised fine-tuning (SFT) and reinforcement learning (GRPO). On MMAD and multiple anomaly-detection benchmarks, OmniAD achieves state-of-the-art or competitive results without hand-tuned thresholds. The result is an explainable, few-shot-capable system for practical industrial anomaly analysis, and the authors state that code and models will be released publicly.
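To make the idea of "segmentation as text generation" concrete, here is a minimal sketch of one way a binary mask could be written as text and parsed back. This is not the paper's Text-as-Mask Encoding, whose details are not given in this summary; the run-length format and the helper names `mask_to_text` and `text_to_mask` are assumptions chosen purely for illustration.

```python
from typing import List


def mask_to_text(mask: List[List[int]]) -> str:
    """Run-length encode each row of a binary mask as text.

    Hypothetical format (an assumption, not the paper's encoding):
    runs are written as "<value>*<length>", runs in a row are separated
    by spaces, and rows are separated by "; ".
    Assumes every row is non-empty.
    """
    rows = []
    for row in mask:
        runs = []
        i = 0
        while i < len(row):
            j = i
            # Extend the run while the value stays the same.
            while j < len(row) and row[j] == row[i]:
                j += 1
            runs.append(f"{row[i]}*{j - i}")
            i = j
        rows.append(" ".join(runs))
    return "; ".join(rows)


def text_to_mask(text: str) -> List[List[int]]:
    """Invert mask_to_text: parse the text back into a binary mask."""
    mask = []
    for row_text in text.split("; "):
        row: List[int] = []
        for run in row_text.split(" "):
            value, count = run.split("*")
            row.extend([int(value)] * int(count))
        mask.append(row)
    return mask


example_mask = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
]
encoded = mask_to_text(example_mask)
decoded = text_to_mask(encoded)
```

Because the decoded values are already discrete (0 or 1), a scheme of this kind needs no score threshold to turn the output into a mask, which is consistent with the threshold-free claim above.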

Abstract

While anomaly detection has made significant progress, generating detailed analyses that incorporate industrial knowledge remains a challenge. To address this gap, we introduce OmniAD, a novel framework that unifies anomaly detection and understanding for fine-grained analysis. OmniAD is a multimodal reasoner that combines visual and textual reasoning processes. The visual reasoning provides detailed inspection by leveraging Text-as-Mask Encoding to perform anomaly detection through text generation without manually selected thresholds. Following this, Visual Guided Textual Reasoning conducts comprehensive analysis by integrating visual perception. To enhance few-shot generalization, we employ an integrated training strategy that combines supervised fine-tuning (SFT) with reinforcement learning (GRPO), incorporating three sophisticated reward functions. Experimental results demonstrate that OmniAD achieves a score of 79.1 on the MMAD benchmark, surpassing models such as Qwen2.5-VL-7B and GPT-4o. It also shows strong results across multiple anomaly detection benchmarks. These results highlight the importance of enhancing visual perception for effective reasoning in anomaly understanding. All code and models will be made publicly available.
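The GRPO stage mentioned in the abstract scores several sampled responses to the same prompt relative to one another. A minimal sketch of that group-relative advantage step is shown below. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; the abstract does not specify the three reward functions, so the sketch takes already-combined scalar rewards as input.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of sampled responses.

    Each reward is centered on the group mean and scaled by the group
    standard deviation, so responses are ranked against their siblings
    rather than against a learned value function.
    `eps` (an assumed constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for four sampled responses to one prompt.
group_rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(group_rewards)
```

Responses that score above the group mean receive a positive advantage and are reinforced; those below receive a negative one. When all rewards in a group are identical, every advantage is zero and the group contributes no learning signal.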

Paper Structure

This paper contains 21 sections, 1 equation, 8 figures, 17 tables.

Figures (8)

  • Figure 1: OmniAD unifies anomaly detection and understanding through multimodal reasoning. It uses visual reasoning for anomaly classification and localization, followed by textual reasoning for a comprehensive analysis. The integrated supervised fine-tuning (SFT) and reinforcement learning (GRPO) training strategy improves generalization from few-shot samples.
  • Figure 2: An illustration of the Text-as-Mask Encoding process.
  • Figure 3: Qualitative results on MMAD (Jiang et al., 2024). The multimodal reasoning process facilitates accurate anomaly detection and question analysis. The ground truth mask and selected choice are indicated by the blue line.
  • Figure 4: Case Study of Multimodal Reasoning.
  • Figure 5: An illustration of results on the MVTec-AD dataset.
  • ...and 3 more figures