M3-AD: Reflection-aware Multi-modal, Multi-category, and Multi-dimensional Benchmark and Framework for Industrial Anomaly Detection

Chao Huang, Yanhui Li, Yunkang Cao, Wei Wang, Hongxi Huang, Jie Wen, Wenqi Ren, Xiaochun Cao

TL;DR

The paper proposes RA-Monitor, which models reflection as a learnable decision revision process and guides models to perform controlled self-correction when their initial judgments are unreliable, thereby improving decision robustness.

Abstract

Although multimodal large language models (MLLMs) have advanced industrial anomaly detection toward a zero-shot paradigm, they still tend to produce high-confidence yet unreliable decisions in fine-grained and structurally complex industrial scenarios, and lack effective self-corrective mechanisms. To address this issue, we propose M3-AD, a unified reflection-aware multimodal framework for industrial anomaly detection. M3-AD comprises two complementary data resources: M3-AD-FT, designed for reflection-aligned fine-tuning, and M3-AD-Bench, designed for systematic cross-category evaluation, together providing a foundation for reflection-aware learning and reliability assessment. Building upon this foundation, we propose RA-Monitor, which models reflection as a learnable decision revision process and guides models to perform controlled self-correction when initial judgments are unreliable, thereby improving decision robustness. Extensive experiments conducted on M3-AD-Bench demonstrate that RA-Monitor outperforms multiple open-source and commercial MLLMs in zero-shot anomaly detection and anomaly analysis tasks. Code will be released at https://github.com/Yanhui-Lee/M3-AD.
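
To make the idea of "controlled self-correction when initial judgments are unreliable" concrete, the following is a minimal sketch and not RA-Monitor itself. In the paper, reflection is a learned decision revision process; here a fixed confidence threshold is swapped in purely for illustration. The names `Judgment`, `decide_with_reflection`, `reflect`, and `threshold` are all hypothetical and do not come from the paper or its code release.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Judgment:
    label: str         # e.g. "normal" or "anomalous"
    confidence: float  # hypothetical reliability score in [0, 1]


def decide_with_reflection(
    initial: Judgment,
    reflect: Callable[[Judgment], Judgment],
    threshold: float = 0.7,  # assumed cut-off; the paper learns this behavior instead
) -> Judgment:
    """Keep a reliable initial judgment; otherwise consider a revised one."""
    if initial.confidence >= threshold:
        # The initial judgment is treated as reliable, so no reflection is triggered.
        return initial
    revised = reflect(initial)
    # "Controlled" revision: accept the new judgment only if it is more confident.
    return revised if revised.confidence > initial.confidence else initial


if __name__ == "__main__":
    # Stand-in for a model's reflection step (illustrative only).
    def toy_reflect(j: Judgment) -> Judgment:
        return Judgment("anomalous", 0.9)

    print(decide_with_reflection(Judgment("normal", 0.4), toy_reflect))
```

The gate only illustrates the control flow: reflection is invoked selectively, and a revision replaces the first answer only when it is judged more reliable. How reliability is estimated and how the revision is produced are exactly what the paper's training stages are meant to learn.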

Paper Structure (34 sections, 13 equations, 37 figures, 9 tables)

Figures (37)

  • Figure 1: M3-AD enables self-correction of unreliable initial predictions through a reflection-aware mechanism, significantly improving anomaly type recognition and spatial localization in industrial anomaly detection compared to base models.
  • Figure 2: Overview of M3-AD-FT data construction pipeline. The pipeline consists of four stages: (1) collecting and organizing industrial images across multiple scenarios with structured anomaly annotations; (2) classifying data by scenario and generating initial model answers; (3) constructing thinking and reflective captions; (4) preparing training data through manual verification.
  • Figure 3: Overview of RA-Monitor. RAWS equips the pre-trained model with both thinking and reflective abilities, while RCRL further optimizes the model via consistency, accuracy, and reflection rewards. The lower part illustrates the unified metric computation used for multi-level evaluation of anomaly detection, type recognition, and localization.
  • Figure 4: Ablation of reflection reward.
  • Figure 5: Performance improvement after reflection.
  • ...and 32 more figures