Reasoning-Driven Anomaly Detection and Localization with Image-Level Supervision

Yizhou Jin, Yuezhu Feng, Jinjin Zhang, Peng Wang, Qingjie Liu, Yunhong Wang

Abstract

Multimodal large language models (MLLMs) have recently demonstrated remarkable reasoning and perceptual abilities for anomaly detection. However, most approaches remain confined to image-level anomaly detection and textual reasoning, while pixel-level localization still relies on external vision modules and dense annotations. In this work, we activate the intrinsic reasoning potential of MLLMs to perform anomaly detection, pixel-level localization, and interpretable reasoning solely from image-level supervision, without any auxiliary components or pixel-wise labels. Specifically, we propose Reasoning-Driven Anomaly Localization (ReAL), which extracts anomaly-related tokens from the autoregressive reasoning process and aggregates their attention responses to produce pixel-level anomaly maps. We further introduce a Consistency-Guided Reasoning Optimization (CGRO) module that leverages reinforcement learning to align reasoning tokens with visual attentions, resulting in more coherent reasoning and accurate anomaly localization. Extensive experiments on four public benchmarks demonstrate that our method significantly improves anomaly detection, localization, and interpretability. Remarkably, despite relying solely on image-level supervision, our approach achieves performance competitive with MLLM-based methods trained under dense pixel-level supervision. Code is available at https://github.com/YizhouJin313/ReADL.
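
The core ReAL mechanism described above lends itself to a compact sketch: each generated reasoning token carries an attention distribution over image patches, and the attentions of the anomaly-related tokens are aggregated into a pixel-level anomaly map. The PyTorch snippet below is a minimal illustration under assumed interfaces; the function name `real_anomaly_map`, the top-k selection, the softmax weighting, and the min-max normalization are our assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def real_anomaly_map(attn_maps: torch.Tensor,
                     token_scores: torch.Tensor,
                     image_size: tuple[int, int],
                     keep_ratio: float = 0.1) -> torch.Tensor:
    """Aggregate reasoning-token attentions into a pixel-level anomaly map.

    attn_maps:    (T, h, w) attention of each generated token over image patches.
    token_scores: (T,) anomaly-relatedness score per token (assumed precomputed,
                  e.g., from semantic relevance and spatial entropy as in ReAL).
    Returns an (H, W) anomaly map in [0, 1].
    """
    T = attn_maps.shape[0]
    k = max(1, int(T * keep_ratio))            # keep the top-k anomaly-related tokens
    scores, idx = token_scores.topk(k)
    weights = torch.softmax(scores, dim=0)     # score-weighted aggregation (assumption)
    amap = (weights[:, None, None] * attn_maps[idx]).sum(dim=0)    # (h, w)
    amap = F.interpolate(amap[None, None], size=image_size,
                         mode="bilinear", align_corners=False)[0, 0]
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)  # min-max normalize
    return amap
```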

Figures (7)

  • Figure 1: Visualization of token-to-patch attention maps for certain reasoning tokens. Only a small subset of tokens (highlighted in red) exhibits focused attention on the true anomaly regions (red boxes), and these tokens are also semantically related to anomaly concepts (e.g., “scratch”, “mark”).
  • Figure 2: Overview of the proposed reasoning-driven anomaly detection framework. Given an input image, the MLLM generates a reasoning process and a final anomaly answer. The ReAL module selects anomaly-related tokens based on semantic relevance $S_\text{T}$ and spatial entropy $S_\text{I}$, then aggregates their visual attentions into a pixel-level anomaly map. During training, the selected tokens feed the CGRO module, which applies reinforcement learning driven jointly by a reasoning–localization consistency reward and R1-based accuracy and format rewards (a sketch of this reward combination appears after this list). This aligns reasoning tokens with their corresponding visual attentions while improving reasoning correctness, structural quality, and anomaly localization accuracy.
  • Figure 3: Qualitative comparison of Qwen2.5-VL-7B, Qwen2.5-VL-7B+R1, and Qwen2.5-VL-7B+CGRO.
  • Figure 4: Visualization of the token selection strategies used in ReAL. Each column shows anomaly maps generated by: (a) keeping all tokens, (b) filtering with spatial entropy $S_\text{I}$, (c) filtering with semantic relevance $S_\text{T}$, (d) combining both scores with weighted aggregation. $S_\text{T}$ measures semantic alignment with anomaly-related concepts, while $S_\text{I}$ measures attention concentration; see the token-scoring sketch after this list. Methods that incorporate both criteria yield more accurate anomaly maps.
  • Figure 5: Qualitative comparison of Qwen2.5-VL-7B, Qwen2.5-VL-7B+R1, and Qwen2.5-VL-7B+CGRO.
  • ...and 2 more figures
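
The two selection criteria named in the Figure 2 and Figure 4 captions admit a short sketch. Below, the semantic relevance $S_\text{T}$ is modeled as cosine similarity between reasoning-token embeddings and embeddings of anomaly concepts (e.g., "scratch", "crack"), and the spatial score $S_\text{I}$ as one minus the normalized entropy of a token's attention over patches (lower entropy means more concentrated attention). Both concrete forms and the weight `alpha` are assumptions consistent with, but not taken from, the paper.

```python
import torch
import torch.nn.functional as F

def semantic_relevance(token_embs: torch.Tensor,
                       concept_embs: torch.Tensor) -> torch.Tensor:
    """S_T: max cosine similarity between each reasoning-token embedding
    and a set of anomaly-concept embeddings. token_embs: (T, d), concept_embs: (C, d)."""
    t = F.normalize(token_embs, dim=-1)
    c = F.normalize(concept_embs, dim=-1)
    return (t @ c.T).max(dim=-1).values                # (T,)

def spatial_entropy_score(attn_maps: torch.Tensor) -> torch.Tensor:
    """S_I: attention concentration, measured as one minus the normalized entropy
    of each token's attention over patches (an assumed form). attn_maps: (T, h, w)."""
    p = attn_maps.flatten(1)
    p = p / (p.sum(dim=-1, keepdim=True) + 1e-8)       # per-token distribution
    ent = -(p * (p + 1e-8).log()).sum(dim=-1)          # entropy, up to log(N)
    return 1.0 - ent / torch.log(torch.tensor(float(p.shape[-1])))  # (T,), in [0, 1]

def token_scores(token_embs, concept_embs, attn_maps, alpha: float = 0.5):
    """Weighted combination of both criteria; alpha is an assumed hyperparameter."""
    return alpha * semantic_relevance(token_embs, concept_embs) \
        + (1 - alpha) * spatial_entropy_score(attn_maps)
```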
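
Likewise, the joint reward driving CGRO (Figure 2) might be assembled as below. The IoU-based consistency term, the `<think>/<answer>` template for the format check, and the reward weights are illustrative assumptions; only the three-part structure (reasoning–localization consistency plus R1-style accuracy and format rewards) comes from the caption.

```python
import re
import torch

def consistency_reward(anomaly_map: torch.Tensor,
                       token_attn: torch.Tensor,
                       thresh: float = 0.5) -> float:
    """Reasoning-localization consistency: IoU (an assumed choice) between the
    aggregated anomaly map and a selected token's attention, both in [0, 1]."""
    a = anomaly_map > thresh
    b = token_attn > thresh
    inter = (a & b).sum().float()
    union = (a | b).sum().float()
    return (inter / (union + 1e-8)).item()

def accuracy_reward(pred_answer: str, gt_label: str) -> float:
    """R1-style accuracy reward: 1 if the final answer matches the image-level label."""
    return float(pred_answer.strip().lower() == gt_label.strip().lower())

def format_reward(response: str) -> float:
    """R1-style format reward: response follows a <think>...</think><answer>...</answer>
    template (the exact template is an assumption)."""
    return float(bool(re.fullmatch(r"(?s)<think>.*</think>\s*<answer>.*</answer>",
                                   response.strip())))

def cgro_reward(anomaly_map, token_attn, response, pred_answer, gt_label,
                w=(1.0, 1.0, 0.5)) -> float:
    """Joint reward for policy optimization; the weights w are assumed."""
    return (w[0] * accuracy_reward(pred_answer, gt_label)
            + w[1] * consistency_reward(anomaly_map, token_attn)
            + w[2] * format_reward(response))
```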