LAD-Reasoner: Tiny Multimodal Models are Good Reasoners for Logical Anomaly Detection

Weijia Li, Guanglei Chu, Jiong Chen, Guo-Sen Xie, Caifeng Shan, Fang Zhao

TL;DR

This paper introduces the task of Reasoning Logical Anomaly Detection (RLAD) and presents LAD-Reasoner, a tiny 3B multimodal model built on Qwen2.5-VL and trained in two stages: Supervised Fine-Tuning (SFT) for visual perception, followed by Group Relative Policy Optimization (GRPO) for structured reasoning. The approach yields accurate anomaly detection with human-readable rationales and matches the accuracy and F1 score of Qwen2.5-VL-72B on MVTec LOCO AD while using far fewer parameters. Because the GRPO rewards are derived from detection accuracy and output structure, no hand-crafted CoT data is needed, and the single-model design gives transparent reasoning suited to industrial deployment. The work shows that compact multimodal models can reason about logical anomalies with interpretable outputs, reducing reliance on large pipelines and external reasoning modules.

Abstract

Recent advances in industrial anomaly detection have highlighted the need for deeper logical anomaly analysis, where unexpected relationships among objects, counts, and spatial configurations must be identified and explained. Existing approaches often rely on large-scale external reasoning modules or elaborate pipeline designs, hindering practical deployment and interpretability. To address these limitations, we introduce a new task, Reasoning Logical Anomaly Detection (RLAD), which extends traditional anomaly detection by incorporating logical reasoning. We propose a new framework, LAD-Reasoner, a customized tiny multimodal language model built on Qwen2.5-VL 3B. Our approach leverages a two-stage training paradigm that first employs Supervised Fine-Tuning (SFT) for fine-grained visual understanding, followed by Group Relative Policy Optimization (GRPO) to refine logical anomaly detection and enforce coherent, human-readable reasoning. Crucially, reward signals are derived from both the detection accuracy and the structural quality of the outputs, obviating the need for building chain of thought (CoT) reasoning data. Experiments on the MVTec LOCO AD dataset show that LAD-Reasoner, though significantly smaller, matches the performance of Qwen2.5-VL-72B in accuracy and F1 score, and further excels in producing concise and interpretable rationales. This unified design reduces reliance on large models and complex pipelines, while offering transparent and interpretable insights into logical anomaly detection. Code and data will be released.
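The abstract states that reward signals come from both detection accuracy and the structural quality of the output. A minimal sketch of such a reward is given below. It is an illustration under stated assumptions, not the authors' implementation: the `<think>…</think><answer>…</answer>` template, the Yes/No answer vocabulary, the function names, and the equal weights are all hypothetical.

```python
import re

# Assumed output template (hypothetical): <think>...</think><answer>Yes|No</answer>
_TEMPLATE = re.compile(
    r"^\s*<think>(?P<think>.+?)</think>\s*<answer>(?P<answer>.+?)</answer>\s*$",
    re.DOTALL,
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.match(response) else 0.0


def accuracy_reward(response: str, is_anomalous: bool) -> float:
    """Return 1.0 if the parsed Yes/No answer agrees with the label, else 0.0."""
    match = _TEMPLATE.match(response)
    if match is None:
        return 0.0  # no parsable answer, so no accuracy credit
    answer = match.group("answer").strip().lower()
    if answer not in {"yes", "no"}:
        return 0.0
    return 1.0 if (answer == "yes") == is_anomalous else 0.0


def total_reward(
    response: str,
    is_anomalous: bool,
    format_weight: float = 1.0,    # assumed weight, not taken from the paper
    accuracy_weight: float = 1.0,  # assumed weight, not taken from the paper
) -> float:
    """Weighted sum of the structural and accuracy rewards."""
    return (
        format_weight * format_reward(response)
        + accuracy_weight * accuracy_reward(response, is_anomalous)
    )
```

Both terms can be checked automatically from the response text and the image label, which is why a reward of this kind needs no annotated reasoning traces.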

Paper Structure

This paper contains 20 sections, 9 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: Overview of the task definition and comparison among existing traditional methods, MLLM-based methods, and our proposed LAD-Reasoner. While prior approaches fail to provide human-interpretable reasoning for anomaly detection, LAD-Reasoner delivers both accurate predictions and a readable reasoning process.
  • Figure 2: Examples of input–output pairs used for SFT. Each sample consists of a question as a prompt, an image, and a corresponding answer that mainly describes the anomaly.
  • Figure 3: The architecture of LAD-Reasoner. The training process consists of two stages. In the first stage, applying SFT to the base MLLM leads to improved visual detail understanding. In the second stage, the policy model is optimized based on verified rewards and the KL divergence penalty, enabling it to generate outputs that conform to a predefined structure and yield accurate final predictions.
  • Figure 4: Visualization of the inference results produced by LAD-Reasoner. For each subclass in the MVTec LOCO AD dataset, a representative test case is presented, including a reference image, a query image, and a natural language prompt inquiring whether an anomaly is present. The model responds with a thinking process (shown in italics) followed by a binary decision (shown in bold). For clarity of presentation, the original <think></think> and <answer></answer> tags are omitted.
  • Figure 5: The response lengths of models with and without SFT during the GRPO stage. The upper curve corresponds to the model with SFT, while the lower curve represents the model without SFT.
  • ...and 2 more figures
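
The Figure 3 caption describes a second stage in which the policy is optimized from verified rewards under a KL divergence penalty. The sketch below illustrates only the group-relative step that gives GRPO its name: several responses are sampled for one prompt, and each reward is normalized against the group's mean and standard deviation. The function name, the use of the population standard deviation, and the `eps` value are assumptions made for illustration; the KL penalty and the policy-gradient loss are not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    # eps keeps the division defined when every response earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt
advantages = group_relative_advantages([2.0, 1.0, 0.0, 1.0])
```

Responses that score above the group mean receive a positive advantage and those below it a negative one, so no separate value model is required to form the baseline.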