Interpretable and Reliable Detection of AI-Generated Images via Grounded Reasoning in MLLMs

Yikun Ji, Hong Yan, Jun Lan, Huijia Zhu, Weiqiang Wang, Qi Fan, Liqing Zhang, Jianfu Zhang

TL;DR

This work addresses the need for interpretable detection of AI-generated images by leveraging grounded reasoning in multi-modal LLMs (MLLMs). It introduces FakeXplained, a large dataset with region-level annotations and captions that enables visual-textual grounding, and presents a two-stage fine-tuning pipeline (SFT followed by GRPO-based RLHF) that yields accurate fake detection together with explainable, localized reasoning. The approach achieves state-of-the-art classification and grounding metrics, produces explanations that align closely with human judgment, and remains robust under distribution shifts. The framework advances trustworthy AI by pairing quantitative detection performance with human-understandable, region-specific justifications for why an image is fake.

Abstract

The rapid advancement of image generation technologies intensifies the demand for interpretable and robust detection methods. Although existing approaches often attain high accuracy, they typically operate as black boxes without providing human-understandable justifications. Multi-modal Large Language Models (MLLMs), while not originally intended for forgery detection, exhibit strong analytical and reasoning capabilities. When properly fine-tuned, they can effectively identify AI-generated images and offer meaningful explanations. However, existing MLLMs still struggle with hallucination and often fail to align their visual interpretations with actual image content and human reasoning. To bridge this gap, we construct a dataset of AI-generated images annotated with bounding boxes and descriptive captions that highlight synthesis artifacts, establishing a foundation for human-aligned visual-textual grounded reasoning. We then fine-tune MLLMs through a multi-stage optimization strategy that progressively balances the objectives of accurate detection, visual localization, and coherent textual explanation. The resulting model achieves superior performance in both detecting AI-generated images and localizing visual flaws, significantly outperforming baseline methods.
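
The abstract describes localizing visual flaws with bounding boxes, and the figure list below mentions an IoU metric. As a rough illustration of how a predicted box can be scored against an annotated one, the sketch below computes the standard intersection-over-union. The box layout `(x_min, y_min, x_max, y_max)` and the function name `box_iou` are assumptions made for this example, not details taken from the paper.

```python
from typing import Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, target: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes (hypothetical helper)."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(pred[0], target[0])
    iy_min = max(pred[1], target[1])
    ix_max = min(pred[2], target[2])
    iy_max = min(pred[3], target[3])

    # Clamp to zero so disjoint boxes contribute no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_target = max(0.0, target[2] - target[0]) * max(0.0, target[3] - target[1])
    union = area_pred + area_target - inter

    # Two degenerate (zero-area) boxes are treated as having no overlap.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes sharing a 1x1 corner: 1 / (4 + 4 - 1).
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A score near 1 means the predicted region closely matches the annotated artifact region, while 0 means the two do not overlap at all. How the paper aggregates such scores across multiple boxes per image is not stated in this excerpt.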

Paper Structure

This paper contains 53 sections, 5 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: An overview of our method.
  • Figure 1: Weights for different GRPO stages used in the training pipeline.
  • Figure 2: Human preference matrix.
  • Figure 3: Accuracy, IoU metric (upper), loss and reward curves (lower) of the model during the training process.
  • Figure 4: A sample user query and the corresponding model output (visualized).
  • ...and 2 more figures