Seeing Before Reasoning: A Unified Framework for Generalizable and Explainable Fake Image Detection

Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Junyan Ye, Ke-Yue Zhang, Yue Zhou, Peng Jin, Bin Li, Taiping Yao, Shouhong Ding

TL;DR

Detecting AI-generated images with Multimodal LLMs is hampered by a perception-to-reasoning mismatch and brittle fine-tuning practices. The authors propose a seeing-before-reasoning paradigm implemented in Forensic-Chat, featuring a Visual Enhancement stage to strengthen artifact-aware perception and a Dialectical Fine-Tuning stage with multi-turn reasoning to resist shortcut learning and preserve pretrained knowledge. They accompany their method with ExplainFake-Bench, a dedicated benchmark for evaluating explainability across correctness, specificity, logical consistency, factual accuracy, and instruction following. Across diverse benchmarks and real-world distortions, Forensic-Chat achieves state-of-the-art generalization, reliable explanations, and robust knowledge preservation, all within a single MLLM without external detectors.

Abstract

Detecting AI-generated images with multimodal large language models (MLLMs) has gained increasing attention, owing to their rich world knowledge, common-sense reasoning, and potential for explainability. However, naively applying these MLLMs to detection often leads to suboptimal performance. We argue that the root of this failure lies in a fundamental mismatch: MLLMs are asked to reason about fakes before they can truly see them. First, they do not really see: the vision encoders of existing MLLMs are primarily optimized for semantic-oriented recognition rather than the perception of low-level signals, leaving them insensitive to subtle forgery traces. Without access to reliable perceptual evidence, the model bases its judgment on incomplete and limited visual observations. Second, existing fine-tuning data for detection typically uses narrow, instruction-style formats, which diverge sharply from the diverse, heterogeneous distributions seen in pretraining. In the absence of meaningful visual cues, the model therefore exploits these linguistic shortcuts, resulting in catastrophic forgetting of pretrained knowledge (even basic dialogue capabilities). In response, we advocate a new paradigm: seeing before reasoning. We propose that MLLMs should first be trained to perceive artifacts, strengthening their artifact-aware visual perception, so that subsequent reasoning is grounded in actual observations. We therefore propose Forensic-Chat, a generalizable, explainable, and still-conversational (for multi-round dialogue) assistant for fake image detection. We also propose ExplainFake-Bench, a benchmark tailored to evaluating an MLLM's explainability in image forensics along five key aspects. Extensive experiments demonstrate Forensic-Chat's superior generalization and genuinely reliable explainability.

Paper Structure

This paper contains 33 sections, 1 equation, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Illustration of a key limitation of existing MLLM-based detectors: models trained specifically for detection fail to follow even basic instructions. Moreover, the baseline produces nearly identical responses across different questions, even those unrelated to forensics. This undermines the reliability of the MLLM's explanations, as they lack fundamental instruction-following capabilities. In contrast, our proposed method supports conversational multi-round interaction and provides more consistent, trustworthy explanations to the users while achieving SOTA performance in generalization and robustness.
  • Figure 2: The overall pipeline of our method. In Stage 1, we exclusively fine-tune the parameters of the Vision Encoder, while in the subsequent stages, we only optimize the LLM.
  • Figure 3: A detailed illustration of Stage 2 of our framework, where we introduce a dialectical fine-tuning strategy that contrasts externally detected fake clues with internal common-sense and world knowledge. By weighing conflicting signals, the model enhances robustness against deception while preserving pretrained knowledge for reliable reasoning.
  • Figure 4: Performance (Acc (%)) across different input resolutions on GenImage.
  • Figure 5: Impact of dialogue turns on data alignment with the pre-trained Qwen2.5-VL-7B. As the number of turns increases, both Negative Log Likelihood (NLL) and Perplexity decrease, suggesting multi-turn dialogues are more consistent with the model's inherent knowledge.
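
The Figure 5 caption reports Negative Log Likelihood (NLL) and Perplexity together. The caption does not say how either was computed, but under the standard definition perplexity is the exponential of the mean per-token NLL, so the two always move in the same direction. The sketch below assumes that definition; the function names and the example log-probabilities are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Mean negative log-likelihood over tokens (natural log)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity under the standard definition: exp(mean token NLL)."""
    return math.exp(mean_nll(token_logprobs))


# Made-up per-token log-probabilities, for illustration only.
# A sequence the model finds more predictable has higher token
# probabilities, hence lower NLL and lower perplexity.
less_predictable = [math.log(0.25)] * 8
more_predictable = [math.log(0.5)] * 8

for name, lp in [("less predictable", less_predictable),
                 ("more predictable", more_predictable)]:
    print(f"{name}: NLL={mean_nll(lp):.4f}, PPL={perplexity(lp):.4f}")
```

In practice the per-token log-probabilities would come from scoring each fine-tuning sample with the pretrained model; that scoring step is omitted here because the source does not describe it.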