
Process Over Outcome: Cultivating Forensic Reasoning for Generalizable Multimodal Manipulation Detection

Yuchen Zhang, Yaxiong Wang, Kecheng Han, Yujiao Wu, Lianwei Wu, Li Zhu, Zhedong Zheng

TL;DR

The paper argues that generalizable manipulation detection requires explicit forensic reasoning: merely classifying a limited set of manipulation types fails to generalize to unseen manipulation patterns.

Abstract

Recent advances in generative AI have significantly enhanced the realism of multimodal media manipulation, thereby posing substantial challenges to manipulation detection. Existing manipulation detection and grounding approaches predominantly focus on manipulation type classification under result-oriented supervision, which not only lacks interpretability but also tends to overfit superficial artifacts. In this paper, we argue that generalizable detection requires incorporating explicit forensic reasoning, rather than merely classifying a limited set of manipulation types, which fails to generalize to unseen manipulation patterns. To this end, we propose REFORM, a reasoning-driven framework that shifts learning from outcome fitting to process modeling. REFORM adopts a three-stage curriculum that first induces forensic rationales, then aligns reasoning with final judgments, and finally refines logical consistency via reinforcement learning. To support this paradigm, we introduce ROM, a large-scale dataset with rich reasoning annotations. Extensive experiments show that REFORM establishes new state-of-the-art performance with superior generalization, achieving 81.52% ACC on ROM, 76.65% ACC on DGM4, and 74.9 F1 on MMFakeBench.

Paper Structure (48 sections, 12 equations, 17 figures, 11 tables)


Figures (17)

  • Figure 1: Comparison between learning paradigms. (a) The prevailing Result-Oriented Supervision usually suffers from poor generalization by merely fitting statistical artifacts of training data. (b) Our Reasoning-Driven Optimization facilitates robust generalization by explicitly optimizing the forensic reasoning chain, enabling the model to uncover intrinsic inconsistencies effectively across unseen domains.
  • Figure 2: Overview of the ROM dataset. Left: Representative samples spanning 9 manipulated and 1 real categories, ranging from face-centric edits to scene-level synthesis, each accompanied by a detailed reasoning annotation. Right: Statistical distribution showing the diversity of manipulation types and the coverage of news media domains.
  • Figure 3: Probability Density of Token Count for Answer and Reasoning.
  • Figure 4: Overview of the REFORM framework and its three-stage training curriculum. (a) The primary pipeline employs a Cognitive Priming Encoder $\mathcal{E}_p$ and a Dual-Decoder structure, $\mathcal{D}_r$ and $\mathcal{D}_a$, for reasoning-driven detection. (b) Cognitive Reasoning Warm-up via partial freezing. (c) Reasoning-Endowed Joint Fine-Tuning incorporating the Reason-Answer Consistency Loss $\mathcal{L}_{RAC}$. (d) Constraint-Aware Policy Refinement using GRPO-based Reinforcement Learning to align forensic logic with the final verdict.
  • Figure 5: The user interface of the human evaluation study, in which each participant is shown pairs of news images and captions and asked to determine whether each pair is manipulated.
  • ...and 12 more figures
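
Figure 4(d) names GRPO-based reinforcement learning as the final training stage. The core of GRPO is to score several sampled responses for the same input and normalize each reward against its own group, so no separate value model is needed. The sketch below illustrates only that normalization step. The reward function `toy_reward`, its weights (1.0 and 0.5), and the sample data are hypothetical and are not taken from the paper, which does not specify its reward design on this page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (the group-relative step of GRPO)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_reward(verdict_in_reasoning, final_answer, gold_label):
    """Hypothetical reward, NOT the paper's: correctness plus reasoning/answer agreement."""
    correct = 1.0 if final_answer == gold_label else 0.0
    consistent = 0.5 if verdict_in_reasoning == final_answer else 0.0
    return correct + consistent


# Four hypothetical sampled responses for one image-caption pair whose gold label is "fake".
# Each tuple is (verdict implied by the reasoning text, final answer).
samples = [("fake", "fake"), ("real", "fake"), ("real", "real"), ("fake", "real")]
rewards = [toy_reward(reasoning, answer, "fake") for reasoning, answer in samples]
advantages = group_relative_advantages(rewards)
```

Responses that are both correct and internally consistent receive the largest positive advantage within their group, while a group in which every response earns the same reward yields zero advantage for all of them and so contributes no learning signal.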