ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection
Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
TL;DR
ForgerySleuth advances image manipulation detection (IMD) by integrating a multimodal large language model (M-LLM) with a trace encoder and a fusion-based vision decoder to produce dense tampering masks and structured explanatory reasoning. It introduces the ForgeryAnalysis dataset, built with Chain-of-Clues prompts, and a larger ForgeryAnalysis-PT pretraining set produced by a dedicated data engine, addressing M-LLM hallucinations and explainability gaps. Empirical results show strong localization performance and superior forgery analysis quality across multiple benchmarks, with robust behavior under distortions. The work provides a practical framework and open resources to promote research on interpretable, generalizable IMD with M-LLMs.
Abstract
Multimodal large language models (M-LLMs) have unlocked new possibilities for various multimodal tasks. However, their potential in image manipulation detection (IMD) remains unexplored. When directly applied to the IMD task, M-LLMs often produce reasoning texts that suffer from hallucinations and overthinking. To address this, we propose ForgerySleuth, which leverages M-LLMs to perform comprehensive clue fusion and generate segmentation outputs indicating the specific regions that have been tampered with. Moreover, we construct the ForgeryAnalysis dataset, which includes analysis and reasoning text, through the Chain-of-Clues prompt to upgrade the IMD task. A data engine is also introduced to build a larger-scale dataset for the pre-training phase. Our extensive experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in generalization, robustness, and explainability.
