ForgerySleuth: Empowering Multimodal Large Language Models for Image Manipulation Detection

Zhihao Sun, Haoran Jiang, Haoran Chen, Yixin Cao, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

TL;DR

ForgerySleuth advances image manipulation detection (IMD) by combining a multimodal large language model (M-LLM) with a trace encoder and a fusion-based vision decoder, producing dense tampering masks alongside structured explanatory reasoning. It introduces the ForgeryAnalysis dataset, built with Chain-of-Clues prompts, and a larger ForgeryAnalysis-PT pre-training set generated by a dedicated data engine, targeting the hallucination and explainability gaps of M-LLMs. Reported results show strong localization performance and higher-quality forgery analysis across multiple benchmarks, with robust behavior under distortions. The work offers a practical framework and open resources for research on interpretable, generalizable IMD with M-LLMs.

Abstract

Multimodal large language models (M-LLMs) have unlocked new possibilities for various multimodal tasks. However, their potential in image manipulation detection (IMD) remains unexplored. When directly applied to the IMD task, M-LLMs often produce reasoning texts that suffer from hallucinations and overthinking. To address this, we propose ForgerySleuth, which leverages M-LLMs to perform comprehensive clue fusion and generate segmentation outputs indicating the specific regions that have been tampered with. Moreover, we construct the ForgeryAnalysis dataset through the Chain-of-Clues prompt; it includes analysis and reasoning text that enriches the image manipulation detection task. A data engine is also introduced to build a larger-scale dataset for the pre-training phase. Our extensive experiments demonstrate the effectiveness of ForgeryAnalysis and show that ForgerySleuth significantly outperforms existing methods in generalization, robustness, and explainability.

Paper Structure

This paper contains 35 sections, 9 equations, 18 figures, 13 tables.

Figures (18)

  • Figure 1: Performance and comparison of existing M-LLMs on the image manipulation detection task. Our ForgerySleuth assistant provides explanatory analysis with Chain-of-Clues and demonstrates the best forgery analysis capabilities.
  • Figure 2: ForgeryAnalysis Dataset Construction Pipeline. Our pipeline begins with (a) GPT-4o generating initial analyses for manipulated images with annotated regions, followed by human expert review. The refined analyses are organized into (b) the Chain-of-Clues format. This human-curated ForgeryAnalysis (2k) dataset is used to train a data engine. Finally, (c) this data engine generates ForgeryAnalysis-PT, a larger-scale dataset for model pre-training.
  • Figure 3: Framework of ForgerySleuth. Given an image $\mathbf{x}_\text{img}$ and a prompt query $\mathbf{x}_{p}$, the M-LLM $\mathcal{F}_{m}$ detects high-level semantic anomalies and generates a textual output $\hat{\mathbf{T}}$. The trace encoder $\mathcal{F}_{t}$ captures low-level, semantic-agnostic features. The vision decoder $\mathcal{D}$ fuses the vision embeddings with the prompt embedding corresponding to the [SEG] token to generate the segmentation mask $\hat{\mathbf{M}}$. LoRA is used to fine-tune the trainable modules.
  • Figure 4: Visualization results comparing ForgerySleuth with existing methods. The examples are taken from various datasets.
  • Figure 5: Forgery analysis results of the ablation study.
  • ...and 13 more figures
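
The data flow in the Figure 3 caption (two feature streams fused, then combined with the [SEG] prompt embedding to yield a mask) can be illustrated with a toy sketch. This is not the authors' implementation: the paper's vision decoder is a learned network, whereas the sketch assumes element-wise addition as the fusion step and a per-pixel dot product with the [SEG] embedding followed by a sigmoid threshold. The names `fuse_features` and `decode_mask`, the feature shapes, and all numeric values are hypothetical.

```python
import math
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [row][col][channel]


def fuse_features(semantic: FeatureMap, trace: FeatureMap) -> FeatureMap:
    """Fuse two feature maps by element-wise addition (an assumed, simplest-case fusion)."""
    return [
        [[s + t for s, t in zip(s_vec, t_vec)] for s_vec, t_vec in zip(s_row, t_row)]
        for s_row, t_row in zip(semantic, trace)
    ]


def decode_mask(
    fused: FeatureMap, seg_embedding: Vector, threshold: float = 0.5
) -> List[List[bool]]:
    """Score each pixel against the [SEG] embedding and threshold the sigmoid output."""
    mask = []
    for row in fused:
        mask_row = []
        for vec in row:
            logit = sum(f * e for f, e in zip(vec, seg_embedding))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(prob > threshold)
        mask.append(mask_row)
    return mask


# Toy 3x3 input with 2 channels: the semantic stream is neutral, and the
# trace stream marks only the centre pixel as anomalous (made-up values).
semantic = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
trace = [[[-1.0, 0.0] for _ in range(3)] for _ in range(3)]
trace[1][1] = [1.0, 0.0]
seg_embedding = [4.0, 0.0]  # hypothetical stand-in for the [SEG] token embedding

mask = decode_mask(fuse_features(semantic, trace), seg_embedding)
for mask_row in mask:
    print("".join("#" if cell else "." for cell in mask_row))
```

In this toy input only the centre pixel is flagged, because it is the only location where the fused feature aligns with the assumed [SEG] embedding. A learned decoder would replace both the addition and the dot product with trained layers.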