Unlocking the Forgery Detection Potential of Vanilla MLLMs: A Novel Training-Free Pipeline

Rui Zuo, Qinyue Tong, Zhe-Ming Lu, Ziqian Lu

TL;DR

This work tackles the generalization gap in image forgery detection and localization (IFDL) by proposing Foresee, a training-free pipeline built on vanilla multimodal large language models (MLLMs). Foresee combines a type-prior reasoning approach, a copy-move-focused Flexible Feature Detector, and MLLM-guided inference with GroundingDINO and SAM for precise localization, achieving superior localization accuracy and richer textual explanations without task-specific training. The method demonstrates strong generalization across tampering types (copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing) and robust interpretability, validated on multiple datasets against strong baselines. Overall, Foresee reveals the inherent generalization potential of vanilla MLLMs for forensics, offering a lightweight, scalable, and interpretable IFDL solution with practical deployment advantages.

Abstract

With the rapid advancement of artificial intelligence-generated content (AIGC) technologies, including multimodal large language models (MLLMs) and diffusion models, image generation and manipulation have become remarkably effortless. Existing image forgery detection and localization (IFDL) methods often struggle to generalize across diverse datasets and offer limited interpretability. MLLMs now demonstrate strong generalization potential across diverse vision-language tasks, and some studies bring this capability to IFDL via large-scale training. However, such approaches consume considerable computational resources and fail to reveal the inherent generalization potential of vanilla MLLMs for this problem. Inspired by this observation, we propose Foresee, a training-free MLLM-based pipeline tailored for image forgery analysis. It eliminates the need for additional training and enables a lightweight inference process, while surpassing existing MLLM-based methods in both tamper localization accuracy and the richness of textual explanations. Foresee employs a type-prior-driven strategy and uses a Flexible Feature Detector (FFD) module to handle copy-move manipulations specifically, thereby unleashing the potential of vanilla MLLMs in the forensic domain. Extensive experiments demonstrate that our approach simultaneously achieves superior localization accuracy and provides more comprehensive textual explanations. Moreover, Foresee exhibits stronger generalization capability, outperforming existing IFDL methods across various tampering types, including copy-move, splicing, removal, local enhancement, deepfake, and AIGC-based editing. The code will be released in the final version.

Paper Structure

This paper contains 13 sections, 5 equations, 4 figures, 8 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Comparison of our Foresee pipeline with other state-of-the-art MLLM-based methods and vanilla MLLMs in terms of training overhead, deployment burden, and inference performance. Foresee surpasses existing MLLM-based tamper detection methods by operating without any training, requiring fewer computational and inference resources, providing more precise localization, and delivering richer textual explanations. Notably, vanilla MLLMs achieve a significant improvement in localization performance under our pipeline.
  • Figure 2: Qualitative comparison between vanilla MLLMs and our Foresee pipeline on splicing and copy-move cases in the IFDL task, illustrating Foresee’s improved localization accuracy.
  • Figure 3: Overview of the proposed Foresee pipeline. The input image $\mathbf{I}_{ori}$ and classification prompt $\mathbf{P}_{cls}$ are processed by $\operatorname{MLLM}$ to predict the classification label $\mathbf{C}$. If $\mathbf{C}$ indicates copy-move, the Flexible Feature Detector $\mathcal{G}_{dt}$ generates a hint image $\mathbf{I}_{hint}$. Then, $\mathbf{I}_{ori}$, $\mathbf{I}_{hint}$ (if available), and a task-specific prompt $\mathbf{P}_{sel}$ are fed into $\operatorname{MLLM}$ to produce a textual explanation $\mathbf{T}_{exp}$ and a concise description $\mathbf{T}_{desc}$ of the tampered region. Finally, $\mathbf{T}_{desc}$ guides the Grounded Segmentation Module $\mathcal{G}_{gs}$ to generate the tampering mask $\mathbf{M}_{seg}$.
  • Figure 4: Visual comparison of predicted forgery masks from different methods on representative manipulated images. Ground truth masks and results from several networks are shown for qualitative assessment.
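
The control flow described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustrative outline, not the authors' implementation: the function and class names (`run_foresee`, `ForeseeOutput`), the callable signatures, the `"copy-move"` label string, and the idea that $\mathbf{P}_{sel}$ is looked up from the predicted label are all assumptions made for the example. The MLLM, the Flexible Feature Detector, and the grounded segmentation module (GroundingDINO plus SAM in the paper) are passed in as plain callables so that no particular library API is implied.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple

Image = Any  # placeholder type for an image (assumption)
Mask = Any   # placeholder type for a segmentation mask (assumption)


@dataclass
class ForeseeOutput:
    """Hypothetical container for the outputs named in the Figure 3 caption."""
    label: str         # C: predicted tampering type
    explanation: str   # T_exp: textual explanation
    description: str   # T_desc: concise description of the tampered region
    mask: Mask         # M_seg: predicted tampering mask
    hint_used: bool    # whether a hint image I_hint was generated


def run_foresee(
    image: Image,
    cls_prompt: str,
    select_prompt: Callable[[str], str],
    mllm_classify: Callable[[Image, str], str],
    mllm_explain: Callable[[Image, Optional[Image], str], Tuple[str, str]],
    feature_detector: Callable[[Image], Image],
    grounded_segment: Callable[[Image, str], Mask],
    copy_move_label: str = "copy-move",  # assumed label string
) -> ForeseeOutput:
    # Step 1: the MLLM predicts the tampering type C from I_ori and P_cls.
    label = mllm_classify(image, cls_prompt)

    # Step 2: only for copy-move, the feature detector produces a hint image.
    hint = feature_detector(image) if label == copy_move_label else None

    # Step 3: pick a task-specific prompt P_sel (assumed to depend on C),
    # then query the MLLM again for the explanation and region description.
    sel_prompt = select_prompt(label)
    explanation, description = mllm_explain(image, hint, sel_prompt)

    # Step 4: the region description guides grounded segmentation to a mask.
    mask = grounded_segment(image, description)

    return ForeseeOutput(label, explanation, description, mask, hint is not None)
```

Passing the components as callables keeps the sketch independent of any specific MLLM or segmentation backend. How the pipeline treats an image classified as authentic is not stated in the caption, so the sketch does not handle that case.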