ManipShield: A Unified Framework for Image Manipulation Detection, Localization and Explanation

Zitong Xu, Huiyu Duan, Xiaoyu Wang, Zhaolin Cai, Kaiwei Zhang, Qiang Hu, Jing Liu, Xiongkuo Min, Guangtao Zhai

TL;DR

ManipShield addresses weaknesses in existing image manipulation detection by unifying detection, localization, and explanation within a multimodal large language model framework. It builds on ManipBench, a large-scale dataset of over 450K AI-edited images from 25 editing models across 12 manipulation categories, of which 100K are further annotated with bounding boxes, judgment cues, and textual explanations. The model applies contrastive LoRA fine-tuning to the vision encoder, uses Layer Discrimination Selection to pick the LLM layer that best separates manipulated from real images, and passes that layer's hidden state to three decoders that output the detection result, the judgment cues, and the bounding boxes. Experiments on ManipBench and several public datasets show state-of-the-art performance and strong generalization to unseen editing models.

Abstract

With the rapid advancement of generative models, powerful image editing methods now enable diverse and highly realistic image manipulations that far surpass traditional deepfake techniques, posing new challenges for manipulation detection. Existing image manipulation detection and localization (IMDL) benchmarks suffer from limited content diversity, narrow generative-model coverage, and insufficient interpretability, which hinder the generalization and explanation capabilities of current manipulation detection methods. To address these limitations, we introduce ManipBench, a large-scale benchmark for image manipulation detection and localization focusing on AI-edited images. ManipBench contains over 450K manipulated images produced by 25 state-of-the-art image editing models across 12 manipulation categories, among which 100K images are further annotated with bounding boxes, judgment cues, and textual explanations to support interpretable detection. Building upon ManipBench, we propose ManipShield, an all-in-one model based on a Multimodal Large Language Model (MLLM) that leverages contrastive LoRA fine-tuning and task-specific decoders to achieve unified image manipulation detection, localization, and explanation. Extensive experiments on ManipBench and several public datasets demonstrate that ManipShield achieves state-of-the-art performance and generalizes well to unseen manipulation models. Both ManipBench and ManipShield will be released upon publication.
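
As a rough illustration of the contrastive idea behind the fine-tuning step, the sketch below implements a standard margin-based pairwise contrastive loss in plain Python. The abstract does not state which loss ManipShield actually uses, so the function names, the Euclidean distance, and the margin value are assumptions made for illustration only, and LoRA itself is not modeled.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings.

    Assumption: this is the classic pairwise formulation, not necessarily
    the loss used in the paper. Pairs from the same class (both real or
    both manipulated) are pulled together; a real/manipulated pair is
    pushed apart until its distance reaches `margin`.
    """
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical 2-D embeddings of a real image and its edited counterpart.
real_emb = [0.0, 0.0]
manip_emb = [3.0, 4.0]
loss = pairwise_contrastive_loss(real_emb, manip_emb, same_class=False, margin=2.0)
```

In this toy pair the two embeddings are already farther apart than the margin, so the loss contributes nothing; a real/manipulated pair that collapsed to the same point would instead be penalized by the squared margin.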

Paper Structure

This paper contains 40 sections, 13 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: An overview of the constructed image manipulation database, ManipBench, and the proposed image manipulation detection model, ManipShield. (a) We first collect 22K real-world images and produce numerous editing prompts across 12 manipulation categories using a multimodal large language model. Then 25 recent image editing models are applied to generate 455K manipulated images. (b) 100K images are further annotated with localization and explanation information. (c) We design ManipShield for image manipulation detection, localization, and explanation. (d) We perform model comparisons based on ManipBench.
  • Figure 2: Feature distribution of ManipBench. (a) Manipulated images. (b) Real images. Manipulated images exhibit decreased spatial information (SI) but increased colorfulness and contrast.
  • Figure 3: An overview of ManipShield. First, manipulated and real images are paired to train the vision encoder through contrastive LoRA fine-tuning. Then, we freeze the vision encoder and feed the projected image features and an assistant prompt into the LLM. The layer discrimination selection (LDS) module then identifies the LLM layer that best separates positive and negative samples. The hidden state from this layer is passed through three decoders: a manipulation detection decoder for classification, an unrealistic attribute decoder for judgment-cue extraction, and a region localization decoder for bounding-box prediction. The outputs are integrated into a structured prompt, which, together with the image, is used to generate an explicit explanatory analysis.
  • Figure 4: Plots of KL divergence, LDR, information entropy, and saliency score across hidden states from different LLM layers.
  • Figure 5: Examples of bounding box prediction results.
  • ...and 8 more figures
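
The layer selection step described in the Figure 3 caption (pick the LLM layer whose hidden states best separate positive and negative samples) can be sketched as follows. This is a minimal sketch under stated assumptions: the caption does not give the actual LDS criterion, so a Fisher-style ratio of between-class to within-class scatter stands in for it here, and all function names and the toy data are hypothetical.

```python
def class_mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def within_scatter(vectors, mean):
    """Average squared distance of the vectors from their class mean."""
    total = sum(sum((x - m) ** 2 for x, m in zip(v, mean)) for v in vectors)
    return total / len(vectors)


def separation_score(pos, neg, eps=1e-8):
    """Fisher-style separability score (an assumed stand-in for LDS)."""
    mean_pos, mean_neg = class_mean(pos), class_mean(neg)
    between = sum((a - b) ** 2 for a, b in zip(mean_pos, mean_neg))
    within = within_scatter(pos, mean_pos) + within_scatter(neg, mean_neg)
    return between / (within + eps)


def select_layer(pos_by_layer, neg_by_layer):
    """Return (index of the best-separating layer, per-layer scores)."""
    scores = [separation_score(p, n) for p, n in zip(pos_by_layer, neg_by_layer)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy hidden states for two layers, two samples per class, dimension 2.
# Layer 0: the classes share the same mean; layer 1: the classes are far apart.
pos_by_layer = [[[0.0, 0.0], [1.0, 1.0]], [[5.0, 5.0], [5.0, 6.0]]]
neg_by_layer = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
best_layer, layer_scores = select_layer(pos_by_layer, neg_by_layer)
```

On the toy data the second layer is selected because its class means are far apart relative to the spread within each class. In a real pipeline the per-layer vectors would be hidden states extracted from the LLM for manipulated and real images, and the selected layer's hidden state would then feed the three decoders.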