EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM

Quang Nguyen; Truong Vu; Trong-Tung Nguyen; Yuxin Wen; Preston K Robinette; Taylor T Johnson; Tom Goldstein; Anh Tran; Khoi Nguyen

TL;DR

This work tackles the challenging problem of localizing forged regions produced by diffusion-based image edits, which evade traditional forensic cues. It introduces EditScout, a two-module framework in which a multimodal large language model (MLLM) generates a reasoning query from the input image and a prompt; a segmentation head (SAM) then uses that query to output a binary mask $M \in [0,1]^{H \times W}$ of the edited regions. Training uses ground-truth edit instructions and combines auto-regressive and segmentation losses, with LoRA fine-tuning for the MLLM and full training of the mask decoder. Evaluation on MagicBrush, AutoSplice, CocoGLIDE, and a new PerfBrush dataset demonstrates that EditScout generalizes to unseen edits and outperforms existing forensic baselines, highlighting the potential of integrating foundation models into digital image forensics.
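The combined training objective described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the choice of binary cross-entropy plus Dice for the segmentation term, the weights `w_txt`, `w_bce`, and `w_dice`, and all function names are assumptions introduced here; the summary states only that an auto-regressive loss and a segmentation loss are combined.

```python
import math

EPS = 1e-7  # numerical floor to avoid log(0); an illustrative choice


def token_cross_entropy(token_probs, token_targets):
    """Auto-regressive term: mean negative log-likelihood of the target tokens.

    token_probs[i] is a probability distribution over the vocabulary at step i,
    token_targets[i] is the index of the ground-truth token at step i.
    """
    nll = 0.0
    for probs, target in zip(token_probs, token_targets):
        nll -= math.log(max(probs[target], EPS))
    return nll / len(token_targets)


def bce_loss(mask_pred, mask_gt):
    """Per-pixel binary cross-entropy over a flattened mask (assumed term)."""
    total = 0.0
    for p, y in zip(mask_pred, mask_gt):
        p = min(max(p, EPS), 1.0 - EPS)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(mask_pred)


def dice_loss(mask_pred, mask_gt):
    """Soft Dice loss over a flattened mask (assumed term)."""
    intersection = sum(p * y for p, y in zip(mask_pred, mask_gt))
    denominator = sum(mask_pred) + sum(mask_gt)
    return 1.0 - (2.0 * intersection + EPS) / (denominator + EPS)


def total_loss(token_probs, token_targets, mask_pred, mask_gt,
               w_txt=1.0, w_bce=1.0, w_dice=1.0):
    """Weighted sum of the text term and the segmentation term.

    The weights are hypothetical hyperparameters, not values from the paper.
    """
    text_term = token_cross_entropy(token_probs, token_targets)
    seg_term = w_bce * bce_loss(mask_pred, mask_gt) + w_dice * dice_loss(mask_pred, mask_gt)
    return w_txt * text_term + seg_term
```

In this sketch a perfect text prediction paired with a perfect mask drives the total close to zero, while errors in either branch raise it, which is the behavior a joint objective of this kind needs.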

Abstract

Image editing technologies are tools used to transform, adjust, remove, or otherwise alter images. Recent research has significantly improved the capabilities of image editing tools, enabling the creation of photorealistic and semantically informed forged regions that are nearly indistinguishable from authentic imagery, presenting new challenges in digital forensics and media credibility. While current image forensic techniques are adept at localizing forged regions produced by traditional image manipulation methods, they struggle to localize regions created by diffusion-based techniques. To bridge this gap, we present a novel framework that integrates a multimodal Large Language Model (LLM) for enhanced reasoning capabilities to localize tampered regions in images produced by diffusion model-based editing methods. By leveraging the contextual and semantic strengths of LLMs, our framework achieves promising results on the MagicBrush, AutoSplice, and PerfBrush (a novel diffusion-based dataset) datasets, outperforming previous approaches in mIoU and F1-score metrics. Notably, our method excels on the PerfBrush dataset, a self-constructed test set featuring previously unseen types of edits. On this set, where traditional methods typically falter and achieve markedly low scores, our approach demonstrates promising performance.
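For readers unfamiliar with the two reported metrics, the following sketch shows one common way to compute IoU and F1 between a predicted and a ground-truth binary mask, and to average them over a test set. The function names, the per-image averaging, and the convention of scoring two empty masks as 1.0 are assumptions made for illustration; the abstract does not specify the exact evaluation protocol.

```python
def iou_and_f1(pred_mask, gt_mask):
    """Return (IoU, F1) for two binary masks given as nested lists of 0/1."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                tp += 1      # pixel correctly flagged as edited
            elif p and not g:
                fp += 1      # pixel wrongly flagged as edited
            elif g and not p:
                fn += 1      # edited pixel that was missed
    union = tp + fp + fn
    # Assumed convention: two empty masks count as a perfect match.
    iou = tp / union if union else 1.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 1.0
    return iou, f1


def mean_scores(mask_pairs):
    """Average IoU and F1 over (pred_mask, gt_mask) pairs, one pair per image."""
    scores = [iou_and_f1(pred, gt) for pred, gt in mask_pairs]
    mean_iou = sum(s[0] for s in scores) / len(scores)
    mean_f1 = sum(s[1] for s in scores) / len(scores)
    return mean_iou, mean_f1


# Example: one true positive and one false positive out of four pixels.
example_pred = [[1, 1], [0, 0]]
example_gt = [[1, 0], [0, 0]]
example_iou, example_f1 = iou_and_f1(example_pred, example_gt)
```

Both metrics ignore true-negative pixels, so they are not inflated when the edited region covers only a small part of the image.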
