EditScout: Locating Forged Regions from Diffusion-based Edited Images with Multimodal LLM

Quang Nguyen, Truong Vu, Trong-Tung Nguyen, Yuxin Wen, Preston K Robinette, Taylor T Johnson, Tom Goldstein, Anh Tran, Khoi Nguyen

TL;DR

This work tackles the challenging problem of localizing forged regions produced by diffusion-based image edits, which evade traditional forensic cues. It introduces EditScout, a two-module framework that leverages a multimodal large language model (MLLM) to generate a reasoning query from the input image and a prompt, which is then used by a segmentation head (SAM) to output a binary mask $M \in \{0,1\}^{H \times W}$ of edited regions. Training uses ground-truth edit instructions and combines auto-regressive and segmentation losses, with LoRA fine-tuning for the MLLM and full training of the mask decoder. Evaluation on MagicBrush, AutoSplice, CocoGLIDE, and a new PerfBrush dataset demonstrates that EditScout generalizes to unseen edits and outperforms existing forensic baselines, highlighting the potential of integrating foundation models into digital image forensics.
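As a rough illustration of how an auto-regressive loss and a segmentation loss can be combined into one training objective, here is a minimal NumPy sketch. It is not the paper's formulation: the choice of per-pixel binary cross-entropy as the segmentation term, the weights `w_text` and `w_mask`, and all function names are assumptions made for this example.

```python
import numpy as np

def token_cross_entropy(logits, targets):
    """Mean next-token cross-entropy. logits: (T, V); targets: (T,) token ids."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def mask_bce(probs, gt, eps=1e-7):
    """Mean per-pixel binary cross-entropy between probabilities and a 0/1 mask."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1 - eps)
    g = np.asarray(gt, dtype=float)
    return float(-(g * np.log(p) + (1 - g) * np.log(1 - p)).mean())

def combined_loss(logits, targets, probs, gt, w_text=1.0, w_mask=1.0):
    """Weighted sum of the text loss and the mask loss (weights are hypothetical)."""
    return w_text * token_cross_entropy(logits, targets) + w_mask * mask_bce(probs, gt)
```

In a real setup the text term would be computed over the generated edit instruction (including the [SEG] token) and the mask term over the mask decoder's output; other segmentation terms, such as a Dice loss, could be added in the same way.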

Abstract

Image editing technologies are tools used to transform, adjust, remove, or otherwise alter images. Recent research has significantly improved the capabilities of image editing tools, enabling the creation of photorealistic and semantically informed forged regions that are nearly indistinguishable from authentic imagery, presenting new challenges in digital forensics and media credibility. While current image forensic techniques are adept at localizing forged regions produced by traditional image manipulation methods, they struggle to localize regions created by diffusion-based techniques. To bridge this gap, we present a novel framework that integrates a multimodal Large Language Model (LLM) for enhanced reasoning capabilities to localize tampered regions in images produced by diffusion model-based editing methods. By leveraging the contextual and semantic strengths of LLMs, our framework achieves promising results on the MagicBrush, AutoSplice, and PerfBrush (a novel diffusion-based dataset) datasets, outperforming previous approaches in mIoU and F1-score. Notably, our method excels on PerfBrush, a self-constructed test set featuring previously unseen types of edits: where traditional methods typically falter and achieve markedly low scores, our approach demonstrates promising performance.
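For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute IoU and F1 for binary localization masks and average them over a dataset. The per-image averaging, the handling of empty masks, and the function names are assumptions for this example; the paper's exact evaluation protocol is not stated in this summary.

```python
import numpy as np

def binary_iou(pred, gt):
    """Intersection over union of two binary masks with the same shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else float(inter / union)

def binary_f1(pred, gt):
    """Pixel-level F1 score, treating edited pixels as the positive class."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denom = 2 * tp + fp + fn
    return 1.0 if denom == 0 else float(2 * tp / denom)

def mean_scores(preds, gts):
    """Average per-image IoU and F1 over paired lists of masks."""
    ious = [binary_iou(p, g) for p, g in zip(preds, gts)]
    f1s = [binary_f1(p, g) for p, g in zip(preds, gts)]
    return float(np.mean(ious)), float(np.mean(f1s))
```

For example, a prediction that marks two pixels when only one of them is truly edited gives an IoU of 1/2 and an F1 of 2/3, since F1 penalizes the single false positive less heavily than IoU does.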

Paper Structure

This paper contains 8 sections, 1 equation, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Visualizing edited images with annotated bounding boxes highlights notable disparities between (a) traditional editing methods (boxes in red) and (b) diffusion-based techniques (boxes in blue). While conventional edits are easily identifiable, diffusion-based alterations present challenges in image forensics due to their seamless blending of edited regions with their surroundings, yielding photorealistic results.
  • Figure 2: Overview of EditScout. There are two main modules: MLLM-based reasoning query generation and a segmentation model. The first module takes as input the user's prompt and image, producing a sequence of text tokens that includes a special [SEG] token representing the reasoning query and the edit instruction. The second module receives the [SEG] token as a query to generate the binary mask indicating the edited regions. Notably, only the mask decoder and a part of the MLLM are fine-tuned, while the other components remain frozen.
  • Figure 3: Edit instruction examples from the MagicBrush dataset (Zhang et al., 2024) (left) and the AutoSplice dataset (Jia et al., 2023) (right).
  • Figure 4: We present a series of editing comparisons between two distinct datasets: MagicBrush and PerfBrush. Leveraging the source and mask images from MagicBrush, we employ the BrushNet inpainting pipeline to generate a diverse set of edited outcomes.
  • Figure 5: Comparison of predicted masks between EditScout and other methods on the MagicBrush (dev + test) (first two rows) and PerfBrush (last two rows) datasets.
  • ...and 1 more figure