GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization

Yirui Chen, Xudong Huang, Quan Zhang, Wei Li, Mingjian Zhu, Qiangyu Yan, Simiao Li, Hanting Chen, Hailin Hu, Jie Yang, Wei Liu, Jie Hu

TL;DR

This work tackles the challenge of detecting and localizing generative image manipulations by introducing the GIM dataset, a million-scale IMDL resource built from SAM-based masks and LLM-guided prompts across multiple generators. It establishes a two-setting benchmark (mix- and cross-generator) built on ImageNet and VOC, and proposes GIMFormer, a transformer-based architecture that fuses learned tampering traces (ShadowTracer) with frequency-spatial features (FSB) and multi-scale anomaly modeling (MWAM). Empirical results show GIMFormer achieves state-of-the-art performance in both detection and pixel-level localization, with ablations confirming the contribution of each component and the benefit of cross-generator pretraining. The dataset and framework enable robust evaluation of IMDL methods under realistic generative tampering and degradations, supporting improved trust in AI-generated content and informing future work on broader manipulation scenarios, including video.

Abstract

The extraordinary ability of generative models to edit and synthesize realistic images has become a new trend, posing a serious threat to the trustworthiness of multimedia data and driving research on image manipulation detection and localization (IMDL). However, the lack of a large-scale data foundation has held back the IMDL task. In this paper, we build a local manipulation data generation pipeline that integrates the capabilities of SAM, LLMs, and generative models. On this basis, we propose the GIM dataset, which has the following advantages: 1) Large scale: GIM includes over one million pairs of AI-manipulated images and real images. 2) Rich image content: GIM encompasses a broad range of image classes. 3) Diverse generative manipulation: the images are manipulated with state-of-the-art generators across various manipulation tasks. These advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce the GIM benchmark with two settings to evaluate existing IMDL methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, a Frequency-Spatial Block (FSB), and a Multi-Window Anomalous Modeling (MWAM) module. Extensive experiments on GIM demonstrate that GIMFormer surpasses the previous state-of-the-art approach on two different benchmarks.

Paper Structure

This paper contains 30 sections, 6 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Example images from the GIM dataset. Our dataset includes images manipulated by three state-of-the-art generators: Stable-Diffusion, GLIDE, and DDNM. Three columns display authentic images, manipulation masks and manipulated images.
  • Figure 2: An overview of the dataset generation pipeline. Given the original image and a user query (classification attribution or mouse input), the manipulation mask is extracted using SAM. Tampering prompts are then organized by an LLM, which combines replacement classes. The final manipulated images are produced by generative models from the image, tampering mask, and prompts.
  • Figure 2: Dataset Scale Experiment: Effect of Dataset Scale on the Performance of Base Models (SegFormer-b0, ResNet-50).
  • Figure 3: GIMFormer architecture. ShadowTracer extracts trace map $t$ from the input image $x$. The encoder combines $x$ and $t$ to generate pyramid features $F_i$ across four stages, which are sent to the decoder for manipulation detection and localization.
  • Figure 4: Generative manipulation leaves subtle traces. ShadowTracer identifies intrinsic patterns and reconstructs the underlying tampering perturbations.
  • ...and 5 more figures