COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization

Haozhen Yan, Yan Hong, Jiahui Zhan, Yikun Ji, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang

TL;DR

COCO-Inpaint addresses the lack of inpainting-focused IMDL benchmarks with a large-scale dataset built on MS-COCO. It comprises 117,266 authentic images and 258,266 inpainted images, the latter generated with six state-of-the-art inpainting models, four mask-generation strategies, and two guidance modes. The benchmark defines an evaluation framework with image-level and pixel-level metrics, enabling detailed comparisons across model architectures, mask types, and prompts. Experimental results show that Vision Transformer-based detectors outperform CNN-based baselines, but cross-model generalization remains limited and is highly sensitive to mask design, mask ratio, and text guidance, which points to key directions for improving robustness. By offering diverse data and standardized evaluation, COCO-Inpaint supports robust assessment of IMDL methods and progress toward more trustworthy image authenticity assessment and manipulation localization.

Abstract

Recent advancements in image manipulation have achieved unprecedented progress in generating photorealistic content, but they have also eliminated barriers to arbitrary manipulation and editing, raising concerns about multimedia authenticity and cybersecurity. However, existing Image Manipulation Detection and Localization (IMDL) methodologies predominantly focus on splicing or copy-move forgeries and lack dedicated benchmarks for inpainting-based manipulations. To bridge this gap, we present COCO-Inpaint, a comprehensive benchmark specifically designed for inpainting detection, with three key contributions: 1) high-quality inpainting samples generated by six state-of-the-art inpainting models, 2) diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) large-scale coverage with 258,266 inpainted images with rich semantic diversity. Our benchmark is constructed to emphasize intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We establish a rigorous evaluation protocol using three standard metrics to assess existing IMDL approaches. The dataset will be made publicly available to facilitate future research in this area.

Paper Structure

This paper contains 22 sections, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Comparison of IMDL model performance on cross-dataset samples. The source images come from the AutoSplice [AutoSplice_2023] dataset, which uses DALL-E2 [DALLE2_2022] to generate the inpainted images. (a) and (b) show the predictions of the IML-ViT [IML-ViT_2023] model trained on the CASIA [CASIA_2013] and COCO-Inpaint datasets, respectively. Each example shows both the soft mask (top) and the binary mask (bottom). As illustrated, training on COCO-Inpaint substantially improves the model's detection sensitivity and segmentation accuracy in inpainting scenarios.
  • Figure 2: Visualization of the COCO-Inpaint dataset. Mask1, Mask2, Mask3, and Mask4 represent Random Polygon, Segmentation-based Mask, Random Box, and Bounding Box, respectively. The inpainted regions align spatially with their respective masks. The prompts are derived from the captions in the MS-COCO dataset.
  • Figure 3: Visualization of IMDL model results on COCO-Inpaint.
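
The Figure 1 caption distinguishes a soft mask from a binary mask, and the TL;DR mentions pixel-level metrics. As a rough illustration only, the sketch below thresholds a per-pixel probability map into a binary mask and scores it against a ground-truth mask. The 0.5 threshold, the choice of IoU and F1, and the function names `binarize` and `pixel_scores` are assumptions made for this example; the text above does not name the benchmark's three metrics or its thresholding rule.

```python
def binarize(soft_mask, threshold=0.5):
    """Turn per-pixel probabilities into a 0/1 mask (threshold is an assumed value)."""
    return [[1 if p >= threshold else 0 for p in row] for row in soft_mask]


def pixel_scores(pred, truth):
    """Return (IoU, F1) for the manipulated class of two equally sized 0/1 masks."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1   # manipulated pixel correctly flagged
            elif p and not t:
                fp += 1   # authentic pixel wrongly flagged
            elif t and not p:
                fn += 1   # manipulated pixel missed
    union = tp + fp + fn
    if union == 0:
        # Both masks are empty; scoring this as perfect is a convention choice.
        return 1.0, 1.0
    iou = tp / union
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1


# Toy 2x3 example: hypothetical model probabilities and ground truth.
soft = [[0.9, 0.6, 0.1],
        [0.7, 0.4, 0.2]]
truth = [[1, 1, 0],
         [0, 1, 0]]

pred = binarize(soft)
iou, f1 = pixel_scores(pred, truth)
print(pred, iou, round(f1, 3))
```

In this toy case two manipulated pixels are found, one is missed, and one authentic pixel is flagged, so the IoU is 2/4 and the F1 is 4/6. A real evaluation would apply the same counting to full-resolution masks and average over the test set.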