COCO-Inpaint: A Benchmark for Image Inpainting Detection and Manipulation Localization
Haozhen Yan, Yan Hong, Jiahui Zhan, Yikun Ji, Jun Lan, Huijia Zhu, Weiqiang Wang, Jianfu Zhang
TL;DR
COCO-Inpaint addresses the lack of inpainting-focused benchmarks for Image Manipulation Detection and Localization (IMDL) by providing a large-scale dataset built on MS-COCO, comprising 258,266 inpainted and 117,266 authentic images generated with six state-of-the-art inpainting models, four mask-generation strategies, and two guidance modes. The benchmark defines a rigorous evaluation framework with image-level and pixel-level metrics, enabling detailed comparisons across model architectures, mask types, and prompts. Experimental results show that Vision Transformer–based detectors outperform CNN-based baselines, but cross-model generalization remains limited and highly sensitive to mask design, mask ratio, and text guidance, indicating key directions for improving robustness. By offering diverse, multi-level data and standardized evaluation, COCO-Inpaint enables robust comparison of IMDL methods and fosters progress toward more trustworthy image authenticity assessment and manipulation localization.
Abstract
Recent advances in image manipulation have achieved unprecedented progress in generating photorealistic content, while simultaneously eliminating barriers to arbitrary manipulation and editing, which raises concerns about multimedia authenticity and cybersecurity. However, existing Image Manipulation Detection and Localization (IMDL) methodologies predominantly focus on splicing or copy-move forgeries and lack dedicated benchmarks for inpainting-based manipulations. To bridge this gap, we present COCO-Inpaint, a comprehensive benchmark specifically designed for inpainting detection, with three key contributions: 1) High-quality inpainting samples generated by six state-of-the-art inpainting models, 2) Diverse generation scenarios enabled by four mask generation strategies with optional text guidance, and 3) Large-scale coverage of 258,266 inpainted images with rich semantic diversity. Our benchmark is constructed to emphasize intrinsic inconsistencies between inpainted and authentic regions, rather than superficial semantic artifacts such as object shapes. We establish a rigorous evaluation protocol using three standard metrics to assess existing IMDL approaches. The dataset will be made publicly available to facilitate future research in this area.
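To make the idea of pixel-level evaluation concrete, the following is a minimal sketch of how a predicted localization mask could be scored against a ground-truth inpainting mask. The abstract does not name its three metrics, so the choice of IoU and F1 here is an assumption (both are common in manipulation localization work), and the function names `pixel_counts`, `pixel_iou`, and `pixel_f1` are hypothetical, not part of the benchmark's code.

```python
from typing import Sequence, Tuple

# A 2D binary mask: 1 marks an inpainted pixel, 0 marks an authentic pixel.
Mask = Sequence[Sequence[int]]


def pixel_counts(pred: Mask, truth: Mask) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) over all pixels."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def pixel_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of the predicted and true inpainted regions."""
    tp, fp, fn = pixel_counts(pred, truth)
    denom = tp + fp + fn
    # Assumption: two empty masks count as perfect agreement.
    return tp / denom if denom else 1.0


def pixel_f1(pred: Mask, truth: Mask) -> float:
    """Harmonic mean of pixel-level precision and recall."""
    tp, fp, fn = pixel_counts(pred, truth)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


# Toy 3x3 example: 3 pixels correctly flagged, 1 spurious, 1 missed.
truth = [[0, 1, 1],
         [0, 1, 1],
         [0, 0, 0]]
pred = [[0, 0, 1],
        [0, 1, 1],
        [0, 0, 1]]

print(pixel_iou(pred, truth))  # 3 / (3 + 1 + 1) = 0.6
print(pixel_f1(pred, truth))   # 6 / (6 + 1 + 1) = 0.75
```

An image-level score would instead compare one binary label per image (inpainted or authentic), which is why a benchmark of this kind reports both levels separately.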
