InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models
Kai Wang, Shaozhang Niu, Qixian Hao, Jiwei Zhang
TL;DR
This work targets Image Inpainting Localization (IIL), a task in which prior methods often produce overconfident predictions and miss subtle tampering boundaries. It introduces InpDiffusion, a diffusion-based approach that treats IIL as conditional mask generation, using image semantics and edge priors to refine predictions iteratively. The model has two main parts. An Adaptive Conditional Network (ACN) with Hierarchical Feature Extraction and a Dual-stream Multi-scale Feature Extractor (DMFE) captures semantic and edge cues. A Denoising Network (DN) with edge supervision jointly predicts denoised masks and edges while balancing the loss terms for robust supervision. Experiments on Inpaint32K and additional datasets show state-of-the-art performance, generalization to unseen tampering types, and robustness to common image attacks, making the method a candidate for tampering localization in forensics and security contexts.
Abstract
As artificial intelligence advances rapidly, particularly with the advent of GANs and diffusion models, accurate Image Inpainting Localization (IIL) has become increasingly challenging. Current IIL methods face two main problems: a tendency towards overconfidence, which leads to incorrect predictions, and difficulty in detecting subtle tampering boundaries in inpainted images. In response, we propose a new paradigm that treats IIL as a conditional mask generation task using diffusion models. Our method, InpDiffusion, conditions the denoising process on image semantics to progressively refine predictions. During denoising, we employ edge conditions and introduce a novel edge supervision strategy to enhance the model's perception of edge details in inpainted objects. Balancing the diffusion model's stochastic sampling with edge supervision of tampered image regions mitigates the risk of incorrect predictions caused by overconfidence and prevents the loss of subtle boundaries that can result from overly stochastic processes. Furthermore, we propose a Dual-stream Multi-scale Feature Extractor (DMFE) that extracts multi-scale features and enhances feature representation by considering both the semantic and edge conditions of the inpainted images. Extensive experiments across challenging datasets demonstrate that InpDiffusion significantly outperforms existing state-of-the-art methods in IIL tasks, while also showing excellent generalization and robustness.
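The idea of treating localization as mask generation with an added edge term can be illustrated with a minimal NumPy sketch. It is not the InpDiffusion implementation: the standard DDPM forward-noising formula is used for the mask, and the linear beta schedule, the morphological edge definition, the binary cross-entropy terms, and the weight `lam` are all assumptions chosen for illustration. The function names (`make_alpha_bar`, `q_sample`, `edge_map`, `joint_loss`) are hypothetical.

```python
import numpy as np

def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def edge_map(mask):
    """Boundary of a binary mask: foreground pixels with a background 4-neighbour.

    This edge definition is an assumption, not taken from the paper.
    """
    m = mask.astype(bool)
    p = np.pad(m, 1, mode="edge")
    # True where all four neighbours (up, down, left, right) are foreground.
    all_fg = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return (m & ~all_fg).astype(np.float64)

def bce(pred, target, tiny=1e-7):
    """Mean binary cross-entropy between probabilities and binary targets."""
    pred = np.clip(np.asarray(pred, dtype=np.float64), tiny, 1.0 - tiny)
    target = np.asarray(target, dtype=np.float64)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def joint_loss(mask_pred, edge_pred, mask_gt, lam=0.5):
    """Mask loss plus a weighted edge loss; lam is a hypothetical balancing weight."""
    return bce(mask_pred, mask_gt) + lam * bce(edge_pred, edge_map(mask_gt))
```

In a training loop under these assumptions, the ground-truth mask would be noised with `q_sample` at a random step `t`, a denoising network conditioned on the image would predict a mask and an edge map from the noisy input, and `joint_loss` would score both predictions. The edge term penalizes boundary errors that a mask-only loss averages away, since boundary pixels are a small fraction of the image.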
