InpDiffusion: Image Inpainting Localization via Conditional Diffusion Models

Kai Wang, Shaozhang Niu, Qixian Hao, Jiwei Zhang

TL;DR

This work targets Image Inpainting Localization (IIL), a challenging task where prior methods often produce overconfident predictions and miss subtle tampering boundaries. It introduces InpDiffusion, a diffusion-model-based approach that treats IIL as conditional mask generation, guided by image semantics and edge priors to iteratively refine predictions. The model combines an Adaptive Conditional Network (ACN) with Hierarchical Feature Extraction and a Dual-stream Multi-scale Feature Extractor (DMFE) to capture semantic and edge cues, and a Denoising Network (DN) with edge supervision to jointly predict denoised masks and edges while balancing losses for robust supervision. Extensive experiments on Inpaint32K and additional datasets demonstrate state-of-the-art performance, excellent generalization to unseen tampering types, and strong robustness to common image attacks, offering a reliable, scalable solution for tampering localization in forensics and security contexts.

Abstract

As artificial intelligence advances rapidly, particularly with the advent of GANs and diffusion models, accurate Image Inpainting Localization (IIL) has become increasingly challenging. Current IIL methods face two main challenges: a tendency towards overconfidence, which leads to incorrect predictions, and difficulty in detecting subtle tampering boundaries in inpainted images. In response, we propose a new paradigm that treats IIL as a conditional mask generation task using diffusion models. Our method, InpDiffusion, uses a denoising process conditioned on image semantics to progressively refine predictions. During denoising, we employ edge conditions and introduce a novel edge supervision strategy to enhance the model's perception of edge details in inpainted objects. Balancing the diffusion model's stochastic sampling with edge supervision of tampered image regions mitigates the risk of incorrect predictions from overconfidence and prevents the loss of subtle boundaries that can result from overly stochastic processes. Furthermore, we propose a Dual-stream Multi-scale Feature Extractor (DMFE) that extracts multi-scale features and enhances feature representation by considering both semantic and edge conditions of the inpainted images. Extensive experiments across challenging datasets demonstrate that InpDiffusion significantly outperforms existing state-of-the-art methods in IIL tasks, while also showing excellent generalization and robustness.
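
To make the idea of mask generation with edge supervision more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It shows a standard forward diffusion step applied to a ground-truth mask and a weighted sum of a mask loss and an edge loss. The cosine noise schedule, the binary cross-entropy losses, the balancing weight `lam`, and the helper `mask_to_edge` (which derives an edge target from the mask) are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import numpy as np

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level for a cosine noise schedule (assumed choice)."""
    f = lambda u: np.cos((u / T + s) / (1 + s) * np.pi / 2) ** 2
    return f(t) / f(0)

def q_sample(x0, t, T, rng):
    """Forward diffusion: x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps

def mask_to_edge(mask):
    """Hypothetical edge target: mask pixels with a 4-neighbour outside the mask."""
    m = mask.astype(bool)
    p = np.pad(m, 1, constant_values=False)
    # A pixel is interior when its up, down, left and right neighbours are all masked.
    interior = p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]
    return (m & ~interior).astype(np.float64)

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and binary targets."""
    pred = np.clip(pred, eps, 1 - eps)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))

def joint_loss(mask_pred, edge_pred, mask_gt, lam=0.5):
    """Mask loss plus lam times edge loss; lam is an assumed balancing weight."""
    edge_gt = mask_to_edge(mask_gt)
    return bce(mask_pred, mask_gt) + lam * bce(edge_pred, edge_gt)
```

In a training loop under these assumptions, a denoising network would receive the noisy mask from `q_sample` together with the image conditions, predict a mask and an edge map, and be optimized with `joint_loss`. Raising `lam` puts more emphasis on boundary pixels, which are a small fraction of the image and would otherwise contribute little to the mask loss.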

Paper Structure (28 sections, 8 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Visualization of the tampered edges and regions predicted by InpDiffusion at different sampling stages. $\hat{x}_0$ and $\hat{e}$ denote the predicted inpainted objects and their edges, respectively. $\hat{x}_0^{\mathrm{w/o\ ES}}$ denotes the predictions without the edge supervision strategy.
  • Figure 2: The framework of InpDiffusion, which includes an Adaptive Conditional Network (ACN) and a Denoising Network (DN). Instead of relying on the discriminative learning paradigm, the framework adopts a generative approach to improve reliability and generalizability.
  • Figure 3: Illustration of the Dual-stream Multi-scale Feature Extractor (DMFE).
  • Figure 4: Illustration of Image Semantic and Edge Extraction.
  • Figure 5: Visual comparisons with recent SOTA models in challenging scenarios with different inpainting techniques.
  • ...and 5 more figures