Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting

Haipeng Liu, Yang Wang, Biao Qian, Meng Wang, Yong Rui

TL;DR

This work addresses the semantic discrepancy in diffusion-based image inpainting by introducing StrDiffusion, a structure-guided diffusion model that uses a progressively sparse auxiliary structure to guide texture denoising. By reformulating the denoising objective around the structure and employing a time-aware guidance strategy, the method produces consistent, meaningful inpainting results and reduces the mismatch between masked and unmasked regions. A discriminator-based correlation measure and an adaptive resampling mechanism regulate structure-texture alignment, which is shown to improve performance across multiple datasets (PSV, CelebA, Places2) with higher PSNR/SSIM and lower FID. The approach demonstrates that time-varying structural guidance can balance semantic consistency and richness, offering a practical advance for diffusion-based inpainting and related tasks; code is released at https://github.com/htyjers/StrDiffusion.

Abstract

Denoising diffusion probabilistic models for image inpainting add noise to the texture of an image during the forward process and recover the masked regions from the unmasked texture via the reverse denoising process. Although they generate meaningful semantics, existing methods suffer from a semantic discrepancy between masked and unmasked regions: the semantically dense unmasked texture is never completely degraded, whereas the masked regions turn into pure noise during the diffusion process, leaving a large discrepancy between them. In this paper, we ask how unmasked semantics guide the texture denoising process, and how to tackle the semantic discrepancy, so as to facilitate consistent and meaningful semantics generation. To this end, we propose a novel structure-guided diffusion model named StrDiffusion, which reformulates the conventional texture denoising process under structure guidance to derive a simplified denoising objective for image inpainting, while revealing that: 1) the semantically sparse structure helps tackle the semantic discrepancy in the early stage, while the dense texture generates reasonable semantics in the late stage; 2) the semantics from unmasked regions essentially offer time-dependent structure guidance for the texture denoising process, benefiting from the time-dependent sparsity of the structure semantics. For the denoising process, a structure-guided neural network is trained to estimate the simplified denoising objective by exploiting the consistency of the denoised structure between masked and unmasked regions. In addition, we devise an adaptive resampling strategy as a formal criterion for whether the structure is competent to guide the texture denoising process, while regulating their semantic correlation. Extensive experiments validate the merits of StrDiffusion over state-of-the-art methods. Our code is available at https://github.com/htyjers/StrDiffusion.

Paper Structure (23 sections, 24 equations, 10 figures, 3 tables, 1 algorithm)

This paper contains 23 sections, 24 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Existing methods, e.g., IR-SDE [luo2023image] (a), suffer from a semantic discrepancy ($\bigcirc$) between the masked and unmasked regions during the denoising process, even though they generate meaningful semantics for the masked regions. Our StrDiffusion (b) tackles the discrepancy via the guidance of an auxiliary sparse structure, yielding consistent and meaningful denoised results. The experiments are conducted on PSV [doersch2012makes].
  • Figure 2: Motivating experiments on whether a sparse structure helps alleviate the discrepancy during the denoising process for image inpainting. Besides the dense texture used by IR-SDE [luo2023image] (a), the unmasked semantics combined with Gaussian noise are also set to a sparse structure, e.g., the grayscale map (b) and the edge map (c). Our StrDiffusion (d) tackles the semantic discrepancy via the progressively sparse structure. The shaded area indicates the discrepancy between the masked and unmasked regions during the denoising process. PSNR (higher is better) measures how well the semantics of the masked (unmasked) regions are recovered, by comparing them with the same regions of the complete image (i.e., the ground truth). The inpainted results are obtained by combining the masked regions of the denoised results with the original masked images.
  • Figure 3: Illustration of the proposed StrDiffusion pipeline. Our basic idea is to tackle the semantic discrepancy between masked and unmasked regions via the guidance of the progressively sparse structure (a), which guides the texture denoising network (b) to generate consistent and meaningful denoised results.
  • Figure 4: Illustration of the diffusion process for the dense texture (a) and sparse structure (b). In particular, the semantic sparsity of the structure is strengthened over time.
  • Figure 5: Illustration of the adaptive resampling strategy, which adaptively regulates the semantic correlation between the denoised texture and structure according to the score from the discriminator.
  • ...and 5 more figures
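
The Figure 2 caption describes two operations that are easy to misread in prose: compositing the inpainted result from the denoised output and the masked input, and measuring PSNR separately on the masked and unmasked regions. The NumPy sketch below illustrates both under stated assumptions and is not taken from the released StrDiffusion code. The function names `composite` and `region_psnr`, the mask convention (1 marks a missing pixel), and the value range [0, 1] are all hypothetical choices made for this example.

```python
import numpy as np


def composite(denoised, masked_image, mask):
    """Fill the hole from the denoised result and keep the known pixels.

    Assumed convention: mask == 1 marks missing (masked) pixels,
    mask == 0 marks known (unmasked) pixels.
    """
    return mask * denoised + (1.0 - mask) * masked_image


def region_psnr(pred, target, region, max_val=1.0):
    """PSNR in dB computed only over pixels where `region` is True."""
    region = np.broadcast_to(np.asarray(region, dtype=bool), pred.shape)
    if not region.any():
        raise ValueError("region selects no pixels")
    mse = np.mean((pred[region] - target[region]) ** 2)
    if mse == 0.0:
        return float("inf")  # identical pixels in this region
    return float(10.0 * np.log10(max_val ** 2 / mse))


# Toy data: an 8x8 RGB "ground truth" with a 4x4 square hole.
rng = np.random.default_rng(0)
ground_truth = rng.random((8, 8, 3))
mask = np.zeros((8, 8, 1))
mask[2:6, 2:6] = 1.0
masked_image = ground_truth * (1.0 - mask)

# Stand-in for a denoised result: ground truth plus small noise.
denoised = np.clip(ground_truth + 0.05 * rng.standard_normal((8, 8, 3)), 0.0, 1.0)

inpainted = composite(denoised, masked_image, mask)
psnr_masked = region_psnr(inpainted, ground_truth, mask == 1)
psnr_unmasked = region_psnr(denoised, ground_truth, mask == 0)
```

Applying `region_psnr` to an intermediate denoised result, rather than to the composite, gives one number per region at each step. The gap between the two numbers is one simple way to quantify the discrepancy that the caption refers to.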