Image Inpainting via Tractable Steering of Diffusion Models

Anji Liu, Mathias Niepert, Guy Van den Broeck

TL;DR

This work tackles constrained image generation by integrating tractable probabilistic models with diffusion-based inpainting. It introduces Tiramisu, a framework that uses Probabilistic Circuits to compute the exact constrained posterior $p_{\text{TPM}}(\tilde{\boldsymbol{x}}_0|\boldsymbol{x}_t,\boldsymbol{x}_0^{\text{k}})$ and steers the diffusion denoising process by fusing it with the unconditional posterior through a weighted geometric mean, controlled by a mixing parameter. The approach scales to high-resolution images by operating in a latent space via VQ-GAN and Monte Carlo estimation of latent soft-evidence, achieving consistent improvements in semantic coherence and fidelity on CelebA-HQ, ImageNet, and LSUN with around 10% additional computation. Empirical results include qualitative steering visualizations, quantitative LPIPS gains over multiple baselines, and a semantic fusion capability that leverages reference patches for more constrained generation. The findings highlight the value of tractable models in enabling more controllable, high-quality image generation and suggest broad applicability to other constrained generation tasks.

Abstract

Diffusion models are the current state of the art for generating photorealistic images. Controlling the sampling process for constrained image generation tasks such as inpainting, however, remains challenging since exact conditioning on such constraints is intractable. While existing methods use various techniques to approximate the constrained posterior, this paper proposes to exploit the ability of Tractable Probabilistic Models (TPMs) to exactly and efficiently compute the constrained posterior, and to leverage this signal to steer the denoising process of diffusion models. Specifically, this paper adopts a class of expressive TPMs termed Probabilistic Circuits (PCs). Building upon prior advances, we further scale up PCs and make them capable of guiding the image generation process of diffusion models. Empirical results suggest that our approach can consistently improve the overall quality and semantic coherence of inpainted images across three natural image datasets (i.e., CelebA-HQ, ImageNet, and LSUN) with only $\sim\! 10 \%$ additional computational overhead brought by the TPM. Further, with the help of an image encoder and decoder, our method can readily accept semantic constraints on specific regions of the image, which opens up the potential for more controlled image generation tasks. In addition to proposing a new framework for constrained image generation, this paper highlights the benefit of more tractable models and motivates the development of expressive TPMs.

Paper Structure (34 sections, 1 theorem, 17 equations, 10 figures, 5 tables)


Key Result

Theorem 1

For any smooth and decomposable PC ${p}(\mathbf{X})$ and univariate weight functions $\{w_i (X_i)\}_i$, define ${p}'(\boldsymbol{x}) = \frac{1}{Z} \prod_i w_i (x_i) \cdot {p}(\boldsymbol{x})$, where the normalizing constant $Z := \sum_{\boldsymbol{x}} \prod_i w_i (x_i) \cdot {p}(\boldsymbol{x})$. As
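The statement above is cut off, so the following is only a sketch of the quantity it defines, not the paper's algorithm. It relies on a standard property of smooth and decomposable PCs: a sum over all assignments can be pushed down to the leaves, so $Z$ can be obtained with one bottom-up pass in which each leaf outputs $\sum_{x_i} w_i(x_i)\, p_{\text{leaf}}(x_i)$. The toy circuit, the node encoding, and the weight values are hypothetical choices made for illustration.

```python
from itertools import product

# Toy circuit over two binary variables X0 and X1 (hypothetical, not from the paper).
# Node encodings: ("leaf", var, probs), ("prod", children), ("sum", [(weight, child), ...]).
# The sum node is smooth (both children cover {X0, X1}); the products are decomposable.
comp_a = ("prod", [("leaf", 0, [0.8, 0.2]), ("leaf", 1, [0.1, 0.9])])
comp_b = ("prod", [("leaf", 0, [0.4, 0.6]), ("leaf", 1, [0.5, 0.5])])
pc = ("sum", [(0.3, comp_a), (0.7, comp_b)])

# Univariate weight functions w_i(x_i), stored as w[i][x_i] (arbitrary illustrative values).
w = [[2.0, 0.5], [1.0, 3.0]]


def evaluate(node, leaf_value):
    """Bottom-up evaluation; leaf_value(var, probs) decides what each leaf outputs."""
    kind = node[0]
    if kind == "leaf":
        return leaf_value(node[1], node[2])
    if kind == "prod":
        result = 1.0
        for child in node[1]:
            result *= evaluate(child, leaf_value)
        return result
    # Sum node: weighted combination of its children.
    return sum(weight * evaluate(child, leaf_value) for weight, child in node[1])


# Single pass: each leaf outputs sum_x w_i(x) * p_leaf(x) instead of a likelihood.
z_forward = evaluate(pc, lambda var, probs: sum(w[var][v] * probs[v] for v in range(2)))

# Brute force over all assignments, for comparison only (exponential in general).
z_brute = 0.0
for x in product(range(2), repeat=2):
    p_x = evaluate(pc, lambda var, probs: probs[x[var]])
    z_brute += w[0][x[0]] * w[1][x[1]] * p_x

print(z_forward, z_brute)
```

The two values agree up to floating-point error. The single pass visits each node once, whereas the brute-force sum grows exponentially with the number of variables.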

Figures (10)

  • Figure 1: Illustration of the steering effect of the TPM on the diffusion model. The same random seed is used by the baseline (CoPaint; zhang2023towards) and our approach. At every time step, given the image at the previous noise level, Tiramisu reconstructs $\tilde{\boldsymbol{x}}_0$ with both the diffusion model and the TPM, and combines the two distributions by taking their geometric mean (solid arrows). The images then go through the noising process to generate the input for the previous time step (dashed arrows).
  • Figure 2: An example PC over boolean variables $X_1, \dots, X_4$. Sum parameters are labeled on the corresponding edges. The probability of every node given input $x_1 \bar{x_2} \bar{x_3} x_4$ is labeled blue on top of the corresponding node.
  • Figure 3: Qualitative results on all three adopted datasets. We compare Tiramisu against six diffusion-based inpainting algorithms. Please refer to the paper's appendix for more qualitative results.
  • Figure 4: Performance and runtime.
  • Figure 5: CelebA-HQ qualitative results for the semantic fusion task. In every sample, two reference images together with their masks are provided to Tiramisu. The task is to generate images that (i) semantically align with the unmasked region of both reference images, and (ii) have high fidelity. For every input, we generate five samples with different levels of semantic coherence. The left-most images are the least semantically constrained and barely match the semantic patterns of the reference images. In contrast, the right-most images strictly match the semantics of the reference images.
  • ...and 5 more figures
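
The geometric-mean combination described in the Figure 1 caption can be sketched for a single categorical variable as follows. This is a minimal illustration, assuming both models output a categorical distribution over the same set of values. The function name `fuse_geometric`, the mixing parameter name `alpha`, and the convention that `alpha` weights the diffusion model are assumptions, not details taken from the paper.

```python
def fuse_geometric(p_dm, p_tpm, alpha):
    """Weighted geometric mean of two categorical distributions.

    Returns q with q[k] proportional to p_dm[k]**alpha * p_tpm[k]**(1 - alpha).
    Assumed convention: alpha = 1 keeps only p_dm, alpha = 0 keeps only p_tpm.
    """
    unnormalized = [a ** alpha * b ** (1.0 - alpha) for a, b in zip(p_dm, p_tpm)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical per-pixel distributions over three values.
p_dm = [0.7, 0.2, 0.1]   # stand-in for the diffusion model's reconstruction
p_tpm = [0.1, 0.2, 0.7]  # stand-in for the TPM's constrained posterior

fused = fuse_geometric(p_dm, p_tpm, alpha=0.5)
print(fused)
```

With equal weighting and these mirrored inputs, the fused distribution is symmetric. Moving `alpha` toward 0 or 1 recovers one of the two inputs.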

Theorems & Definitions (3)

  • Definition 1: Probabilistic Circuits
  • Definition 2: Smoothness and Decomposability
  • Theorem 1