VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference

Sakshi Agarwal, Gabe Hope, Erik B. Sudderth

TL;DR

VIPaint addresses the problem of conditioning pre-trained diffusion priors on partial observations for image inpainting and other inverse problems. It introduces a hierarchical variational inference framework that marginalizes missing features and optimizes a non-Gaussian posterior over a mid-to-high range of diffusion times, followed by DPS-guided sampling to produce diverse, high-quality imputations. The approach yields superior plausibility and diversity across pixel-based and latent diffusion models, outperforming prior sampling and variational methods, and generalizes to tasks such as deblurring and super-resolution. By decoupling semantic global structure from fine-grained details through a two-stage process, VIPaint offers robust, scalable inference for challenging, high-uncertainty restoration scenarios.

Abstract

Diffusion probabilistic models learn to remove noise that is artificially added to the data during training. Novel data, like images, may then be generated from Gaussian noise through a sequence of denoising operations. While this Markov process implicitly defines a joint distribution over noise-free data, it is not simple to condition the generative process on masked or partial images. A number of heuristic sampling procedures have been proposed for solving inverse problems with diffusion priors, but these approaches do not directly approximate the true conditional distribution imposed by inference queries, and are often ineffective for large masked regions. Moreover, many of these baselines cannot be applied to latent diffusion models which use image encodings for efficiency. We instead develop a hierarchical variational inference algorithm that analytically marginalizes missing features, and uses a rigorous variational bound to optimize a non-Gaussian Markov approximation of the true diffusion posterior. Through extensive experiments with both pixel-based and latent diffusion models of images, we show that our VIPaint method significantly outperforms previous approaches in both the plausibility and diversity of imputations, and is easily generalized to other inverse problems like deblurring and superresolution.
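As a rough illustration of what "analytically marginalizes missing features" can mean in practice: if one assumes an independent Gaussian likelihood per pixel (an assumption made for this sketch; the abstract does not state the likelihood form), integrating out the unobserved pixels simply drops their terms, so the observation log-likelihood sums over observed pixels only. The function name `masked_gaussian_loglik`, the noise scale `sigma`, and the array shapes are hypothetical and not taken from the paper.

```python
import numpy as np

def masked_gaussian_loglik(x_hat, y, mask, sigma=0.1):
    """Log-likelihood of the observed pixels of y under N(x_hat, sigma^2 I).

    Assumes independent per-pixel Gaussian noise (illustrative assumption).
    mask == 1 marks observed pixels; masked-out pixels are marginalized,
    which for a factorized Gaussian just removes them from the sum.
    """
    x_hat = np.asarray(x_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = np.asarray(mask, dtype=bool)

    # Residuals on observed pixels only; values of y elsewhere are ignored.
    resid = (y - x_hat)[observed]
    n_obs = resid.size

    quad = -0.5 * np.sum(resid ** 2) / sigma ** 2
    log_norm = -0.5 * n_obs * np.log(2.0 * np.pi * sigma ** 2)
    return quad + log_norm
```

Because only observed pixels enter the sum, whatever placeholder values fill the masked region of `y` have no effect on the result.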

Paper Structure

This paper contains 57 sections, 29 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: VIPaint inpainting with a pretrained, unconditional LDM [stablediff] of LSUN churches. For two image-mask pairs (left columns), we show the expected reconstruction for a sample taken from our hierarchical VIPaint posterior at times $\{T_e = 550, 500, 450, T_s = 400 \}$. We skip intermediate noise levels between these critical times during variational optimization, and add fine-grained details to our final inpaintings (three right columns; Inpainting 1 corresponds to the $t=200$ sample) via 400 sequential denoising steps at times $0 \leq t < T_s$.
  • Figure 2: Red-Diff [mardani2023variational] defines its posterior directly on the space of unobserved pixels, and can be seen as a simplified special case of our VIPaint. Red-Diff produces blurry images, and direct optimization of its variational bound (RedDiff-V) leads to poor local optima, so it must resort to annealing through diffusion time (RedDiff, $T \rightarrow 0$). Our VIPaint method suffers from no such difficulty and uses the variational bound for optimization, leading to more robust inference.
  • Figure 3: Top: The hierarchical approximate posterior of VIPaint is defined over a coarse sequence of intermediate latent steps between $T_e$ and $T_s$. During optimization, the variational parameters $\lambda$ defining the posterior on a subset of latent times are fit via a prior loss on times above $T_e$, a hierarchical loss defined across $K$ intermediate times, and a reconstruction loss estimated using a one-step approximation $p_\theta(x|z_{T_s})$ from the posterior samples. Bottom: After variational inference, samples from the hierarchical posterior (now aligned with the observation) transition smoothly in the intermediate latent space $[0, T_s]$ via gradient updates. Note that samples at $T_e$ and $T_s$ are aligned much better for VIPaint than for the baseline PSLD [rout2023solving], whose predictions at $T_e=550$ contain artifacts which subsequent steps cannot correct.
  • Figure 4: We show the progress of fitting VIPaint's posterior, drawing samples after every 50 iterations of inference for two test cases. VIPaint captures the semantics of the underlying image within 50 optimization iterations.
  • Figure 5: Image completion results on ImageNet256 using the LDM prior for the Rotated Window and Random Masking schemes shown in the second row. We show an inpainting from each method in the following four rows. DPS, PSLD, and ReSample show blurry inpaintings of widely varying quality. In contrast, VIPaint interprets the global semantics in the observed image and produces very realistic images. More qualitative plots for LSUN-church appear in Appendix Fig. \ref{fig:consistency-lsun}.
  • ...and 14 more figures