VIPaint: Image Inpainting with Pre-Trained Diffusion Models via Variational Inference
Sakshi Agarwal, Gabe Hoope, Erik B. Sudderth
TL;DR
VIPaint addresses the problem of conditioning pre-trained diffusion priors on partial observations for image inpainting and other inverse problems. It introduces a hierarchical variational inference framework that analytically marginalizes missing features and optimizes a non-Gaussian posterior approximation over a mid-to-high range of diffusion times. It then applies diffusion posterior sampling (DPS) guidance to produce diverse, high-quality imputations. The approach improves both plausibility and diversity for pixel-based and latent diffusion models, outperforming prior sampling-based and variational methods, and generalizes to tasks such as deblurring and super-resolution. By decoupling global semantic structure from fine-grained details through this two-stage process, VIPaint offers robust, scalable inference for challenging, high-uncertainty restoration scenarios.
Abstract
Diffusion probabilistic models learn to remove noise that is artificially added to the data during training. Novel data, like images, may then be generated from Gaussian noise through a sequence of denoising operations. While this Markov process implicitly defines a joint distribution over noise-free data, it is not straightforward to condition the generative process on masked or partial images. A number of heuristic sampling procedures have been proposed for solving inverse problems with diffusion priors, but these approaches do not directly approximate the true conditional distribution imposed by inference queries, and are often ineffective for large masked regions. Moreover, many of these baselines cannot be applied to latent diffusion models, which use image encodings for efficiency. We instead develop a hierarchical variational inference algorithm that analytically marginalizes missing features, and uses a rigorous variational bound to optimize a non-Gaussian Markov approximation of the true diffusion posterior. Through extensive experiments with both pixel-based and latent diffusion models of images, we show that our VIPaint method significantly outperforms previous approaches in both the plausibility and diversity of imputations, and is easily generalized to other inverse problems like deblurring and super-resolution.
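To make the setup described above concrete, here is a minimal NumPy sketch of two standard ingredients. It is not VIPaint's implementation. The first is the DDPM forward noising process q(x_t | x_0), which produces the noisy data a diffusion model learns to denoise. The second is a pixel-wise Gaussian observation likelihood evaluated only on observed pixels, which shows why missing features drop out of the likelihood when it factorizes over pixels. The linear beta schedule, the noise level `sigma`, and the helper names (`linear_beta_schedule`, `forward_noise`, `masked_gaussian_loglik`) are hypothetical choices made for illustration.

```python
import numpy as np


def linear_beta_schedule(T: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> np.ndarray:
    """Linear DDPM noise schedule (an assumed choice; real priors may use others)."""
    return np.linspace(beta_start, beta_end, T)


def forward_noise(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
                  rng: np.random.Generator) -> np.ndarray:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps


def masked_gaussian_loglik(x0: np.ndarray, y: np.ndarray,
                           mask: np.ndarray, sigma: float) -> float:
    """log p(y | x0) under an assumed pixel-wise Gaussian likelihood.

    The likelihood factorizes over pixels, so missing pixels (mask == False)
    have no observation and contribute no term; only observed pixels count.
    """
    resid = (y - x0)[mask]
    n = resid.size
    return float(-0.5 * np.sum(resid ** 2) / sigma ** 2
                 - n * np.log(sigma)
                 - 0.5 * n * np.log(2.0 * np.pi))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = np.cumprod(1.0 - linear_beta_schedule(1000))
    x0 = rng.uniform(-1.0, 1.0, size=(8, 8))      # toy "image"
    mask = np.zeros((8, 8), dtype=bool)
    mask[:, :4] = True                             # left half observed
    y = np.where(mask, x0, 0.0)                    # missing pixels filled arbitrarily
    x_t = forward_noise(x0, 500, alpha_bar, rng)   # noisy sample at t = 500
    print(x_t.shape, masked_gaussian_loglik(x0, y, mask, sigma=0.1))
```

In this toy model, changing the values of masked-out pixels in `x0` leaves `masked_gaussian_loglik` unchanged, which is the sense in which the missing features are marginalized out of the observation model.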
