3D-Consistent Image Inpainting with Diffusion Models
Leonid Antsfeld, Boris Chidlovskii
TL;DR
The paper tackles 3D inconsistency in diffusion-based image inpainting by introducing InConDiff, a method that trains unconditional diffusion models with image pairs from the same scene to inject 3D priors via an additional viewpoint in the denoising process. It integrates in-context guidance from a second image and conditioning on the known region during inference, using a Diffusion Transformer to enable cross-view completion and harmonize masked and unmasked regions. Through experiments on HM3D, MegaDepth, StreetView, and WalkingTour, InConDiff achieves semantically coherent and 3D-consistent inpaintings and outperforms state-of-the-art methods, with ablations showing robustness to masking ratios and benefits from a Laplace-based noise schedule. The work demonstrates that explicit 3D supervision is not required to enforce 3D priors, offering a practical approach to more realistic inpainting in complex scenes and providing a foundation for future multi-view or video-consistent diffusion methods.
Abstract
We address the problem of 3D inconsistency in image inpainting based on diffusion models. We propose a generative model trained on image pairs that belong to the same scene. To achieve 3D-consistent and semantically coherent inpainting, we modify the generative diffusion model by incorporating an alternative point of view of the scene into the denoising process. This creates an inductive bias that allows the model to recover 3D priors while training to denoise in 2D, without explicit 3D supervision. Training unconditional diffusion models with additional images as in-context guidance makes it possible to harmonize the masked and non-masked regions while repainting and ensures 3D consistency. We evaluate our method on one synthetic and three real-world datasets and show that it generates semantically coherent and 3D-consistent inpaintings and outperforms state-of-the-art methods.
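The idea of repainting the masked region while keeping it harmonized with the known region can be illustrated with a minimal sketch. It assumes a RePaint-style blending step, which is one common way to condition an unconditional diffusion model on known pixels; whether InConDiff uses exactly this update is an assumption, and the names `forward_noise` and `blend_known_region` are hypothetical. Images are flattened to lists of pixel values for simplicity, and the second-view guidance is not modeled here.

```python
import math
import random


def forward_noise(x0, alpha_bar_t, rng):
    """Diffuse clean pixels x0 to noise level t (standard DDPM forward process).

    Each pixel becomes sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps,
    with eps drawn from a standard normal distribution.
    """
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    return [scale_signal * v + scale_noise * rng.gauss(0.0, 1.0) for v in x0]


def blend_known_region(x_generated, x_known_noised, mask):
    """Combine the model's sample with the noised known pixels.

    mask == 1 marks known pixels (taken from the noised reference),
    mask == 0 marks the hole (taken from the model's current sample).
    Assumed RePaint-style step, not confirmed as the paper's update.
    """
    return [m * k + (1 - m) * g
            for g, k, m in zip(x_generated, x_known_noised, mask)]


if __name__ == "__main__":
    rng = random.Random(0)
    known_image = [0.2, 0.4, 0.6, 0.8]   # clean target-view pixels
    mask = [1, 1, 0, 0]                  # last two pixels are the hole
    model_sample = [0.0, 0.0, 0.5, 0.5]  # stand-in for a denoiser output

    noised_known = forward_noise(known_image, alpha_bar_t=0.9, rng=rng)
    blended = blend_known_region(model_sample, noised_known, mask)
    print(blended)
```

In a full sampler this blend would be applied at every denoising step, so that the generated pixels in the hole are repeatedly exposed to the known context at a matching noise level.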
