3D-Consistent Image Inpainting with Diffusion Models

Leonid Antsfeld, Boris Chidlovskii

TL;DR

The paper tackles 3D inconsistency in diffusion-based image inpainting by introducing InConDiff, a method that trains unconditional diffusion models with image pairs from the same scene to inject 3D priors via an additional viewpoint in the denoising process. It integrates in-context guidance from a second image and conditioning on the known region during inference, using a Diffusion Transformer to enable cross-view completion and harmonize masked and unmasked regions. Through experiments on HM3D, MegaDepth, StreetView, and WalkingTour, InConDiff achieves semantically coherent and 3D-consistent inpaintings and outperforms state-of-the-art methods, with ablations showing robustness to masking ratios and benefits from a Laplace-based noise schedule. The work demonstrates that explicit 3D supervision is not required to enforce 3D priors, offering a practical approach to more realistic inpainting in complex scenes and providing a foundation for future multi-view or video-consistent diffusion methods.

Abstract

We address the problem of 3D inconsistency in image inpainting based on diffusion models. We propose a generative model trained on image pairs that belong to the same scene. To achieve 3D-consistent and semantically coherent inpainting, we modify the generative diffusion model by incorporating an alternative point of view of the scene into the denoising process. This creates an inductive bias that allows the model to recover 3D priors while training to denoise in 2D, without explicit 3D supervision. Training unconditional diffusion models with additional images as in-context guidance harmonizes the masked and non-masked regions during repainting and ensures 3D consistency. We evaluate our method on one synthetic and three real-world datasets and show that it generates semantically coherent, 3D-consistent inpaintings and outperforms state-of-the-art methods.
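To make the idea of conditioning on the known region while denoising with a second view more concrete, here is a minimal sketch of a single reverse step. It assumes a standard DDPM update with a linear beta schedule in pixel space and a RePaint-style merge of known and generated pixels; the paper's exact update rule is not given in this summary, so this is an illustration rather than the authors' implementation. The names `make_alpha_bar`, `inpaint_step`, and `eps_model` are hypothetical, and `eps_model(x_t, x_ctx, t)` stands in for a trained denoiser that also sees the in-context image.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption) and its cumulative alpha product."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return betas, np.cumprod(1.0 - betas)

def inpaint_step(x_t, x_known, mask, x_ctx, t, eps_model, betas, alpha_bar, rng):
    """One reverse step t -> t-1; mask == 1 marks known (unmasked) pixels."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t

    # Hypothetical denoiser: predicts the noise in x_t given the second view.
    eps_hat = eps_model(x_t, x_ctx, t)

    # Standard DDPM posterior mean for the generated (unknown) region.
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_t)

    if t > 0:
        x_unknown = mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)
        # Known region: noise the clean pixels forward to level t-1.
        ab_prev = alpha_bar[t - 1]
        noise = rng.standard_normal(x_t.shape)
        x_known_noisy = np.sqrt(ab_prev) * x_known + np.sqrt(1.0 - ab_prev) * noise
    else:
        # Final step: no noise is added, known pixels are copied as-is.
        x_unknown = mean
        x_known_noisy = x_known

    # Merge: keep known pixels where mask == 1, generated pixels elsewhere.
    return mask * x_known_noisy + (1.0 - mask) * x_unknown
```

In this sketch the second view enters only through the denoiser's inputs, which matches the in-context guidance described above; how the two images are combined inside the network is left to the model.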

Paper Structure

This paper contains 11 sections, 7 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 2: 3D-consistent inpainting with InConDiff: a) original image, b) image with masked occlusions; c) in-context image; d) InConDiff inpainting preserving 3D consistency.
  • Figure 3: Unconditional DDPM with the forward process (right arrows) and the reverse denoising process (left arrows) that takes into account additional image $\mathbf{x}'$.
  • Figure 4: ViT architecture for learning the denoising model with additional image $\mathbf{x}'$.
  • Figure 5: Image pairs for training unconditional DDPMs from HM3D, MegaDepth, StreetView and WalkingTour datasets.
  • Figure 6: Inpainting with semantic mask using noise schedule with jumps. Note resampling effect for harmonization.
  • ...and 4 more figures
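
The caption of Figure 6 mentions a noise schedule with jumps and a resampling effect. As a rough illustration, the sketch below builds a timestep sequence that repeatedly steps back up to a noisier level before denoising again, in the style of RePaint's jump-and-resample schedule. Whether the paper uses exactly this schedule is an assumption; the function name `jump_schedule` and its parameters are hypothetical.

```python
def jump_schedule(num_steps, jump_length, num_resamples):
    """Timesteps from num_steps down to 0 with periodic jumps back up.

    At every jump_length-th timestep the sequence climbs jump_length steps
    back toward noise, and does so (num_resamples - 1) times, so each
    segment is denoised num_resamples times in total.
    """
    # Remaining jumps allowed at each jump point.
    jumps = {j: num_resamples - 1
             for j in range(0, num_steps - jump_length, jump_length)}

    t = num_steps
    ts = [t]
    while t >= 1:
        t -= 1
        ts.append(t)
        if jumps.get(t, 0) > 0:
            jumps[t] -= 1
            # Re-noise: walk back up jump_length steps before denoising again.
            for _ in range(jump_length):
                t += 1
                ts.append(t)
    return ts

schedule = jump_schedule(num_steps=6, jump_length=2, num_resamples=2)
print(schedule)  # [6, 5, 4, 3, 2, 3, 4, 3, 2, 1, 0, 1, 2, 1, 0]
```

Each upward step corresponds to adding noise to the current sample and each downward step to one denoising step; revisiting the same noise levels gives the generated region several chances to agree with the known region.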