InFusion: Inpainting 3D Gaussians via Learning Depth Completion from Diffusion Prior

Zhiheng Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jie Xiao, Kai Zhu, Nan Xue, Yu Liu, Yujun Shen, Yang Cao

TL;DR

InFusion addresses the editing of 3D Gaussian scenes through depth-informed inpainting guided by diffusion priors. The method trains a latent-diffusion depth completion model that takes a depth map $d$, an image $I$, and a mask $m$ and produces depth that aligns with the original view. The completed depth is unprojected into 3D space to initialize new 3D Gaussians, which are then refined in a lightweight fine-tuning phase. The key contributions are the diffusion-based depth completion objective and a progressive inpainting strategy that handles occlusions by drawing on multiple reference views. Together they yield higher fidelity and faster results than prior baselines, and they support practical applications such as texture editing and object insertion. The reported results show substantially improved LPIPS and FID scores and a speed-up of up to roughly two orders of magnitude, illustrating the potential of diffusion priors for 3D scene editing and novel-view synthesis. The paper acknowledges limitations under lighting changes and complex 360-degree edits. As an example of the core mechanics, the loss for view $s(i_1)$ is $\mathcal{L}_{s(i_1)} = (1-\lambda)\|I'_{s(i_1)}-\tilde{I}_{s(i_1)}\|_1 + \lambda\cdot\text{D-SSIM}(I'_{s(i_1)}, \tilde{I}_{s(i_1)})$ with $\lambda=0.2$, and depth completion is carried out by diffusion steps in latent space.
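The loss above combines an L1 term and a D-SSIM term weighted by $\lambda$. The following is a minimal NumPy sketch of that weighting, not the authors' implementation. It rests on several assumptions that the summary does not confirm: D-SSIM is taken as $1 - \text{SSIM}$, the L1 norm is averaged over pixels, and SSIM is computed from whole-image statistics rather than with a sliding window. The names `global_ssim`, `inpainting_loss`, `rendered`, and `target` are hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """SSIM from whole-image statistics (a simplification: no sliding window)."""
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def inpainting_loss(rendered, target, lam=0.2):
    """(1 - lam) * L1 + lam * D-SSIM, assuming D-SSIM = 1 - SSIM."""
    l1 = np.abs(rendered - target).mean()  # assumption: mean absolute error
    d_ssim = 1.0 - global_ssim(rendered, target)
    return (1.0 - lam) * l1 + lam * d_ssim


# Usage: two random "images" with values in [0, 1]
rng = np.random.default_rng(0)
rendered = rng.random((32, 32))
target = rng.random((32, 32))
print(inpainting_loss(rendered, target))
```

With $\lambda = 0.2$ the L1 term carries most of the weight, so the structural term acts mainly as a regularizer on local contrast and structure.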

Abstract

3D Gaussians have recently emerged as an efficient representation for novel view synthesis. This work studies their editability with a particular focus on the inpainting task, which aims to supplement an incomplete set of 3D Gaussians with additional points for visually harmonious rendering. Compared to 2D inpainting, the crux of inpainting 3D Gaussians is to figure out the rendering-relevant properties of the introduced points, whose optimization largely benefits from their initial 3D positions. To this end, we propose to guide the point initialization with an image-conditioned depth completion model, which learns to directly restore the depth map based on the observed image. Such a design allows our model to fill in depth values at a scale aligned with the original depth, and also to harness the strong generalizability of a large-scale diffusion prior. Thanks to the more accurate depth completion, our approach, dubbed InFusion, surpasses existing alternatives with substantially better fidelity and efficiency under various complex scenarios. We further demonstrate the effectiveness of InFusion with several practical applications, such as inpainting with user-specific texture or with novel object insertion.
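To make the role of the completed depth concrete, the sketch below lifts the masked pixels of a depth map to 3D points that could serve as initial positions for new Gaussians. It is an illustration under stated assumptions, not the paper's code: a pinhole camera with intrinsics `K`, depth measured along the camera z-axis, integer pixel indices used directly as image coordinates, and a 4x4 camera-to-world matrix. The function name `unproject_depth` and all parameter names are hypothetical.

```python
import numpy as np


def unproject_depth(depth, mask, K, cam_to_world):
    """Lift masked pixels of a z-depth map to world-space points.

    depth:        (H, W) depth along the camera z-axis (assumed convention)
    mask:         (H, W) bool, True where new points are wanted
    K:            (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world transform
    """
    v, u = np.nonzero(mask)          # pixel rows (v) and columns (u)
    z = depth[v, u]
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) / fx * z            # back-project through the pinhole model
    y = (v - cy) / fy * z
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # (N, 4) homogeneous
    pts_world = pts_cam @ cam_to_world.T
    return pts_world[:, :3]


# Usage: a flat surface 2 units in front of an identity-pose camera
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 2.0],
              [0.0, 0.0, 1.0]])
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True                # the "inpainted" region
points = unproject_depth(depth, mask, K, np.eye(4))
print(points.shape)
```

Because the points are placed directly from depth, any scale mismatch between the completed depth and the original scene depth would shift them off the existing surface, which is why the abstract stresses filling in depth at an aligned scale.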

Paper Structure (24 sections, 7 equations, 11 figures, 1 table)

This paper contains 24 sections, 7 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: Illustration of InFusion driven by Depth Inpainting. Top: To remove a target from the optimized 3D Gaussians, InFusion first inpaints a selected single-view RGB image and then applies the proposed diffusion model to inpaint the depth projection of the targeted 3D Gaussians. The progressive scheme addresses view-dependent occlusion by using other unobstructed viewpoints. Bottom: A detailed view of the training pipeline for the depth inpainting U-Net. The U-Net is trained with mask-driven denoising diffusion and uses a frozen latent tokenizer that takes the RGB image and depth map as inputs.
  • Figure 1: Analysis on Pre-trained Weights.
  • Figure 2: Qualitative Comparison with Baselines. Zoom in for details. Our method exhibits sharp textures that maintain 3D coherence, whereas baseline approaches often yield details that appear blurred.
  • Figure 2: Analysis on Depth Inpainting. Image-based inpainting models, lacking proper guidance, fail to adequately complete the geometric details. Monocular estimation methods, even with a depth alignment method applied, often produce discontinuities within the inpainted regions.
  • Figure 3: Qualitative Comparison with Baselines. We delve into more challenging scenarios, including those with multi-object occlusion, where our method uniquely stands out by accurately inpainting the obscured missing segments.
  • ...and 6 more figures