InFusion: Inpainting 3D Gaussians via Learning Depth Completion from Diffusion Prior
Zhiheng Liu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jie Xiao, Kai Zhu, Nan Xue, Yu Liu, Yujun Shen, Yang Cao
TL;DR
InFusion addresses the editing of 3D Gaussian scenes through depth-informed inpainting guided by a diffusion prior. The method trains a latent-diffusion depth completion model that takes a depth map $d$, an image $I$, and a mask $m$, and produces depth that aligns with the original view. The completed depth is unprojected into 3D space to initialize new 3D Gaussians, which are then refined in a lightweight fine-tuning phase. The key contributions are the diffusion-based depth completion objective and a progressive inpainting strategy that handles occlusions across multiple reference views. The paper also demonstrates practical applications in texture editing and object insertion. Reported results show substantially improved LPIPS and FID scores over prior baselines and a speed-up of up to roughly two orders of magnitude, while the authors acknowledge limitations under lighting changes and complex 360-degree edits. The loss for view $s(i_1)$ is $\mathcal{L}_{s(i_1)} = (1-\lambda)\|I'_{s(i_1)}-\tilde{I}_{s(i_1)}\|_1 + \lambda\cdot\text{D-SSIM}(I'_{s(i_1)}, \tilde{I}_{s(i_1)})$ with $\lambda=0.2$, and depth completion itself runs as diffusion steps in latent space.
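To make the weighting in the loss above concrete, here is a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions that the summary does not state: D-SSIM is taken to be $1 - \text{SSIM}$, the L1 term is averaged over pixels, and SSIM is computed from global image statistics rather than the sliding Gaussian window that practical implementations normally use. The names `global_ssim`, `inpainting_loss`, `rendered`, and `target` are hypothetical.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM from whole-image statistics (assumption: no sliding window)."""
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def inpainting_loss(rendered, target, lam=0.2):
    """(1 - lam) * L1 + lam * D-SSIM, assuming D-SSIM = 1 - SSIM."""
    l1 = np.abs(rendered - target).mean()  # assumption: mean, not sum, over pixels
    d_ssim = 1.0 - global_ssim(rendered, target)
    return (1.0 - lam) * l1 + lam * d_ssim


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rendered = rng.random((64, 64, 3))  # stand-in for a rendered view
    target = rng.random((64, 64, 3))    # stand-in for the reference image
    print(inpainting_loss(rendered, target))
```

With $\lambda=0.2$ the per-pixel L1 term carries most of the weight, and the structural term acts as a smaller correction.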
Abstract
3D Gaussians have recently emerged as an efficient representation for novel view synthesis. This work studies their editability with a particular focus on the inpainting task, which aims to supplement an incomplete set of 3D Gaussians with additional points for visually harmonious rendering. Compared to 2D inpainting, the crux of inpainting 3D Gaussians is to figure out the rendering-relevant properties of the introduced points, whose optimization largely benefits from their initial 3D positions. To this end, we propose to guide the point initialization with an image-conditioned depth completion model, which learns to directly restore the depth map based on the observed image. Such a design allows our model to fill in depth values at an aligned scale with the original depth, and also to harness strong generalizability from a large-scale diffusion prior. Thanks to the more accurate depth completion, our approach, dubbed InFusion, surpasses existing alternatives with sufficiently better fidelity and efficiency under various complex scenarios. We further demonstrate the effectiveness of InFusion with several practical applications, such as inpainting with user-specific texture or with novel object insertion.
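As an illustration of how a completed depth map can supply initial 3D positions for new points, the sketch below unprojects the masked pixels of a depth map into world space. It is a generic pinhole-camera sketch, not the authors' code, and assumes conventions the summary does not specify: `depth` stores z-depth along the optical axis, `K` is a 3x3 intrinsics matrix, `cam_to_world` is a 4x4 camera-to-world pose, and pixel coordinates are integer indices with no half-pixel offset. The function name `unproject_depth` is hypothetical.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world, mask):
    """Lift masked pixels of a depth map to 3D world-space points.

    depth: (H, W) z-depth per pixel (assumed convention)
    K: (3, 3) pinhole intrinsics
    cam_to_world: (4, 4) camera-to-world transform
    mask: (H, W) boolean, True where new points are needed
    """
    v, u = np.nonzero(mask)            # row (v) and column (u) indices of masked pixels
    z = depth[v, u].astype(np.float64)
    x = (u - K[0, 2]) * z / K[0, 0]    # back-project along the x axis
    y = (v - K[1, 2]) * z / K[1, 1]    # back-project along the y axis
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous camera coords
    return (pts_cam @ cam_to_world.T)[:, :3]


if __name__ == "__main__":
    depth = np.full((4, 4), 2.0)
    mask = np.zeros((4, 4), dtype=bool)
    mask[1:3, 1:3] = True              # inpainting region
    K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 2.0], [0.0, 0.0, 1.0]])
    points = unproject_depth(depth, K, np.eye(4), mask)
    print(points.shape)                # one 3D point per masked pixel
```

Because every unprojected point inherits its position from the depth value, a completion whose scale matches the existing depth places the new points consistently with the existing scene geometry.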
