RefFusion: Reference Adapted Diffusion Models for 3D Scene Inpainting

Ashkan Mirzaei, Riccardo De Lutio, Seung Wook Kim, David Acuna, Jonathan Kelly, Sanja Fidler, Igor Gilitschenski, Zan Gojcic

TL;DR

RefFusion addresses 3D scene inpainting by distilling the prior of a reference-adapted diffusion model into a 3D Gaussian-splat representation. It first personalizes a 2D inpainting diffusion model to the reference view at multiple scales using LoRA, then distills the adapted model into the scene with an SDS objective, using per-Gaussian masking so that each loss term updates only the relevant region. Key components include separate global and local SDS losses, 3D-consistent masking of Gaussians, a discriminator loss that mitigates appearance artifacts, and a depth loss that improves geometry. Together these yield sharper, more controllable inpaintings with high multi-view consistency. The approach reports state-of-the-art object removal results and extends to object insertion, scene outpainting, and sparse-view reconstruction, with potential applications in AR/VR and robotics.

Abstract

Neural reconstruction approaches are rapidly emerging as the preferred representation for 3D scenes, but their limited editability remains a challenge. In this work, we propose an approach for 3D scene inpainting: the task of coherently replacing parts of the reconstructed scene with desired content. Scene inpainting is an inherently ill-posed task, as there exist many solutions that plausibly replace the missing content. A good inpainting method should therefore enable not only high-quality synthesis but also a high degree of control. Based on this observation, we focus on enabling explicit control over the inpainted content and leverage a reference image as an efficient means to achieve this goal. Specifically, we introduce RefFusion, a novel 3D inpainting method based on a multi-scale personalization of an image inpainting diffusion model to the given reference view. The personalization effectively adapts the prior distribution to the target scene, resulting in lower variance of the score distillation objective and hence significantly sharper details. Our framework achieves state-of-the-art results for object removal while maintaining high controllability. We further demonstrate the generality of our formulation on other downstream tasks such as object insertion, scene outpainting, and sparse-view reconstruction.
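
For context, the score distillation objective mentioned in the abstract can be written in its generic form, as introduced for text-to-3D generation in DreamFusion. This is a sketch under stated assumptions, not the paper's exact equation: RefFusion uses a personalized inpainting latent diffusion model, so its objective may differ in conditioning and in the space where noise is added. The symbols are illustrative: g(θ, c) renders the scene parameters θ from camera c, ε̂_φ is the diffusion model's noise prediction, y is the conditioning, and w(t) is a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\,\epsilon,\,c}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x = g(\theta, c), \quad
x_t = \alpha_t\, x + \sigma_t\, \epsilon, \quad
\epsilon \sim \mathcal{N}(0, I).
```

The residual in parentheses is a noisy, sampled estimate. The abstract's argument is that adapting the prior to the reference view narrows the distribution the model predicts from, which lowers the variance of this gradient and gives sharper details.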

Paper Structure

This paper contains 13 sections, 7 equations, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: Comparison of 2D image inpainting methods on multiple views of the same scene. LaMa (Suvorov et al., 2022) yields relatively consistent inpaintings but lacks detail. SDXL (Podell et al., 2023) synthesizes high-quality content, but with low multi-view consistency due to the high diversity of its generations. By personalizing the diffusion model to the reference view, our method achieves high-quality generations with superior multi-view consistency. Ours #1 and Ours #2 are adapted to the SDXL outputs shown in the second and third rows, respectively.
  • Figure 2: Overview of the proposed approach. RefFusion takes training views, masks, and the reference view as input (left). We adapt the inpainting LDM on both the global and local crops of the reference view (middle). Then, we distill the priors of the adapted LDM to the scene (right) by minimizing the SDS objective. Additionally, we use a discriminator loss to mitigate potential artifacts in appearance and a depth loss to enhance geometry. We track Gaussians representing the masked and unmasked regions, and backpropagate the gradients of individual terms only to the pertinent regions.
  • Figure 3: Qualitative object removal results on the SPIn-NeRF dataset. RefFusion consistently outperforms the baselines, yielding sharper reconstruction and more plausible inpainting.
  • Figure 4: Qualitative object removal results on scenes with larger camera movements (the MipNeRF360 dataset (Barron et al., 2022) and scenes from our proposed dataset). RefFusion consistently outperforms the Reference-guided NeRF.
  • Figure 5: Sparse-view reconstruction results on the SPIn-NeRF dataset. Using the sparse GT views only for personalization, Ours (LoRA) already yields competitive results. When combined with the reconstruction loss, Ours (LoRA + recon) consistently outperforms 3DGS (Kerbl et al., 2023), showcasing the potential of generative priors to guide 3D reconstruction.
  • ...and 5 more figures
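
The Figure 2 caption describes tracking which Gaussians represent the masked and unmasked regions and backpropagating each loss term only to the pertinent region. A minimal toy sketch of that routing idea follows. Everything in it is an assumption made for illustration: the function name `route_gradients`, the use of one scalar gradient per Gaussian, the region labels, and in particular the assignment of the `sds` term to masked Gaussians and the `recon` term to unmasked ones, which is hypothetical and not taken from the paper.

```python
from typing import Dict, List


def route_gradients(
    is_masked: List[bool],
    term_grads: Dict[str, List[float]],
    term_region: Dict[str, str],
) -> List[float]:
    """Sum per-Gaussian gradients, keeping each loss term only in its region.

    is_masked[i]   -- True if Gaussian i lies in the inpainted (masked) region.
    term_grads[k]  -- gradient of loss term k w.r.t. each Gaussian (toy scalars).
    term_region[k] -- "masked", "unmasked", or "all" (hypothetical labels).
    """
    total = [0.0] * len(is_masked)
    for name, grads in term_grads.items():
        region = term_region[name]
        for i, (masked, grad) in enumerate(zip(is_masked, grads)):
            keep = (
                region == "all"
                or (region == "masked" and masked)
                or (region == "unmasked" and not masked)
            )
            if keep:
                total[i] += grad
    return total


# Toy example: four Gaussians, the first two inside the inpainting mask.
is_masked = [True, True, False, False]
term_grads = {
    "sds": [1.0, 2.0, 3.0, 4.0],
    "recon": [10.0, 20.0, 30.0, 40.0],
}
# Hypothetical term-to-region assignment, chosen only for illustration.
term_region = {"sds": "masked", "recon": "unmasked"}

routed = route_gradients(is_masked, term_grads, term_region)
print(routed)  # [1.0, 2.0, 30.0, 40.0]
```

In a real pipeline the same effect would more likely come from multiplying gradient tensors by a per-Gaussian mask inside the autodiff framework; the explicit loop here only makes the routing rule easy to read.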