Generative Object Insertion in Gaussian Splatting with a Multi-View Diffusion Model

Hongliang Zhong, Can Wang, Jingbo Zhang, Jing Liao

TL;DR

The paper tackles the challenge of inserting new objects into 3D scenes represented by Gaussian Splatting with view-consistent quality. It introduces MVInpainter, a multi-view diffusion model built atop Stable Video Diffusion, augmented with a ControlNet-based conditioning path to enforce view-aware inpainting across multiple viewpoints. A mask-aware reconstruction stage then refines the edited Gaussian Splatting by leveraging both inpainted views and training views, reducing artifacts and preserving scene background. Quantitative and qualitative results show superior view-consistency, object quality, and scene harmony compared to SDS-based and single-view-inpainting baselines, indicating meaningful gains for 3D content creation in VR, gaming, and digital media. Limitations include data scarcity for full 360-degree coverage, object removal challenges, and shadows, suggesting avenues for future work.

Abstract

Generating and inserting new objects into 3D content is a compelling approach for achieving versatile scene recreation. Existing methods, which rely on SDS optimization or single-view inpainting, often struggle to produce high-quality results. To address this, we propose a novel method for object insertion in 3D content represented by Gaussian Splatting. Our approach introduces a multi-view diffusion model, dubbed MVInpainter, which is built upon a pre-trained Stable Video Diffusion model to facilitate view-consistent object inpainting. Within MVInpainter, we incorporate a ControlNet-based conditional injection module to enable controlled and more predictable multi-view generation. After generating the multi-view inpainted results, we further propose a mask-aware 3D reconstruction technique to refine Gaussian Splatting reconstruction from these sparse inpainted views. By leveraging these techniques, our approach yields diverse results, ensures view-consistent and harmonious insertions, and produces better object quality. Extensive experiments demonstrate that our approach outperforms existing methods.

Paper Structure

This paper contains 15 sections, 3 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: Method Framework. The framework is divided into three main parts: Inputs, MVInpainter for Inpainting, and Mask-aware Reconstruction. Initially, the inputs include the original Gaussian scene $\Theta$, a bounding box (BBox) $b$, and a text prompt $y$. For the MVInpainter to perform inpainting, background images $I^{bg}$, masks $M$, and depth maps $D$ are first derived. Using these inputs, along with the conditioning input $I^c$, the MVInpainter generates consistent inpainted views $I'$. Finally, in the Mask-aware Reconstruction phase, the inpainted Gaussian scene $\Theta'$ is reconstructed using both inpainted and original training views for novel view synthesis, guided by a mask derived from the BBox $b$.
  • Figure 2: Bounding Box Placement. The placement of the bounding box enables the user to freely define the editing region in a convenient and intuitive manner.
  • Figure 3: MVInpainter Pipeline. The MVInpainter integrates a multi-view diffusion module (MVD) and a ControlNet-based condition injection module to achieve view-consistent inpainting across multiple viewpoints. The MVD, adapted from the Stable Video Diffusion (SVD) model, generates consistent multi-view outputs conditioned on an image $I^{c}$, while the ControlNet module refines the inpainting by guiding the foreground content, managing the background appearances, and controlling the trajectory of the generated outputs conditioned on the depth $D$, the background $I^{bg}$, and the mask $M$.
  • Figure 4: Qualitative Comparison with Other State-of-the-art Methods. Our approach directly generates view-consistent appearances across multiple viewpoints, bypassing SDS optimization and ensuring harmonious integration of generated objects with the scene from various angles. As a result, it achieves the most authentic generative object insertion.
  • Figure 5: Editing Results in Various Scenes. Our model achieves realistic and effective editing in diverse scenes.
  • ...and 9 more figures
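
The Figure 1 caption mentions masks $M$ and a reconstruction stage "guided by a mask derived from the BBox $b$". The sketch below illustrates one plausible reading of that idea and is not the authors' implementation. It assumes a pinhole camera, that all eight box corners lie in front of the camera, and that the mask is simply the 2D rectangle enclosing the projected corners. The loss weighting (inpainted views supervise inside the mask, original training views supervise outside it) is likewise an assumption made for illustration. The names `project_bbox_mask` and `mask_aware_l1` are hypothetical.

```python
import numpy as np


def project_bbox_mask(corners_world, K, w2c, height, width):
    """Binary mask covering the 2D extent of a projected 3D box.

    corners_world: (8, 3) box corners in world space.
    K: (3, 3) pinhole intrinsics. w2c: (4, 4) world-to-camera matrix.
    Assumption: every corner has positive depth in the camera frame.
    """
    pts_h = np.hstack([corners_world, np.ones((len(corners_world), 1))])
    cam = (w2c @ pts_h.T).T[:, :3]          # corners in camera space
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]             # perspective divide -> pixels

    u_min, v_min = np.floor(uv.min(axis=0)).astype(int)
    u_max, v_max = np.ceil(uv.max(axis=0)).astype(int)
    u_min, u_max = np.clip([u_min, u_max], 0, width)
    v_min, v_max = np.clip([v_min, v_max], 0, height)

    mask = np.zeros((height, width), dtype=bool)
    mask[v_min:v_max, u_min:u_max] = True
    return mask


def mask_aware_l1(rendered, target, mask, is_inpainted_view):
    """Mean L1 error restricted to the region a view is trusted for.

    Assumed weighting (illustrative only): inpainted views supervise the
    edited region inside the mask; original training views supervise the
    untouched background outside it.
    """
    err = np.abs(rendered - target).mean(axis=-1)   # (H, W) per-pixel error
    weight = mask if is_inpainted_view else ~mask
    return float((err * weight).sum() / max(weight.sum(), 1))
```

Under these assumptions, an original training view contributes nothing inside the box, so the old background cannot pull the inserted object back toward the unedited scene, while the inpainted views leave the background outside the box untouched.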