Visibility-Uncertainty-guided 3D Gaussian Inpainting via Scene Conceptional Learning

Mingxuan Cui, Qing Guo, Yuyi Wang, Hongkai Yu, Di Lin, Qin Zou, Ming-Ming Cheng, Xi Li

TL;DR

VISTA addresses the challenge of 3D Gaussian inpainting by jointly exploiting multi-view visual cues and scene semantics. It introduces visibility-uncertainty-guided 3D Gaussian inpainting (VISTA-GI) and scene conceptual learning (VISTA-CL), then iteratively combines them to fill masked regions with coherent content across views. The method uses visibility uncertainty maps to weight cross-view information and diffusion-based concept learning to fill regions lacking cues, enabling robust static and dynamic distractor removal. Evaluations on SPIn-NeRF and underwater UTB180-based datasets show superior 3D consistency and inpainting quality compared with state-of-the-art approaches, highlighting VISTA’s potential for AR/VR scene editing and dynamic scene reconstruction.

Abstract

3D Gaussian Splatting (3DGS) has emerged as a powerful and efficient 3D representation for novel view synthesis. This paper extends 3DGS capabilities to inpainting, where masked objects in a scene are replaced with new content that blends seamlessly with the surroundings. Unlike 2D image inpainting, 3D Gaussian inpainting (3DGI) must effectively leverage complementary visual and semantic cues from multiple input views, since areas occluded in one view may be visible in others. To address this, we propose a method that measures the visibility uncertainties of 3D points across different input views and uses them to guide 3DGI in utilizing complementary visual cues. We also employ the uncertainties to learn a semantic concept of the scene without the masked object, and use a diffusion model to fill masked objects in the input images based on the learned concept. Finally, we build a novel 3DGI framework, VISTA, by integrating VISibility-uncerTainty-guided 3DGI with scene conceptuAl learning. VISTA generates high-quality 3DGS models capable of synthesizing artifact-free and naturally inpainted novel views. Furthermore, our approach extends to handling dynamic distractors arising from temporal object changes, enhancing its versatility in diverse scene reconstruction scenarios. We demonstrate the superior performance of our method over state-of-the-art techniques on two challenging datasets: the SPIn-NeRF dataset, featuring 10 diverse static 3D inpainting scenes, and an underwater 3D inpainting dataset derived from UTB180, which includes fast-moving fish as inpainting targets.
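
The idea of weighting cross-view cues by visibility uncertainty can be illustrated with a small toy sketch. The abstract does not give the uncertainty formula, so everything below is an assumption made for illustration: the function names `visibility_uncertainty` and `weighted_photometric_loss`, the use of scalar intensities in [0, 1], and the particular way the visible fraction and cross-view colour variance are combined are hypothetical and should not be read as the paper's definition.

```python
from statistics import pvariance


def visibility_uncertainty(observations):
    """Toy uncertainty in [0, 1] for one 3D point (illustrative assumption).

    `observations` holds one entry per input view: an intensity in [0, 1]
    when the point is seen unmasked in that view, or None when it is
    masked or occluded there.
    """
    seen = [v for v in observations if v is not None]
    if not seen:
        return 1.0  # never observed: no visual cue is available
    visible_fraction = len(seen) / len(observations)
    # Variance of values in [0, 1] is at most 0.25, so 4 * var lies in [0, 1].
    disagreement = 4.0 * pvariance(seen)
    return 1.0 - visible_fraction * (1.0 - disagreement)


def weighted_photometric_loss(rendered, target, uncertainty):
    """Squared error averaged with weights (1 - uncertainty) per pixel."""
    weights = [1.0 - u for u in uncertainty]
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every pixel is fully uncertain: nothing to supervise
    errors = (w * (r - t) ** 2 for w, r, t in zip(weights, rendered, target))
    return sum(errors) / total


# A point seen consistently in every view versus one seen in only one of four.
stable = visibility_uncertainty([0.5, 0.5, 0.5, 0.5])
rare = visibility_uncertainty([0.5, None, None, None])
print(stable, rare)
```

In this sketch a point that is seen consistently in all views gets low uncertainty and dominates the loss, while a point seen in few views, or with conflicting colours across views (as a moving fish might produce), is down-weighted. Pixels whose uncertainty stays near 1 are the ones with no usable cross-view cue, which is the case the abstract assigns to the diffusion-based concept fill.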

Paper Structure

This paper contains 27 sections, 10 equations, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Two examples comparing two state-of-the-art methods, InFusion [liu2024infusion] and GaussianGroup [ye2023gaussian], with our proposed method for 3D Gaussian inpainting, filling masked static and dynamic objects, respectively. The red boxes highlight the advantages of our method and are enlarged on the right side of each image for better visibility. The white boxes and arrows indicate complementary visual cues between two different viewpoints.
  • Figure 2: The framework of VISTA comprises two modules: VISTA-GI (described in \ref{subsec:vista-gi}) and VISTA-CL (detailed in \ref{subsec:vista-cl}). Results from three views are displayed for key variables in the framework. Note that $\mathcal{G}$, $\tilde{\mathcal{G}}^1$, $\tilde{\mathcal{G}}^2$, and $\tilde{\mathcal{G}}^3$ are 3DGS representations, and the displayed examples are rendered from these representations. The last column shows generated images derived from the learned scene concept. In the uncertainty map, we use ✫ to highlight areas of high uncertainty, which denote points (e.g., dynamic fish) visible from only a few views. Yellow arrows demonstrate the progressive improvement in inpainting quality achieved by our method.
  • Figure 3: Example of dynamic inpainting on the Underwater 3D Inpainting Dataset.
  • Figure 4: Visualized examples of static inpainting on SPIn-NeRF.
  • Figure 5: (a) Relationship between 3DGS rendering quality and noise reduction ratios in diffusion. (b) Relationship between 3DGS rendering quality and increasing ratio of $\vartheta$.
  • ...and 7 more figures