GScream: Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal

Yuxin Wang, Qianyi Wu, Guofeng Zhang, Dan Xu

TL;DR

GScream addresses object removal in scenes represented by 3D Gaussian Splatting, aiming to preserve multi-view geometry and texture coherence while enabling fast training and rendering. It couples monocular depth guidance for geometry refinement with a bidirectional cross-attention mechanism to transfer texture information between in-painted and visible regions, all within a lightweight Scaffold-GS framework. A depth-alignment strategy using scale $w$ and shift $q$ stabilizes monocular depth signals during optimization. On SPIn-NeRF and IBRNet, GScream delivers competitive or superior novel-view synthesis quality and significant efficiency gains over NeRF-based methods.
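
The depth-alignment step can be read as a least-squares fit. The sketch below is one minimal, dependency-free way to do it, assuming that the scale $w$ and shift $q$ are chosen to minimize the squared difference between the aligned monocular depth and a reference depth over pixels outside the object mask. The function names, the flat-list depth layout, and the closed-form solver are illustrative assumptions, not the paper's implementation.

```python
def fit_scale_shift(mono, reference, valid):
    """Return (w, q) minimizing sum over valid pixels of (w * mono + q - reference)^2.

    mono, reference: flat lists of per-pixel depth values (hypothetical layout).
    valid: flat list of booleans, True where the reference depth is trusted
           (for example, pixels outside the object mask).
    """
    xs = [m for m, v in zip(mono, valid) if v]
    ys = [r for r, v in zip(reference, valid) if v]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two valid pixels")

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        raise ValueError("monocular depth is constant over the valid pixels")

    # Closed-form solution of 1D linear regression: slope, then intercept.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    q = mean_y - w * mean_x
    return w, q


def align_depth(mono, w, q):
    """Apply the fitted scale and shift to every pixel, masked or not."""
    return [w * m + q for m in mono]


if __name__ == "__main__":
    mono = [1.0, 2.0, 3.0, 4.0]
    reference = [2.5, 4.5, 6.5, 0.0]      # last pixel is inside the mask
    valid = [True, True, True, False]
    w, q = fit_scale_shift(mono, reference, valid)
    print(w, q, align_depth(mono, w, q))
```

Fitting only on trusted pixels and then applying $w$ and $q$ everywhere is what lets the aligned monocular depth supply a plausible value inside the masked region, where no reference depth exists.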

Abstract

This paper tackles the intricate challenge of object removal to update the radiance field using 3D Gaussian Splatting. The main challenges of this task lie in preserving geometric consistency and maintaining texture coherence given the highly discrete nature of Gaussian primitives. We introduce a robust framework specifically designed to overcome these obstacles. The key insight of our approach is the enhancement of information exchange between visible and invisible areas, facilitating content restoration in terms of both geometry and texture. Our methodology begins with optimizing the positioning of Gaussian primitives to improve geometric consistency across both removed and visible areas, guided by an online registration process informed by monocular depth estimation. Following this, we employ a novel feature propagation mechanism to bolster texture coherence, leveraging a cross-attention design that bridges Gaussians sampled from both uncertain and certain areas. This approach significantly refines the texture coherence within the final radiance field. Extensive experiments validate that our method not only improves the quality of novel view synthesis for scenes undergoing object removal but also delivers notable efficiency gains in training and rendering speeds.
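
The cross-attention idea in the abstract can be illustrated with a small sketch. It assumes single-head scaled dot-product attention with identity projections and a residual update, applied once in each direction between features of Gaussians sampled from the in-painted (uncertain) region and from the visible (certain) region. These choices, and all names below, are simplifying assumptions for illustration; the paper's module may use learned projections and a different update rule.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """For each query feature, return an attention-weighted mix of context features.

    Keys and values are both the raw context features (identity projections,
    an assumption made to keep the sketch short).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        weights = softmax(scores)
        mixed.append(
            [sum(w * k[d] for w, k in zip(weights, context)) for d in range(dim)]
        )
    return mixed


def bidirectional_cross_attention(inpainted, visible):
    """Exchange information in both directions and add it back residually."""
    from_visible = attend(inpainted, visible)    # in-painted queries, visible context
    from_inpainted = attend(visible, inpainted)  # visible queries, in-painted context
    new_inpainted = [
        [x + y for x, y in zip(f, m)] for f, m in zip(inpainted, from_visible)
    ]
    new_visible = [
        [x + y for x, y in zip(f, m)] for f, m in zip(visible, from_inpainted)
    ]
    return new_inpainted, new_visible


if __name__ == "__main__":
    inpainted = [[1.0, 0.0], [0.0, 1.0]]  # features of two in-painted Gaussians
    visible = [[2.0, 2.0]]                # feature of one visible Gaussian
    print(bidirectional_cross_attention(inpainted, visible))
```

Both directions read from the original features, so the order of the two attention passes does not affect the result.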

Paper Structure (19 sections, 9 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Illustration of object removal using 3D Gaussian representations. Given a set of posed multi-view images and object masks, our goal is to learn a 3D-consistent Gaussian representation of the scene with the object removed, enabling consistent novel view synthesis without that object.
  • Figure 2: Illustration of our GScream framework. It consists of two novel components, which are monocular depth guided training and cross-attention feature regularization. Our 3D Gaussian splatting (3DGS) representation is initialized by the 3D SfM points and supervised by both images and multi-view monocular depth estimation. The additional depth losses help refine the geometry accuracy within the 3DGS framework. The following 3D feature regularization performs texture propagation to refine the appearance within the 3D in-painted region.
  • Figure 3: Illustration of the Cross-attention Feature Regularization. Our regularization module consists of 3D Gaussian Sampling and a Bidirectional Cross-Attention Module, propagating the 3D feature from surrounding blobs to the in-painted region. As a complement to the 2D prior, the cross-attention mechanism enables the transmission of information among 3D Gaussian blobs, further ensuring the similarity of appearance between the in-painted region and its surroundings.
  • Figure 4: Qualitative results compared with the most representative object-removal approaches. Rendered images with the object removed, compared with SPIn-NeRF [mirzaei2023spin], OR-NeRF [yin2023or], and View-Sub [mirzaei2023reference]. Our approach synthesizes high-quality images with a natural removal effect.
  • Figure 5: Qualitative results showing the effectiveness of depth-guided training. We visualize the scene as 3D Gaussians and as 2D rendered images, with and without depth-guided training. The geometry guidance provides more information for filling the missing area with Gaussian blobs.
  • ...and 4 more figures