AuraFusion360: Augmented Unseen Region Alignment for Reference-based 360° Unbounded Scene Inpainting

Chung-Ho Wu, Yang-Jung Chen, Ying-Huan Chen, Jie-Ying Lee, Bo-Hsu Ke, Chun-Wei Tuan Mu, Yi-Chuan Huang, Chin-Yang Lin, Min-Hung Chen, Yen-Yu Lin, Yu-Lun Liu

TL;DR

AuraFusion360 tackles 360° unbounded scene inpainting by combining explicit 3D Gaussian Splatting with diffusion-based 2D inpainting guided by a reference view. The approach introduces depth-aware unseen mask generation, Adaptive Guided Depth Diffusion (AGDD), and SDEdit-based RGB guidance to ensure multi-view consistency and geometric fidelity across large viewpoint changes. A new 360-USID dataset with ground-truth novel views enables rigorous evaluation, and experiments show superior perceptual quality (lower LPIPS) and higher PSNR compared with state-of-the-art methods. This framework enables robust, reference-guided 3D inpainting for VR/AR and architectural visualization, with potential extensions toward better efficiency, dynamic scenes, and language-guided editing.

Abstract

Three-dimensional scene inpainting is crucial for applications from virtual reality to architectural visualization, yet existing methods struggle with view consistency and geometric accuracy in 360° unbounded scenes. We present AuraFusion360, a novel reference-based method that enables high-quality object removal and hole filling in 3D scenes represented by Gaussian Splatting. Our approach introduces (1) depth-aware unseen mask generation for accurate occlusion identification, (2) Adaptive Guided Depth Diffusion, a zero-shot method for accurate initial point placement without requiring additional training, and (3) SDEdit-based detail enhancement for multi-view coherence. We also introduce 360-USID, the first comprehensive dataset for 360° unbounded scene inpainting with ground truth. Extensive experiments demonstrate that AuraFusion360 significantly outperforms existing methods, achieving superior perceptual quality while maintaining geometric accuracy across dramatic viewpoint changes.

Paper Structure

This paper contains 28 sections, 12 equations, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Comparison with different 3D inpainting approaches. Existing methods such as SPin-NeRF [spinnerf] and GScream [wang2024gscream], designed for forward-facing scenes, perform poorly in 360° scenarios. Reference-based methods like Infusion [liu2024infusion] struggle with accurate depth projection, causing fine-tuning artifacts. Gaussian Grouping [ye2023gaussian] frequently misidentifies unseen regions, reducing inpainting quality. Our AuraFusion360 achieves precise unseen masks and improved depth alignment via Adaptive Guided Depth Diffusion, employing SDEdit [meng2022sdedit] for diffusion-guided, multi-view consistent RGB generation.
  • Figure 2: Overview of our method. Our approach takes multi-view RGB images and corresponding object masks as input and outputs a Gaussian representation with the masked objects removed. The pipeline consists of three main stages: (a) Depth-Aware Unseen Mask Generation, which identifies the truly occluded areas, referred to as the "unseen region"; (b) Depth-Aligned Gaussian Initialization on Reference View, which fills the unseen regions with initialized Gaussians carrying reference RGB information after object removal; and (c) SDEdit-Based RGB Guidance for Detail Enhancement, which enhances fine details using an inpainting model while preserving reference view information. Instead of applying SDEdit with random noise, we use DDIM Inversion on the rendered initial Gaussians to generate noise that retains the structure of the reference view, ensuring multi-view consistency across all RGB guidance images.
  • Figure 3: Overview of the Unseen Mask Generation Process using Depth Warping. To obtain the unseen mask for view $n$, we calculate the pixel correspondences between view $n$ and all other views $i$ using the rendered incomplete depth $D_{n}^{\text{incomplete}}$. For each view $i$, the removal region $R_i$ is traversed backward to view $n$ to align occlusions. We then aggregate the results from multiple views, averaging them and applying a threshold to produce the initial contour of the unseen mask. This contour is subsequently converted into a bounding box prompt for SAM2 [ravi2024sam2], which refines the unseen mask to its final version for view $n$.
  • Figure 4: Overview of Adaptive Guided Depth Diffusion (AGDD). The framework takes an image latent, an incomplete depth map, and an unseen mask as inputs to generate aligned depth estimates. (a) The guided region is identified by dilating the unseen mask and subtracting the original mask. (b) At each timestep $t$, the adaptive loss $\mathcal{L}_\text{adaptive}$ is computed between the pre-decoded depth and the incomplete depth to update the noise input $\hat{\epsilon}_t$. This update repeats $N$ times before advancing to the next denoising step, ensuring the estimated depth aligns with the incomplete depth distribution in the guided region.
  • Figure 5: Overview of the 360-USID dataset. Sample images from each scene, including five outdoor scenes (Carton, Cone, Newcone, Skateboard, Plant) and two indoor scenes (Cookie, Sunflower). (Bottom right) The table shows statistics for each scene, including the number of training views and ground truth (GT) novel views. The dataset provides a diverse range of environments for evaluating 3D inpainting methods in both indoor and outdoor settings.
  • ...and 10 more figures
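
The mask arithmetic described in the Figure 3 and Figure 4 captions (average the warped removal regions, threshold, take a bounding box, then build a guided region by dilating the unseen mask and subtracting the original) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the 0.5 threshold, the square dilation kernel, and the dilation radius are all assumptions made for illustration, and the depth-warping and SAM2 refinement steps are not modeled: the sketch starts from masks that are assumed to be already warped into view $n$.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]  # binary mask, row-major, values 0 or 1


def aggregate_masks(warped: List[Mask], threshold: float = 0.5) -> Mask:
    """Average the per-view warped removal masks and threshold the mean.

    The threshold value is an assumption; the source only says a
    threshold is applied to the averaged result.
    """
    n = len(warped)
    h, w = len(warped[0]), len(warped[0][0])
    return [
        [1 if sum(m[y][x] for m in warped) / n >= threshold else 0
         for x in range(w)]
        for y in range(h)
    ]


def bounding_box(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return (x_min, y_min, x_max, y_max) of nonzero pixels, or None.

    A box like this could serve as a box prompt for a segmentation
    model; the prompt format here is hypothetical.
    """
    coords = [(x, y) for y, row in enumerate(mask)
              for x, v in enumerate(row) if v]
    if not coords:
        return None
    xs, ys = zip(*coords)
    return min(xs), min(ys), max(xs), max(ys)


def dilate(mask: Mask, radius: int = 1) -> Mask:
    """Binary dilation with a square (2*radius+1) kernel (assumed shape)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def guided_region(unseen: Mask, radius: int = 1) -> Mask:
    """Ring around the unseen mask: dilated mask minus the original mask."""
    grown = dilate(unseen, radius)
    return [
        [g & (1 - u) for g, u in zip(grown_row, unseen_row)]
        for grown_row, unseen_row in zip(grown, unseen)
    ]
```

Calling `aggregate_masks` on a list of per-view masks, then `bounding_box` on the result, mirrors the contour-to-box step in Figure 3. Calling `guided_region` on the final unseen mask yields the band of known-depth pixels bordering the hole, which is where Figure 4 says the estimated depth is aligned with the incomplete depth.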