Sp2360: Sparse-view 360 Scene Reconstruction using Cascaded 2D Diffusion Priors

Soumava Paul, Christopher Wewer, Bernt Schiele, Jan Eric Lenssen

TL;DR

Sp$^2$360 addresses sparse-view 360° scene reconstruction by distilling strong 2D diffusion priors into an explicit 3D Gaussian representation (3DGS). It employs a cascaded diffusion pipeline (in-painting to fill unobserved regions and artifact removal to clean the generated views) to iteratively synthesize and fuse novel views into a coherent 3D model. The method starts from a sparse 3DGS built from $M$ views and autoregressively adds pseudo views, achieving multi-view consistency with modest data and compute. On the challenging Mip-NeRF360 dataset, Sp$^2$360 outperforms prior sparse-view methods, producing rich foreground and background detail from as few as 9 input views.

Abstract

We aim to tackle sparse-view reconstruction of a 360 3D scene using priors from latent diffusion models (LDM). The sparse-view setting is ill-posed and underconstrained, especially for scenes where the camera rotates 360 degrees around a point, as no visual information is available beyond some frontal views focused on the central object(s) of interest. In this work, we show that pretrained 2D diffusion models can strongly improve the reconstruction of a scene with low-cost fine-tuning. Specifically, we present SparseSplat360 (Sp2360), a method that employs a cascade of in-painting and artifact removal models to fill in missing details and clean novel views. Due to superior training and rendering speeds, we use an explicit scene representation in the form of 3D Gaussians over NeRF-based implicit representations. We propose an iterative update strategy to fuse generated pseudo novel views with existing 3D Gaussians fitted to the initial sparse inputs. As a result, we obtain a multi-view consistent scene representation with details coherent with the observed inputs. Our evaluation on the challenging Mip-NeRF360 dataset shows that our proposed 2D to 3D distillation algorithm considerably improves the performance of a regularized version of 3DGS adapted to a sparse-view setting and outperforms existing sparse-view reconstruction methods in 360 scene reconstruction. Qualitatively, our method generates entire 360 scenes from as few as 9 input views, with a high degree of foreground and background detail.
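
The iterative update strategy described above can be sketched as a short loop. This is a minimal illustration rather than the authors' implementation: the function name `fuse_pseudo_views` and the four callables (`render`, `inpaint`, `remove_artifacts`, `refit`) are hypothetical placeholders for the 3DGS renderer, the two diffusion models, and the Gaussian optimization step. Applying in-painting before artifact removal is an assumption taken from the order in which the abstract lists the two models.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Scene = TypeVar("Scene")  # stand-in for a set of 3D Gaussians
Image = TypeVar("Image")  # stand-in for a rendered or generated view
Pose = TypeVar("Pose")    # stand-in for a camera pose


def fuse_pseudo_views(
    scene: Scene,
    observed: Sequence[Tuple[Pose, Image]],
    novel_poses: Sequence[Pose],
    render: Callable[[Scene, Pose], Image],
    inpaint: Callable[[Image], Image],
    remove_artifacts: Callable[[Image], Image],
    refit: Callable[[Scene, List[Tuple[Pose, Image]]], Scene],
) -> Tuple[Scene, List[Tuple[Pose, Image]]]:
    """Autoregressively add pseudo views to a scene fitted on sparse inputs.

    All callables are hypothetical placeholders supplied by the caller.
    """
    views = list(observed)  # starts with the M real input views
    for pose in novel_poses:
        rendered = render(scene, pose)        # has holes and Gaussian artifacts
        filled = inpaint(rendered)            # fill unobserved regions
        pseudo_gt = remove_artifacts(filled)  # clean the generated view
        views.append((pose, pseudo_gt))       # treat it as pseudo ground truth
        scene = refit(scene, views)           # update Gaussians on all views so far
    return scene, views


# Toy stand-ins: a "scene" maps pose -> pixel value; None marks a hole.
final_scene, final_views = fuse_pseudo_views(
    {0: 1.0, 90: 1.0},
    [(0, 1.0), (90, 1.0)],
    novel_poses=[180, 270],
    render=lambda s, p: s.get(p),
    inpaint=lambda img: 0.5 if img is None else img,
    remove_artifacts=lambda img: round(img, 1),
    refit=lambda s, views: {**s, **dict(views)},
)
print(len(final_views))  # 2 observed views + 2 pseudo views
```

Because each pseudo view is rendered from a scene that already includes the earlier pseudo views, later renders see progressively fewer holes, which is the sense in which the process is autoregressive.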

Paper Structure (42 sections, 12 equations, 12 figures, 4 tables, 1 algorithm)

Figures (12)

  • Figure 1: Overview of Sp$^2$360. We render 3D Gaussians fitted to our sparse set of $M$ views from a novel viewpoint. The image has missing regions and Gaussian artifacts, which are fixed by a combination of in-painting and denoising diffusion models. This then acts as pseudo ground truth to spawn and update 3D Gaussians and satisfy the new view constraints. This process is repeated for several novel views spanning the $360^{\circ}$ scene until the representation becomes multi-view consistent.
  • Figure 2: Artifact removal fine-tuning. Pairs of clean images and images with artifacts are obtained from 3DGS fitted to dense and sparse observations, respectively, across $36$ scenes. These are combined with one of 51 synonymous prompts generated by GPT-4 [achiam2023gpt] from a base instruction. SD v1.5 [ldm] is then fine-tuned with a dataset of $10.5K$ samples for the Gaussian artifact removal task.
  • Figure 3: Qualitative comparison of Sp$^2$360 with few-view methods. Our approach consistently fares better in recovering image structure from foggy geometry, where baselines typically struggle with "floaters" and color artifacts. We encourage the reader to refer to our supplemental 360° video, where the benefits of our method can be observed along a smooth trajectory.
  • Figure 4: Ablation study on $9$-view reconstruction of the garden scene. Our fine-tuned artifact removal module and iterative schedule contribute the most toward the quality of the final reconstruction. 3D Gaussians from Sparse 3DGS act as a suitable geometric prior in the absence of explicit view conditioning.
  • Figure 5: Scalability of Sp$^2$360 with input views. Our combination of fine-tuned diffusion priors improves the performance of 3DGS for up to $27$ input views of the bicycle scene, alleviating the need for dense captures.
  • ...and 7 more figures
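
The fine-tuning data described in the Figure 2 caption can be pictured as triplets of a degraded render, its clean counterpart, and an instruction prompt. The sketch below is only an illustration of that data layout: the class `ArtifactRemovalSample`, the function `build_samples`, the file names, and the example prompts are all hypothetical, and drawing the prompt uniformly at random is an assumption, since the caption only says each pair is combined with one of the 51 prompts.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class ArtifactRemovalSample:
    degraded: str  # render from 3DGS fitted to sparse views (has artifacts)
    clean: str     # render from 3DGS fitted to dense views, same camera
    prompt: str    # one of the synonymous instructions


def build_samples(
    pairs: Sequence[Tuple[str, str]],
    prompts: Sequence[str],
    seed: int = 0,
) -> List[ArtifactRemovalSample]:
    """Attach one prompt to every (degraded, clean) image pair.

    Uniform random prompt choice is an assumption, not taken from the paper.
    """
    if not prompts:
        raise ValueError("at least one prompt is required")
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [ArtifactRemovalSample(d, c, rng.choice(prompts)) for d, c in pairs]


# Hypothetical file names and prompts, for illustration only.
example_pairs = [
    ("scene01/sparse/view_003.png", "scene01/dense/view_003.png"),
    ("scene01/sparse/view_017.png", "scene01/dense/view_017.png"),
]
example_prompts = ["Remove the artifacts.", "Clean up the rendering artifacts."]
for sample in build_samples(example_pairs, example_prompts):
    print(sample.degraded, "->", sample.clean, "|", sample.prompt)
```

Varying the prompt wording across samples, as the caption describes, presumably keeps the fine-tuned model from depending on a single phrasing of the instruction.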