CoherentGS: Sparse Novel View Synthesis with Coherent 3D Gaussians

Avinash Paliwal, Wei Ye, Jinhui Xiong, Dmytro Kotovenko, Rakesh Ranjan, Vikas Chandra, Nima Khademi Kalantari

TL;DR

A structured Gaussian representation that can be controlled in 2D image space is introduced, along with a regularized optimization and a method for initializing the Gaussians from monocular depth estimates at each input view.

Abstract

The field of 3D reconstruction from images has rapidly evolved in the past few years, first with the introduction of Neural Radiance Field (NeRF) and more recently with 3D Gaussian Splatting (3DGS). The latter provides a significant edge over NeRF in terms of training and inference speed, as well as reconstruction quality. Although 3DGS works well for dense input images, the unstructured point-cloud-like representation quickly overfits in the more challenging setup of extremely sparse input images (e.g., 3 images), creating a representation that appears as a jumble of needles from novel views. To address this issue, we propose regularized optimization and depth-based initialization. Our key idea is to introduce a structured Gaussian representation that can be controlled in 2D image space. We then constrain the Gaussians, in particular their position, and prevent them from moving independently during optimization. Specifically, we introduce single and multiview constraints through an implicit convolutional decoder and a total variation loss, respectively. With the coherency introduced to the Gaussians, we further constrain the optimization through a flow-based loss function. To support our regularized optimization, we propose an approach to initialize the Gaussians using monocular depth estimates at each input view. We demonstrate significant improvements compared to the state-of-the-art sparse-view NeRF-based approaches on a variety of scenes.
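
As a rough illustration of what a Gaussian representation "controlled in 2D image space" could look like, the sketch below places one Gaussian center per pixel by unprojecting a depth map. It is not the authors' implementation: the pinhole camera model, the intrinsics `fx`, `fy`, `cx`, `cy`, and the function name `unproject_depth` are assumptions made for this example.

```python
import numpy as np

def unproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to (H, W, 3) camera-space points.

    Assumes a pinhole camera (hypothetical intrinsics fx, fy, cx, cy).
    Each pixel yields one 3D point, which could serve as the center of
    the Gaussian tied to that pixel.
    """
    h, w = depth.shape
    # Pixel coordinate grids: ys holds row indices, xs holds column indices.
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = (xs - cx) / fx * depth
    y = (ys - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# Example: a flat depth map, e.g. a placeholder for a monocular estimate.
depth = np.full((4, 6), 2.0)
points = unproject_depth(depth, fx=2.0, fy=2.0, cx=3.0, cy=2.0)
```

In this layout, changing a pixel's depth slides its point along that pixel's camera ray, so the 3D positions can be manipulated through a 2D depth image rather than as free 3D coordinates.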

Paper Structure (18 sections, 9 equations, 15 figures, 5 tables)

Figures (15)

  • Figure 1: For sparse input views, the quality of 3DGS deteriorates. Notable artifacts are observed in the results of the NeRF-based methods FreeNeRF (Yang et al. [yang2023freenerf]) and SparseNeRF (Wang et al. [wang2023sparsenerf]). Our approach ("Ours w/o inpainting") yields high-quality synthesized views. Note that our constraints do not allow the Gaussians to move freely in the 3D space. As a result, our approach does not reconstruct the areas that are occluded in all the input images. This is an advantage of our technique over other methods that fill in these areas with blurry and repetitive structure, as we can identify and inpaint these regions and produce realistic hallucinated details. As a proof of concept, we inpaint these regions using a diffusion model and project them to 3D using monocular depth. As shown on the right, the hallucinated details and their corresponding depth are reasonable.
  • Figure 2: Overview of the optimization pipeline. We obtain monocular depth for every input image (Depth Anything [yang2024depth]) and dense flow correspondences between all image pairs (FlowFormer++ [shi2023flowformer++]). These inputs are used to initialize a good set of 3D Gaussians for the subsequent optimization stage. The initialized 3D Gaussians, along with depth-based segmentation masks, are then used to perform a regularized 3D Gaussian optimization to obtain a high-quality reconstruction.
  • Figure 3: During regularized optimization, the implicit decoder predicts the residual depth $\Delta D$ that moves the Gaussians from their initial position towards the true scene depth $D$. The input coordinate $n$ to the decoder corresponds to the input view with camera $cam_n$. To preserve sharp discontinuities, we apply binary segmentation masks, obtained by thresholding the monocular depth, to the decoder output.
  • Figure 4: Our flow-based regularization forces the Gaussians of corresponding pixels in a pair of images (e.g., yellow and cyan squares) to have similar positions by minimizing the distance between them. The binary mask is used to mask out unreliable correspondences.
  • Figure 5: We initialize a set of 3D Gaussians from each view using monocular depth to support our regularized optimization. However, since the monocular depth is relative, the initialized representation is not multi-view consistent (left). Therefore, before Gaussian initialization, we coarsely align the representations from different images using flow correspondences (right). This ensures that the optimization begins from a sensible starting point, which proves to be essential for training a 3D Gaussian representation in this challenging, ill-conditioned setting.
  • ...and 10 more figures
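
The flow-based regularization described in the Figure 4 caption can be sketched as a masked distance between the Gaussians of flow-matched pixels. This is a minimal NumPy illustration under stated assumptions, not the paper's code: the function name `flow_position_loss`, the per-pixel `(H, W, 3)` position layout, the `(dx, dy)` flow convention, and the nearest-pixel rounding are all choices made for this example.

```python
import numpy as np

def flow_position_loss(pos_a, pos_b, flow_ab, mask):
    """Mean masked distance between Gaussians of flow-matched pixels.

    pos_a, pos_b: (H, W, 3) per-pixel Gaussian positions for two views.
    flow_ab:      (H, W, 2) flow from view a to view b as (dx, dy) pixels.
    mask:         (H, W) binary mask, nonzero where a match is trusted.
    """
    h, w, _ = pos_a.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Matched pixel in view b, rounded to the nearest pixel (an assumption;
    # a differentiable version would interpolate instead of rounding).
    xt = np.rint(xs + flow_ab[..., 0]).astype(int)
    yt = np.rint(ys + flow_ab[..., 1]).astype(int)
    # Drop matches that leave the image or that the mask marks unreliable.
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    valid = inside & (mask > 0)
    if not valid.any():
        return 0.0
    matched = pos_b[yt[valid], xt[valid]]
    dist = np.linalg.norm(pos_a[valid] - matched, axis=-1)
    return float(dist.mean())

# Example: view b is view a shifted one pixel to the right.
rng = np.random.default_rng(0)
pos_a = rng.normal(size=(5, 7, 3))
pos_b = np.roll(pos_a, 1, axis=1)
flow = np.zeros((5, 7, 2))
flow[..., 0] = 1.0
mask = np.ones((5, 7))
loss = flow_position_loss(pos_a, pos_b, flow, mask)
```

When the two views already agree along the flow, the loss is zero; displacing one view's Gaussians raises it in proportion to the displacement, and masked-out pixels contribute nothing.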