SPARS3R: Semantic Prior Alignment and Regularization for Sparse 3D Reconstruction

Yutao Tang, Yuxiang Guo, Deming Li, Cheng Peng

TL;DR

SPARS3R combines accurate pose estimation from Structure-from-Motion with a dense point cloud from depth estimation, and significantly outperforms existing approaches in photorealistic rendering from sparse images.

Abstract

Recent efforts in Gaussian-Splat-based Novel View Synthesis can achieve photorealistic rendering; however, this capability is limited in sparse-view scenarios due to sparse initialization and over-fitting floaters. Recent progress in depth estimation and alignment can provide a dense point cloud from few views; however, the resulting pose accuracy is suboptimal. In this work, we present SPARS3R, which combines the advantages of accurate pose estimation from Structure-from-Motion and a dense point cloud from depth estimation. To this end, SPARS3R first performs a Global Fusion Alignment process that maps a prior dense point cloud to a sparse point cloud from Structure-from-Motion based on triangulated correspondences. RANSAC is applied during this process to distinguish inliers from outliers. SPARS3R then performs a second step, Semantic Outlier Alignment, which extracts semantically coherent regions around the outliers and performs local alignment in these regions. Along with several improvements in the evaluation process, we demonstrate that SPARS3R achieves photorealistic rendering with sparse images and significantly outperforms existing approaches.
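
The global alignment step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which transform model is fitted, so this example assumes a 7-DoF similarity transform (scale, rotation, translation) estimated with Umeyama's closed-form solution inside a basic RANSAC loop. The function names `umeyama` and `ransac_align`, and the parameters `thresh`, `iters`, and `seed`, are hypothetical and not taken from the paper.

```python
import numpy as np

def umeyama(src, dst):
    """Least-squares similarity transform (s, R, t) such that dst ~ s * R @ src + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)              # 3x3 cross-covariance (dst vs. src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                      # avoid returning a reflection
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)      # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

def ransac_align(src, dst, thresh=0.05, iters=200, seed=0):
    """Fit a similarity transform from corresponding points with RANSAC.

    src: (N, 3) points from the dense prior (assumed to have SfM matches)
    dst: (N, 3) matching sparse SfM points
    Returns (s, R, t, inlier_mask); outliers are ~inlier_mask.
    """
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=3, replace=False)   # minimal sample
        s, R, t = umeyama(src[idx], dst[idx])
        err = np.linalg.norm(s * src @ R.T + t - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 3:
        raise ValueError("RANSAC found fewer than 3 inliers")
    s, R, t = umeyama(src[best], dst[best])  # refit on all inliers
    return s, R, t, best
```

In this sketch, the points flagged as outliers (`~inlier_mask`) play the role of the correspondences that the second, Semantic Outlier Alignment step would handle region by region; how those regions are extracted is not covered here.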

Paper Structure

This paper contains 17 sections, 11 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: A visualization of SPARS3R in comparison to current SoTA. Without an additional prior, sparse NVS with Instant-NGP muller2022instant leads to incorrect geometry. FSGS zhu2025fsgs can be blurry due to sparse initialization and insufficient densification. InstantSplat fan2024instantsplat relies on DUSt3R wang2024dust3r initialization with suboptimal poses. Our method, SPARS3R, can reliably render details in the foreground and background with accurate poses.
  • Figure 2: SPARS3R combines a prior dense point cloud $\chi$ and a sparse SfM point cloud $\widebar{\mathbf{X}}$. The prior $\chi$ often has inferior depth accuracy compared to $\widebar{\mathbf{X}}$. SPARS3R first globally aligns all points in $\chi$ onto $\widebar{\mathbf{X}}$, based on shared correspondences. Inliers and outliers are identified through alignment error. SPARS3R then extracts the semantically relevant 2D regions around the outliers to move local regions of $\chi$ in groups, producing a dense point cloud $\chi^*$ that is depth-wise and pose-wise accurate. By providing $\chi^*$ for Gaussian optimization, SPARS3R achieves photorealistic rendering under sparse-view conditions.
  • Figure 3: Sensitivity of different metrics to camera pose shift. We extract a sequence of images with a small pose change at each step and set the first frame as the reference. PSNR, SSIM, LPIPS, and DSIM are computed against this reference. DSIM is the most robust to small pose shifts, as shown by its flattest curve.
  • Figure 4: The trajectory of the camera poses estimated from different approaches for Bonsai and Stump in MipNeRF360 barron2022mip360.
  • Figure 5: Visual comparison of SPARS3R with and without Semantic Outlier Alignment (SOA). Although the dense bonsai in the foreground is aligned with the sparse point cloud, depth differences remain obvious. SOA successfully closes this gap.
  • ...and 1 more figure
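
The kind of measurement described in Figure 3 can be sketched for the simplest of the listed metrics, PSNR. This is only a rough illustration: a horizontal pixel shift (`np.roll`) is used here as a crude stand-in for a small camera pose change, which is an assumption of this sketch and not the paper's protocol. The function names `psnr` and `psnr_under_shift` are hypothetical.

```python
import numpy as np

def psnr(ref, img, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                 # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def psnr_under_shift(image, max_shift=4):
    """PSNR of an image against copies shifted by 0..max_shift pixels.

    The first frame (shift 0) is the reference, mirroring the setup in
    which the first frame of a sequence is used as the reference.
    """
    return [psnr(image, np.roll(image, k, axis=1)) for k in range(max_shift + 1)]
```

A pixel-wise metric such as PSNR compares values at fixed pixel locations, so even a one-pixel shift of a textured image changes the score; this is the sensitivity that motivates looking for a metric with a flatter response.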