Sparse-view Pose Estimation and Reconstruction via Analysis by Generative Synthesis

Qitao Zhao, Shubham Tulsiani

TL;DR

SparseAGS tackles the problem of recovering 3D geometry and camera poses from a small set of unposed views. It introduces an analysis-by-generative-synthesis framework that integrates MV-DreamGaussian with 6-DoF diffusion priors and an outlier-aware optimization, enabling robust joint estimation of the 3D representation $\theta$ and the camera poses $\Pi$ even when the initial poses are imperfect. The approach includes a two-stage DreamGaussian-inspired initialization, a 6-DoF conditioning scheme for real-world novel-view synthesis, and a pipeline that combines discrete search with continuous refinement to identify and correct outlier poses. Empirical results on real and synthetic data show consistent improvements in pose accuracy and 3D reconstruction quality over state-of-the-art baselines, with competitive runtime. This work broadens the applicability of diffusion priors to joint 3D and pose estimation in sparse-view scenarios, offering practical benefits for unposed multi-view reconstruction in the wild.

Abstract

Inferring the 3D structure underlying a set of multi-view images typically requires solving two co-dependent tasks -- accurate 3D reconstruction requires precise camera poses, and predicting camera poses relies on (implicitly or explicitly) modeling the underlying 3D. The classical framework of analysis by synthesis casts this inference as a joint optimization seeking to explain the observed pixels, and recent instantiations learn expressive 3D representations (e.g., Neural Fields) with gradient-descent-based pose refinement of initial pose estimates. However, given a sparse set of observed views, the observations may not provide sufficient direct evidence to obtain complete and accurate 3D. Moreover, large errors in pose estimation may not be easily corrected and can further degrade the inferred 3D. To allow robust 3D reconstruction and pose estimation in this challenging setup, we propose SparseAGS, a method that adapts this analysis-by-synthesis approach by: a) including novel-view-synthesis-based generative priors in conjunction with photometric objectives to improve the quality of the inferred 3D, and b) explicitly reasoning about outliers and using a discrete search with a continuous optimization-based strategy to correct them. We validate our framework across real-world and synthetic datasets in combination with several off-the-shelf pose estimation systems as initialization. We find that it significantly improves the base systems' pose accuracy while yielding high-quality 3D reconstructions that outperform the results from current multi-view reconstruction baselines.
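
The outlier reasoning mentioned in (b) can be illustrated with a leave-one-out check, following the Figure 2 caption: a view is suspicious if including it in 3D inference yields larger errors in the other views. The sketch below is a simplified illustration, not the authors' implementation. The names `find_outlier_views`, `fit_and_score`, `margin`, and the scalar toy "reconstruction" are all hypothetical stand-ins for a real reconstruct-and-render loop.

```python
import numpy as np

def find_outlier_views(num_views, fit_and_score, margin=0.0):
    """Flag views whose removal lowers the error measured on the other views.

    fit_and_score(used) is a hypothetical callback: it reconstructs from the
    view indices in `used` and returns one error per view (length num_views).
    """
    all_views = list(range(num_views))
    base = np.asarray(fit_and_score(all_views), dtype=float)
    outliers = []
    for i in all_views:
        kept = [j for j in all_views if j != i]
        held_out = np.asarray(fit_and_score(kept), dtype=float)
        # Compare errors on the remaining views only, with and without view i.
        improvement = base[kept].mean() - held_out[kept].mean()
        if improvement > margin:
            outliers.append(i)
    return outliers

# Toy stand-in (assumption): each view "observes" one scalar, the
# "reconstruction" is the mean of the used observations, and the per-view
# error is the absolute residual. View 3 plays the badly posed view.
observations = np.array([1.0, 1.1, 0.9, 5.0])

def toy_fit_and_score(used):
    estimate = observations[used].mean()
    return np.abs(observations - estimate)

suspects = find_outlier_views(len(observations), toy_fit_and_score)
print(suspects)
```

In a real pipeline the callback would be far more expensive than this toy, so the number of leave-one-out reconstructions matters; the sketch only shows the decision rule.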

Paper Structure

This paper contains 19 sections, 8 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Given a set of unposed input images, SparseAGS jointly infers the corresponding camera poses and underlying 3D, allowing high-fidelity 3D inference in the wild.
  • Figure 2: (a) Overview of SparseAGS: Given estimated camera poses from off-the-shelf models, our method iteratively reconstructs 3D and optimizes poses leveraging diffusion priors. (b) Detailed View of Each Component: We use rendering loss and multi-view SDS loss for 3D reconstruction while the rendering loss is propagated back to refine camera poses. At the end of each reconstruction iteration, we identify outliers by checking if their involvement in 3D inference yields larger errors in other views, implying the inconsistency of their poses with others.
  • Figure 3: Qualitative Comparison on Camera Pose Accuracy. Given initial poses from off-the-shelf methods (top to bottom: DUSt3R [wang2023DUSt3R], Ray Diff. [zhang2024cameras], and RelPose++ [lin2023relpose++]), the refined poses from SPARF [truong2023sparf] are compared with the output of SparseAGS. The estimated cameras are aligned with ground truth (in black) with an optimal similarity transform. More results are available in the appendix.
  • Figure 4: Qualitative Comparison with LEAP [jiang2024leap] on Novel View Synthesis. We use two pose estimation baselines (Ray Diffusion [zhang2024cameras] and DUSt3R [wang2023DUSt3R]). SparseAGS better preserves details from the input images and shows enhanced performance with more accurate initial camera poses. More results are available in the appendix.
  • Figure 5: Qualitative Comparison with UpFusion [kani2023upfusion] on Novel View Synthesis. We use two pose estimation baselines (Ray Diffusion [zhang2024cameras] and DUSt3R [wang2023DUSt3R]) as in Figure 4. Note that the left eye and symbol ② of the Chicken Racer are missing in UpFusion's output, probably because of the "first-image bias", while SparseAGS preserves these details.
  • ...and 4 more figures
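
The Figure 3 caption notes that estimated cameras are aligned with ground truth using an optimal similarity transform. A minimal sketch of one standard way to compute such an alignment on camera centers (the closed-form least-squares solution usually attributed to Umeyama) is given below. The function name `align_similarity` and the choice to align camera centers only are assumptions for illustration; the paper's evaluation code may differ.

```python
import numpy as np

def align_similarity(src, dst):
    """Least-squares similarity transform with dst ~= s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points (e.g. camera centers).
    Returns scale s (float), rotation R (3x3), translation t (3,).
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, D, Vt = np.linalg.svd(cov)

    # Guard against a reflection so that R is a proper rotation.
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0

    R = U @ S @ Vt
    var_src = (src_c ** 2).sum() / len(src)
    s = np.trace(np.diag(D) @ S) / var_src
    t = mu_dst - s * (R @ mu_src)
    return s, R, t

def apply_similarity(points, s, R, t):
    """Map (N, 3) points through the similarity transform."""
    return s * np.asarray(points, dtype=float) @ R.T + t
```

After alignment, per-camera errors can be measured between `apply_similarity(estimated_centers, s, R, t)` and the ground-truth centers; both variable names here are placeholders.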