GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction

Tianyu Xiong, Rui Li, Linjie Li, Jiaqi Yang

TL;DR

GloSplat is a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Its GloSplat-F variant achieves state-of-the-art results among COLMAP-free methods, while GloSplat-A surpasses all COLMAP-based baselines.

Abstract

Feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) have traditionally been treated as separate problems with independent optimization objectives. We present GloSplat, a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Unlike prior joint optimization methods (BARF, NeRF--, 3RGS) that rely purely on photometric gradients for pose refinement, GloSplat preserves explicit SfM feature tracks as first-class entities throughout training: track 3D points are maintained as separate optimizable parameters from Gaussian primitives, providing persistent geometric anchors via a reprojection loss that operates alongside photometric supervision. This architectural choice prevents early-stage pose drift while enabling fine-grained refinement, a capability absent in photometric-only approaches. We introduce two pipeline variants: (1) GloSplat-F, a COLMAP-free variant using retrieval-based pair selection for efficient reconstruction, and (2) GloSplat-A, an exhaustive matching variant for maximum quality. Both employ global SfM initialization followed by joint photometric-geometric optimization during 3DGS training. Experiments demonstrate that GloSplat-F achieves state-of-the-art among COLMAP-free methods while GloSplat-A surpasses all COLMAP-based baselines.
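
To make the idea of a reprojection loss acting alongside photometric supervision concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the pinhole camera model, the mean pixel-distance error, the data layout (`tracks`, `cameras`), and the `weight` value are all assumptions chosen for illustration. In an actual training loop the track points and camera poses would be optimizable parameters updated by gradient descent; this sketch only evaluates the loss.

```python
import math


def project(point, rotation, translation, focal, principal):
    """Pinhole projection of a world-space 3D point to pixel coordinates."""
    # Camera-space coordinates: X_c = R * X_w + t
    xc = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    u = focal * xc[0] / xc[2] + principal[0]
    v = focal * xc[1] / xc[2] + principal[1]
    return u, v


def reprojection_loss(tracks, cameras):
    """Mean pixel distance between observed keypoints and projected track points."""
    total, count = 0.0, 0
    for track in tracks:
        for cam_id, (u_obs, v_obs) in track["observations"].items():
            cam = cameras[cam_id]
            u, v = project(track["xyz"], cam["R"], cam["t"], cam["f"], cam["c"])
            total += math.hypot(u - u_obs, v - v_obs)
            count += 1
    if count == 0:
        raise ValueError("no track observations to evaluate")
    return total / count


def joint_loss(photometric, tracks, cameras, weight=0.1):
    """Photometric term plus a weighted geometric term (weight is hypothetical)."""
    return photometric + weight * reprojection_loss(tracks, cameras)


# Toy scene: one identity-pose camera observing one track point.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = {0: {"R": identity, "t": (0.0, 0.0, 0.0), "f": 100.0, "c": (50.0, 50.0)}}
tracks = [{"xyz": (0.5, 1.0, 2.0), "observations": {0: (72.0, 96.0)}}]

print(reprojection_loss(tracks, cameras))  # 5.0 (point projects to (75, 100))
```

Because the track points are stored separately from the Gaussian primitives, the geometric term stays tied to the original feature observations even while the Gaussians are still poorly fit, which is the anchoring role the abstract describes.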

Paper Structure

This paper contains 54 sections, 6 equations, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: Accuracy vs. Speed. Average PSNR on MipNeRF360 vs. runtime for 1000 images (Courthouse scene). GloSplat-F achieves 13.3× speedup over GPU-accelerated COLMAP+3DGS while improving PSNR by +0.38 dB. GloSplat-A achieves the highest accuracy (28.86 dB), surpassing all baselines. Our joint pose-appearance optimization enables both variants to occupy the Pareto frontier. (All methods benchmarked on the same GPU; see the runtime appendix for details.)
  • Figure 2: GloSplat Pipeline. Given unposed input images, local correspondences are extracted (frozen preprocessing): XFeat+LightGlue with retrieval-based pairs (GloSplat-F) or SIFT with exhaustive matching (GloSplat-A). Global SfM simultaneously estimates all camera poses through rotation averaging, positioning, and bundle adjustment, providing robust initialization. Joint 3DGS training (our core contribution) then continuously refines poses through a reprojection-based BA loss while optimizing Gaussian primitives, enabling combined photometric-geometric supervision that prevents drift and improves reconstruction quality.
  • Figure 3: Runtime Comparison on Courthouse Scene. End-to-end reconstruction time (in seconds) as a function of the number of input images. All methods use the same GPU (RTX PRO 6000); COLMAP is compiled with CUDA and uses GPU acceleration for feature extraction/matching. GloSplat-F achieves 13.3× speedup over COLMAP at 1000 images due to retrieval-based pair selection and parallel global SfM. VGGT-X is faster at smaller scales but GloSplat-F surpasses it at 750+ images due to better asymptotic scaling.
  • Figure 4: Qualitative Comparison on MipNeRF360. We compare novel view synthesis results from GloSplat-A, GloSplat-F, VGGT-X, and Improved-GS against ground truth. GloSplat-A achieves significantly higher PSNR and lower LPIPS across all scenes. On Bonsai, GloSplat-A outperforms VGGT-X by +8.57 dB PSNR. On Flowers, our method shows +6.76 dB PSNR and 46% lower LPIPS (0.147 vs 0.273) over VGGT-X. On Garden, GloSplat-A achieves +5.69 dB over VGGT-X. Best viewed zoomed in.
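
The scaling argument behind retrieval-based pair selection (Figure 3) can be illustrated by counting candidate image pairs. The sketch below is only a back-of-the-envelope comparison: the number of retrieved neighbours per image, `k`, is a hypothetical value not given in this summary, and pair counts alone do not account for the reported 13.3× end-to-end speedup, which also depends on feature extraction, global SfM, and training time.

```python
def exhaustive_pairs(n):
    """Number of unordered image pairs when every image is matched to every other."""
    return n * (n - 1) // 2


def retrieval_pairs_upper_bound(n, k):
    """Upper bound on pairs when each image proposes at most k retrieved neighbours.

    k is an assumed parameter; the bound can never exceed the exhaustive count.
    """
    return min(n * k, exhaustive_pairs(n))


for n in (250, 500, 1000):
    print(n, exhaustive_pairs(n), retrieval_pairs_upper_bound(n, k=20))
```

Exhaustive matching grows quadratically with the number of images, while a fixed neighbour budget grows linearly, which is consistent with the caption's remark about better asymptotic scaling at larger image counts.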