GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction

Tianyu Xiong; Rui Li; Linjie Li; Jiaqi Yang

TL;DR

GloSplat is a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting training. Its COLMAP-free variant, GloSplat-F, achieves state-of-the-art results among COLMAP-free methods, while its exhaustive-matching variant, GloSplat-A, surpasses all COLMAP-based baselines.

Abstract

Feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) have traditionally been treated as separate problems with independent optimization objectives. We present GloSplat, a framework that performs joint pose-appearance optimization during 3D Gaussian Splatting (3DGS) training. Unlike prior joint optimization methods (BARF, NeRF--, 3RGS) that rely purely on photometric gradients for pose refinement, GloSplat preserves explicit SfM feature tracks as first-class entities throughout training: track 3D points are maintained as optimizable parameters separate from the Gaussian primitives, providing persistent geometric anchors via a reprojection loss that operates alongside photometric supervision. This architectural choice prevents early-stage pose drift while enabling fine-grained refinement, a capability absent in photometric-only approaches. We introduce two pipeline variants: (1) GloSplat-F, a COLMAP-free variant using retrieval-based pair selection for efficient reconstruction, and (2) GloSplat-A, an exhaustive matching variant for maximum quality. Both employ global SfM initialization followed by joint photometric-geometric optimization during 3DGS training. Experiments demonstrate that GloSplat-F achieves state-of-the-art results among COLMAP-free methods, while GloSplat-A surpasses all COLMAP-based baselines.
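To make the idea of a reprojection loss operating alongside photometric supervision concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple pinhole camera with a single focal length, a mean-squared pixel error over track observations, and a scalar weight combining the two terms. The function names (`project`, `reprojection_loss`, `joint_loss`), the camera dictionary layout, and the default `weight=0.1` are all hypothetical; the abstract does not specify the loss form or weighting.

```python
def project(point, rotation, translation, focal, principal):
    """Pinhole projection of a 3D world point to pixel coordinates (assumed model)."""
    # Camera-frame coordinates: X_c = R * X + t
    xc = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    u = focal * xc[0] / xc[2] + principal[0]
    v = focal * xc[1] / xc[2] + principal[1]
    return (u, v)


def reprojection_loss(track_points, observations, cameras):
    """Mean squared pixel error over all (track, camera) observations.

    track_points: list of 3D points, kept separate from any Gaussian primitives.
    observations: list of (track_index, camera_index, (u_obs, v_obs)).
    cameras: list of dicts with keys "R", "t", "f", "c" (hypothetical layout).
    """
    total = 0.0
    for track_idx, cam_idx, (u_obs, v_obs) in observations:
        cam = cameras[cam_idx]
        u, v = project(track_points[track_idx], cam["R"], cam["t"], cam["f"], cam["c"])
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return total / len(observations)


def joint_loss(photometric, reprojection, weight=0.1):
    """Combine photometric and geometric terms; the weight is an assumed value."""
    return photometric + weight * reprojection


# Toy setup: one identity-pose camera observing two track points.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [{"R": IDENTITY, "t": [0.0, 0.0, 0.0], "f": 100.0, "c": (50.0, 50.0)}]
track_points = [[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]]
observations = [(0, 0, (50.0, 50.0)), (1, 0, (103.0, 54.0))]

geometric = reprojection_loss(track_points, observations, cameras)
total = joint_loss(photometric=0.2, reprojection=geometric)
```

In a real training loop both the camera poses and the track points would receive gradients from the geometric term, while the photometric term would come from rendering the Gaussians; here the photometric value is just a placeholder number.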

Paper Structure (54 sections, 6 equations, 4 figures, 10 tables, 1 algorithm)


Introduction
Related Work
Novel View Synthesis and 3D Gaussian Splatting.
Structure from Motion.
Learned Features and Matching.
COLMAP-Free Methods.
Joint Pose and Radiance Optimization.
Method
Learned Feature Extraction and Matching
Image Pair Selection.
Feature Extraction and Matching.
Global Structure from Motion
View Graph and Calibration.
Rotation Averaging.
Track Establishment and Positioning.
...and 39 more sections

Figures (4)

Figure 1: Accuracy vs. Speed. Average PSNR on MipNeRF360 vs. runtime for 1000 images (Courthouse scene). GloSplat-F achieves 13.3$\times$ speedup over GPU-accelerated COLMAP+3DGS while improving PSNR by 0.38 dB. GloSplat-A achieves the highest accuracy (28.86 dB), surpassing all baselines. Our joint pose-appearance optimization enables both variants to occupy the Pareto frontier. (All methods are benchmarked on the same GPU; see the runtime appendix for details.)
Figure 2: GloSplat Pipeline. Given unposed input images, local correspondences are extracted (frozen preprocessing): XFeat+LightGlue with retrieval-based pairs (GloSplat-F) or SIFT with exhaustive matching (GloSplat-A). Global SfM simultaneously estimates all camera poses through rotation averaging, positioning, and bundle adjustment, providing robust initialization. Joint 3DGS training (our core contribution) then continuously refines poses through a reprojection-based BA loss while optimizing Gaussian primitives, enabling combined photometric-geometric supervision that prevents drift and improves reconstruction quality.
Figure 3: Runtime Comparison on Courthouse Scene. End-to-end reconstruction time (in seconds) as a function of the number of input images. All methods use the same GPU (RTX PRO 6000); COLMAP is compiled with CUDA and uses GPU acceleration for feature extraction/matching. GloSplat-F achieves 13.3$\times$ speedup over COLMAP at 1000 images due to retrieval-based pair selection and parallel global SfM. VGGT-X is faster at smaller scales but GloSplat-F surpasses it at 750+ images due to better asymptotic scaling.
Figure 4: Qualitative Comparison on MipNeRF360. We compare novel view synthesis results from GloSplat-A, GloSplat-F, VGGT-X, and Improved-GS against ground truth. GloSplat-A achieves significantly higher PSNR and lower LPIPS across all scenes. On Bonsai, GloSplat-A outperforms VGGT-X by +8.57 dB PSNR. On Flowers, our method shows +6.76 dB PSNR and 46% lower LPIPS (0.147 vs 0.273) over VGGT-X. On Garden, GloSplat-A achieves +5.69 dB over VGGT-X. Best viewed zoomed in.
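
The Figure 3 caption attributes part of GloSplat-F's speedup to retrieval-based pair selection and better asymptotic scaling. A rough way to see why is to count candidate image pairs: exhaustive matching grows quadratically with the number of images, while retrieval with a fixed number of neighbors per image grows linearly. The sketch below is illustrative only; the function names and the value `top_k=20` are assumptions, since the number of retrieved neighbors is not stated here.

```python
def exhaustive_pair_count(num_images):
    """Number of unordered image pairs when every image is matched to every other."""
    return num_images * (num_images - 1) // 2


def retrieval_pair_count(num_images, top_k):
    """Upper bound on pairs when each image is matched to its top_k retrieved neighbors.

    This is an upper bound because (i, j) and (j, i) can be the same pair.
    """
    neighbors = min(top_k, num_images - 1)
    return num_images * neighbors


# Compare growth at the image counts of interest; top_k=20 is a hypothetical setting.
for n in (250, 500, 750, 1000):
    print(n, exhaustive_pair_count(n), retrieval_pair_count(n, top_k=20))
```

Pair count is only one factor in end-to-end runtime, so this does not reproduce the reported 13.3$\times$ figure; it only shows the difference in growth rate.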
