KeyGS: A Keyframe-Centric Gaussian Splatting Method for Monocular Image Sequences

Keng-Wei Chang, Zi-Ming Wang, Shang-Hong Lai

TL;DR

This work addresses the bottleneck of pose-dependent training in 3D Gaussian Splatting for monocular sequences. It introduces KeyGS, which quickly estimates rough camera poses using sequential COLMAP SfM on keyframes and then jointly refines poses and the 3DGS representation, aided by a coarse-to-fine frequency-aware densification strategy. The approach reduces training time from hours to minutes and yields more accurate novel view synthesis and camera pose estimates than previous methods, while mitigating pose drift. Overall, KeyGS offers a practical, scalable solution for real-time or near-real-time 3D reconstruction from monocular video with robust pose handling.

Abstract

Reconstructing high-quality 3D models from sparse 2D images has garnered significant attention in computer vision. Recently, 3D Gaussian Splatting (3DGS) has gained prominence due to its explicit representation, efficient training, and real-time rendering capabilities. However, existing methods still depend heavily on accurate camera poses for reconstruction. Although some recent approaches attempt to train 3DGS models on monocular video datasets without Structure-from-Motion (SfM) preprocessing, these methods suffer from prolonged training times, making them impractical for many applications. In this paper, we present an efficient framework that operates without any depth or matching model. Our approach first uses SfM to obtain rough camera poses within seconds, and then refines these poses by leveraging the dense representation in 3DGS. This framework effectively addresses the issue of long training times. Additionally, we integrate the densification process with joint refinement and propose a coarse-to-fine frequency-aware densification to reconstruct different levels of detail. This approach prevents camera pose estimation from being trapped in local minima or drifting due to high-frequency signals. Our method reduces training time from hours to minutes while achieving more accurate novel view synthesis and camera pose estimation than previous methods.
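
The TL;DR describes running sequential SfM on keyframes only. The sketch below illustrates why that can shrink the matching workload; it is not the authors' pipeline. The fixed stride, the window size, and the helper names `select_keyframes`, `exhaustive_pairs`, and `sequential_pairs` are all assumptions made for illustration, and the pair counts are simple arithmetic rather than measured COLMAP runtimes.

```python
def select_keyframes(frames, stride):
    """Keep every `stride`-th frame, starting with the first (assumed fixed stride)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return frames[::stride]


def exhaustive_pairs(n):
    """Number of image pairs when every image is matched against every other."""
    return n * (n - 1) // 2


def sequential_pairs(n, window):
    """Number of pairs when frame i is matched only with the next `window` frames."""
    return sum(min(window, n - 1 - i) for i in range(n))


# Hypothetical 300-frame sequence, stride and window chosen only for illustration.
frames = [f"frame_{i:04d}.jpg" for i in range(300)]
keyframes = select_keyframes(frames, stride=5)

print(len(keyframes))                               # 60 keyframes
print(exhaustive_pairs(len(frames)))                # 44850 pairs on all frames
print(sequential_pairs(len(keyframes), window=10))  # 545 pairs on keyframes
```

Exhaustive matching grows quadratically with the number of images, while windowed sequential matching grows roughly linearly, so sub-sampling and sequential matching compound. The non-keyframe poses still have to be recovered afterwards, which this sketch does not cover.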

Paper Structure

This paper contains 17 sections, 11 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: KeyGS Framework. For each image sequence, we sub-sample $\frac{1}{N}$ of the images as keyframes and run fast, albeit less accurate, sequential SfM within seconds to obtain an initial rough trajectory. We then jointly optimize the camera poses using the KeyGS method. Compared to CF3DGS, KeyGS continuously refines the camera poses to reduce the accumulated errors that lead to localization drift. Additionally, KeyGS achieves detailed reconstruction by refining camera poses.
  • Figure 2: Illustration of Coarse-to-Fine Frequency-Aware Densification. The top part shows the gradient for each frequency with respect to the alignment offset, using different scales $\sigma$ for Gaussian smoothing on the Fourier kernel $\tilde{\mathcal{H}}(u,k)$. Larger $\sigma$ values concentrate the gradient on low frequencies, while decreasing $\sigma$ shifts it to higher frequencies. The bottom part visualizes the training process for various $\sigma$ values. At high $\sigma$ (a), no details are reconstructed. As $\sigma$ decreases, rough contours emerge (b). At low $\sigma$ (c), densification is driven primarily by high-frequency gradients, leading to detailed structures.
  • Figure 3: Comparison of naive joint refinement and our proposed method, frequency-aware densification. (a) Applying naive joint refinement to 3DGS results in over-splitting of the Gaussians to fit high-frequency signals. This causes the Gaussians to become spiky, making alignment more difficult and leading to oscillations in the trajectory. (b) Our proposed frequency-aware densification method uses a coarse-to-fine approach to account for gradients of different frequencies. The reconstruction results are smoother and more accurate, leading to improved camera pose recovery.
  • Figure 4: Coarse-to-Fine Frequency-Aware Densification. Our method aligns signals by preventing premature Gaussian splitting at high frequencies. Although the signal may appear aligned early on, this approach suppresses gradient influence from high-frequency details. As the Gaussian filter scale decreases, the gradient shifts to high frequencies, enabling Gaussians to split and capture finer details.
  • Figure 5: Keyframe-Centric SfM. Our data preprocessing, which runs COLMAP in sequential mode, is at least 10 times faster than running COLMAP in exhaustive mode on the full image set.
  • ...and 1 more figure
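
The captions for Figures 2 and 4 describe a Gaussian filter whose scale $\sigma$ shrinks during training, so that low frequencies dominate first and high frequencies are admitted later. The sketch below illustrates that general idea with a plain spatial Gaussian blur, whose frequency response is $\exp(-2\pi^2\sigma^2 f^2)$. It is not the paper's formulation of $\tilde{\mathcal{H}}(u,k)$, which is not given here; the exponential decay schedule, its start and end values, and the function names are assumptions.

```python
import numpy as np


def gaussian_lowpass_response(freqs, sigma):
    """Frequency response of a spatial Gaussian blur with std `sigma` (in samples).

    `freqs` are in cycles per sample; the response is 1 at f = 0 and decays
    faster for larger `sigma`, i.e. larger sigma suppresses high frequencies.
    """
    return np.exp(-2.0 * (np.pi * sigma * freqs) ** 2)


def sigma_schedule(step, total_steps, sigma_start=8.0, sigma_end=0.5):
    """Hypothetical coarse-to-fine schedule: exponential decay of sigma."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return sigma_start * (sigma_end / sigma_start) ** t


# Low, mid, and high frequency probes (cycles per sample), chosen for illustration.
freqs = np.array([0.01, 0.05, 0.25])
for step in (0, 500, 1000):
    sigma = sigma_schedule(step, total_steps=1000)
    weights = gaussian_lowpass_response(freqs, sigma)
    print(step, round(sigma, 3), np.round(weights, 3))
```

Early in the schedule the weight on the highest probe frequency is close to zero, so it contributes almost nothing to whatever gradient is computed through the filter. As $\sigma$ decays, that weight rises toward 1, which matches the qualitative behaviour the captions describe.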