SfM-Free 3D Gaussian Splatting via Hierarchical Training

Bo Ji, Angela Yao

TL;DR

This work tackles SfM-free novel view synthesis from video input with SfM-Free 3D Gaussian Splatting (SFGS). The method combines three components: a hierarchical training strategy that trains segment-specific base 3D Gaussian splatting models and merges them, video frame interpolation (VFI) to stabilize pose estimation when camera motion is large, and multi-source supervision to mitigate overfitting. Key contributions include a practical merging scheme based on Gaussian importance scores, a level-by-level hierarchy that yields a unified scene representation, and the use of VFI-derived frames and pseudo-views to strengthen the training signal. Empirically, the approach achieves state-of-the-art performance among SfM-free methods on Tanks and Temples (+2.25 dB PSNR on average, up to +3.72 dB in Barn) and CO3D-V2 (+1.74 dB average, up to +3.90 dB), all without SfM preprocessing.

Abstract

Standard 3D Gaussian Splatting (3DGS) relies on known or pre-computed camera poses and a sparse point cloud, obtained from structure-from-motion (SfM) preprocessing, to initialize and grow 3D Gaussians. We propose a novel SfM-Free 3DGS (SFGS) method for video input, eliminating the need for known camera poses and SfM preprocessing. Our approach introduces a hierarchical training strategy that trains and merges multiple 3D Gaussian representations -- each optimized for specific scene regions -- into a single, unified 3DGS model representing the entire scene. To compensate for large camera motions, we leverage video frame interpolation models. Additionally, we incorporate multi-source supervision to reduce overfitting and enhance representation. Experimental results reveal that our approach significantly surpasses state-of-the-art SfM-free novel view synthesis methods. On the Tanks and Temples dataset, we improve PSNR by an average of 2.25dB, with a maximum gain of 3.72dB in the best scene. On the CO3D-V2 dataset, we achieve an average PSNR boost of 1.74dB, with a top gain of 3.90dB. The code is available at https://github.com/jibo27/3DGS_Hierarchical_Training.
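
The partition-train-merge idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: `GaussianSet`, `split_into_segments`, `merge_pair`, `hierarchical_merge`, and the `keep_ratio` parameter are hypothetical names, neighbouring models are merged pairwise, and "merging by importance" is approximated by keeping the highest-scoring Gaussians after concatenation. Training of the base models and pose estimation are omitted entirely.

```python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class GaussianSet:
    """Toy stand-in for a 3DGS model: centers plus one importance score each."""
    centers: np.ndarray     # shape (N, 3)
    importance: np.ndarray  # shape (N,)


def split_into_segments(num_frames: int, segment_len: int) -> List[range]:
    """Partition frame indices 0..num_frames-1 into consecutive segments."""
    return [range(start, min(start + segment_len, num_frames))
            for start in range(0, num_frames, segment_len)]


def merge_pair(a: GaussianSet, b: GaussianSet, keep_ratio: float) -> GaussianSet:
    """Concatenate two models, then keep the highest-importance Gaussians.

    Assumption: top-k pruning stands in for the importance-based merging
    scheme; the actual scoring and selection rule may differ.
    """
    centers = np.concatenate([a.centers, b.centers], axis=0)
    importance = np.concatenate([a.importance, b.importance], axis=0)
    keep = max(1, int(round(keep_ratio * len(importance))))
    top = np.argsort(-importance, kind="stable")[:keep]
    return GaussianSet(centers[top], importance[top])


def hierarchical_merge(models: List[GaussianSet],
                       keep_ratio: float = 0.75) -> GaussianSet:
    """Merge neighbouring models level by level until one model remains."""
    if not models:
        raise ValueError("need at least one base model")
    level = list(models)
    while len(level) > 1:
        next_level = []
        # Merge neighbours (0,1), (2,3), ... so adjacent segments combine first.
        for i in range(0, len(level) - 1, 2):
            next_level.append(merge_pair(level[i], level[i + 1], keep_ratio))
        # An unpaired last model is carried up to the next level unchanged.
        if len(level) % 2 == 1:
            next_level.append(level[-1])
        level = next_level
    return level[0]
```

With four base models, this schedule performs two merges at the first level and one at the second, ending in a single model. The text above does not say what happens after each merge, so the sketch does not model any further optimization of the merged representation.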

Paper Structure

This paper contains 13 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Novel view synthesis results (left) alongside the projected centers of 3D Gaussians (right). Each blue dot represents a projected 3D Gaussian center. Our proposal offers two key advantages: 1) Our 3D Gaussians are well-distributed across the scene, whereas CF-3DGS [Fu_2024_CVPR] has a notable absence of 3D Gaussians on the image's left side (e.g., in the red region); 2) Our learned 3D Gaussians are of high quality. While CF-3DGS places numerous 3D Gaussians in the green region, the rendering quality there is notably inferior to ours.
  • Figure 2: Overview of our proposal. We partition the video into multiple segments, train a base 3DGS model on each segment individually, and then iteratively merge these base models into a single, unified 3DGS model representing the entire scene.
  • Figure 3: Effect of VFI on relative pose estimation between $I_{i}$ and $I_{i+1}$, shown together with the frame interpolated between them. Without VFI, the rendering shows noticeable artifacts in regions affected by camera movement. With VFI, both the rendered interpolated frame and the rendered original frame show fewer artifacts.
  • Figure 4: Qualitative novel view synthesis results on Tanks and Temples [Knapitsch2017]. Our proposal achieves superior rendering quality.
  • Figure 5: Qualitative novel view synthesis results on CO3D-V2 [reizenstein2021common]. Our proposal achieves superior rendering quality.
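
The "projected centers" shown in Figure 1 can be reproduced in spirit by projecting each Gaussian's 3D center into the image. The following is a minimal sketch that assumes a standard pinhole camera with world-to-camera rotation `R`, translation `t`, and intrinsics `K`. The function name `project_centers` and its parameters are hypothetical and are not taken from the paper or its code.

```python
import numpy as np


def project_centers(centers_world, R, t, K, width, height):
    """Project 3D Gaussian centers to pixel coordinates (pinhole model).

    Returns (uv, visible): uv has shape (N, 2) with NaN rows for points
    behind the camera; visible marks points that land inside the image.
    """
    centers_world = np.asarray(centers_world, dtype=float)
    cam = centers_world @ R.T + t            # world -> camera coordinates
    uv = np.full((len(cam), 2), np.nan)
    visible = np.zeros(len(cam), dtype=bool)

    idx = np.flatnonzero(cam[:, 2] > 1e-8)   # keep points in front of camera
    proj = cam[idx] @ K.T                    # apply intrinsics
    px = proj[:, :2] / proj[:, 2:3]          # perspective divide
    uv[idx] = px

    inside = ((px[:, 0] >= 0) & (px[:, 0] < width) &
              (px[:, 1] >= 0) & (px[:, 1] < height))
    visible[idx[inside]] = True
    return uv, visible


# Example with an identity pose and a 100x100 image (illustrative values).
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 2.0],    # on the optical axis
                   [1.0, 0.0, 2.0],    # projects to the right image border
                   [0.0, 0.0, -1.0],   # behind the camera
                   [0.2, -0.1, 1.0]])
uv, visible = project_centers(points, np.eye(3), np.zeros(3), K, 100, 100)
```

Plotting the `uv` rows where `visible` is true over the rendered image gives a dot map of the kind the caption describes, which makes regions with few or no Gaussians easy to spot.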