FreeSplat++: Generalizable 3D Gaussian Splatting for Efficient Indoor Scene Reconstruction

Yunsong Wang, Tianxin Huang, Hanlin Chen, Gim Hee Lee

TL;DR

FreeSplat++ addresses indoor whole-scene reconstruction with a generalizable 3D Gaussian Splatting (3DGS) framework by introducing a low-cost cross-view aggregation pipeline, Pixel-wise Triplet Fusion to reduce Gaussian redundancy, and a Weighted Floater Removal strategy to suppress floaters. A depth-regularized per-scene fine-tuning stage further enhances rendering quality while preserving geometric accuracy. The approach yields substantial improvements over prior generalizable 3DGS methods in both region and whole-scene tasks, with fewer Gaussians and shorter training times, particularly when handling long input sequences. These innovations collectively enable efficient, accurate explicit 3D representations for large-scale indoor scenes and offer a practical alternative to per-scene optimization in many contexts.

Abstract

Recently, the integration of the efficient feed-forward scheme into 3D Gaussian Splatting (3DGS) has been actively explored. However, most existing methods focus on sparse-view reconstruction of small regions and cannot produce satisfactory whole-scene reconstruction results in terms of either quality or efficiency. In this paper, we propose FreeSplat++, which extends generalizable 3DGS into an alternative approach to large-scale indoor whole-scene reconstruction, with the potential to significantly accelerate reconstruction and improve geometric accuracy. To facilitate whole-scene reconstruction, we first propose the Low-cost Cross-View Aggregation framework to efficiently process extremely long input sequences. Next, we introduce a carefully designed Pixel-wise Triplet Fusion method that incrementally aggregates the overlapping 3D Gaussian primitives from multiple views, adaptively reducing their redundancy. Furthermore, we propose a Weighted Floater Removal strategy that effectively reduces floaters; it serves as an explicit depth fusion approach, which is crucial in whole-scene reconstruction. After the feed-forward reconstruction of 3DGS primitives, we investigate a depth-regularized per-scene fine-tuning process. Using the dense, multi-view consistent depth maps obtained during the feed-forward prediction phase as an extra constraint, we refine the entire scene's 3DGS primitives to enhance rendering quality while preserving geometric accuracy. Extensive experiments confirm that FreeSplat++ significantly outperforms existing generalizable 3DGS methods, especially in whole-scene reconstruction. Compared to conventional per-scene optimized 3DGS approaches, our method with depth-regularized per-scene fine-tuning demonstrates substantial improvements in reconstruction accuracy and a notable reduction in training time.
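The depth-regularized fine-tuning idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual loss terms or weights, so everything here is an assumption made for illustration: the function name `depth_regularized_loss`, the use of a mean L1 error for both terms, and the `depth_weight` value are hypothetical. The sketch only shows the general pattern of adding a depth term, computed against the feed-forward depth maps, to a photometric term.

```python
from typing import Sequence


def depth_regularized_loss(
    rendered_rgb: Sequence[float],
    target_rgb: Sequence[float],
    rendered_depth: Sequence[float],
    prior_depth: Sequence[float],
    depth_weight: float = 0.1,  # hypothetical weight, not taken from the paper
) -> float:
    """Mean L1 photometric error plus a weighted mean L1 depth error.

    `prior_depth` stands in for the depth predicted in the feed-forward
    phase; the L1 form of both terms is an assumption of this sketch.
    """
    if len(rendered_rgb) != len(target_rgb):
        raise ValueError("RGB inputs must have the same length")
    if len(rendered_depth) != len(prior_depth):
        raise ValueError("depth inputs must have the same length")
    if not rendered_rgb or not rendered_depth:
        raise ValueError("inputs must be non-empty")

    # Photometric term: how far the rendering is from the ground-truth image.
    photometric = sum(
        abs(r - t) for r, t in zip(rendered_rgb, target_rgb)
    ) / len(rendered_rgb)

    # Depth term: how far the rendered depth drifts from the prior depth.
    depth = sum(
        abs(r - p) for r, p in zip(rendered_depth, prior_depth)
    ) / len(rendered_depth)

    return photometric + depth_weight * depth
```

With `depth_weight` set to zero this reduces to a plain photometric loss; a larger value keeps the optimized geometry closer to the feed-forward depth at the cost of less freedom to fit the images.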

Paper Structure

This paper contains 16 sections, 17 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Results on whole-scene reconstruction. FreeSplat++$_{rft}$ denotes the per-scene fine-tuned results. Our model excels at efficiently reconstructing geometrically accurate 3D Gaussian primitives. Furthermore, FreeSplat++ shows superior view consistency, e.g. when rendered from a bird's-eye view, demonstrating the significance of generalizable 3DGS for whole-scene reconstruction.
  • Figure 2: Visualization of the floater challenge in generalizable 3DGS. Noise in the predicted depth maps (red regions) may lead to severe floaters in whole-scene reconstruction.
  • Figure 3: Framework of FreeSplat++. The high-level design of FreeSplat++ includes: (a) Feed-Forward Gaussian Initialization: given a sparse input sequence of images, we construct cost volumes between nearby views and introduce the Pixel-wise Triplet Fusion (PTF) module, which progressively aggregates and updates local/global Gaussian triplets based on pixel-wise alignment. (b) Weighted Floater Removal: leveraging the Gaussian weights accumulated in our PTF process, we further align the global and local Gaussians and incrementally adjust the Gaussian opacities. (c) Depth-Regularized Fine-tuning: we can optionally conduct a fast per-scene fine-tuning step with multi-view consistent depth regularization, thanks to our geometrically accurate Gaussian initialization.
  • Figure 4: Visual illustration of PTF. PTF incrementally projects the current global Gaussians to the input views and computes their pixel-wise distance to the local Gaussians. Nearby local Gaussians are then fused using a lightweight Gated Recurrent Unit (GRU) network.
  • Figure 5: Qualitative ablation study. The first and second rows show whole-scene reconstruction results from ScanNet and ScanNet++, respectively.
  • ...and 2 more figures
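
The fusion step described in the Figure 4 caption (project global Gaussians into a view, compare them pixel-wise with the local Gaussians, fuse the nearby ones) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the triplet is assumed to hold a center, a weight, and a feature; the feature is a single float rather than a latent vector; Gaussians are assumed to be already expressed in the current camera frame; the distance test is a simple depth difference; and the GRU is replaced by a plain weight-averaged merge. The names `Triplet`, `project`, `fuse_view`, and `depth_threshold` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]


@dataclass
class Triplet:
    center: Tuple[float, float, float]  # camera-space position (x, y, z)
    weight: float                       # accumulated fusion weight
    feature: float                      # scalar stand-in for a latent feature


def project(center: Tuple[float, float, float],
            fx: float, fy: float, cx: float, cy: float) -> Pixel:
    """Pinhole projection of a camera-space point to the nearest pixel."""
    x, y, z = center
    return (round(fx * x / z + cx), round(fy * y / z + cy))


def fuse_view(global_ts: List[Triplet],
              local_ts: Dict[Pixel, Triplet],
              intrinsics: Tuple[float, float, float, float],
              depth_threshold: float = 0.05) -> List[Triplet]:
    """Merge pixel-aligned local triplets into the global set for one view."""
    fx, fy, cx, cy = intrinsics
    remaining = dict(local_ts)  # local triplets not yet matched
    fused: List[Triplet] = []
    for g in global_ts:
        if g.center[2] <= 0:  # behind the camera: cannot be projected
            fused.append(g)
            continue
        pixel = project(g.center, fx, fy, cx, cy)
        local = remaining.get(pixel)
        if local is not None and abs(g.center[2] - local.center[2]) < depth_threshold:
            # Weighted average stands in for the GRU-based feature update.
            total = g.weight + local.weight
            center = tuple(
                (g.weight * a + local.weight * b) / total
                for a, b in zip(g.center, local.center)
            )
            feature = (g.weight * g.feature + local.weight * local.feature) / total
            fused.append(Triplet(center, total, feature))
            del remaining[pixel]
        else:
            fused.append(g)
    # Unmatched local triplets become new global triplets.
    fused.extend(remaining.values())
    return fused
```

Calling `fuse_view` once per input view, feeding each result back in as `global_ts`, gives the incremental behavior the caption describes: overlapping Gaussians collapse into one entry with a larger weight, while newly observed regions add entries.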