Learning Efficient Fuse-and-Refine for Feed-Forward 3D Gaussian Splatting

Yiming Wang, Lucy Chai, Xuan Luo, Michael Niemeyer, Manuel Lagunas, Stephen Lombardi, Siyu Tang, Tiancheng Sun

TL;DR

This work tackles fast, high-quality novel view synthesis from sparse views by addressing the redundancy and limited 3D flexibility of pixel-aligned Gaussian primitives. It introduces Fuse-and-Refine, a learning-based module that merges and refines Gaussians in a canonical 3D space via a hybrid Splat-Voxel representation, supported by a Sparse Voxel Transformer. The approach enables history-aware online reconstruction and zero-shot streaming for dynamic scenes, achieving state-of-the-art results on static and streaming datasets with interactive rates on a single GPU. The method demonstrates robust performance across multi-view datasets and offers extensible integration with other Gaussian Splatting frameworks through the Splat-Voxel design, while acknowledging limitations in large-baseline setups and temporal warping artifacts. Overall, the framework advances real-time, temporally coherent 3D reconstruction from sparse inputs, with significant implications for AR/VR and immersive streaming applications.

Abstract

Recent advances in feed-forward 3D Gaussian Splatting have led to rapid improvements in efficient scene reconstruction from sparse views. However, most existing approaches construct Gaussian primitives directly aligned with the pixels in one or more of the input images. This leads to redundancies in the representation when input views overlap and constrains the position of the primitives to lie along the input rays without full flexibility in 3D space. Moreover, these pixel-aligned approaches do not naturally generalize to dynamic scenes, where effectively leveraging temporal information requires resolving both redundant and newly appearing content across frames. To address these limitations, we introduce a novel Fuse-and-Refine module that enhances existing feed-forward models by merging and refining the primitives in a canonical 3D space. At the core of our method is an efficient hybrid Splat-Voxel representation: from an initial set of pixel-aligned Gaussian primitives, we aggregate local features into a coarse-to-fine voxel hierarchy, and then use a sparse voxel transformer to process these voxel features and generate refined Gaussian primitives. By fusing and refining an arbitrary number of inputs into a consistent set of primitives, our representation effectively reduces redundancy and naturally adapts to temporal frames, enabling history-aware online reconstruction of dynamic scenes. Our approach achieves state-of-the-art performance in both static and streaming scene reconstructions while running at interactive rates (15 fps with 350ms delay) on a single H100 GPU.
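The fusion step described in the abstract (aggregating features from pixel-aligned primitives into a coarse-to-fine voxel hierarchy) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes nearest-cell assignment of splat centers and plain mean pooling, whereas the paper uses its own splat-to-voxel transfer strategy and learned processing. The names `deposit_splats_to_voxels`, `coarsen_voxels`, `voxel_size`, and `factor` are hypothetical.

```python
import numpy as np

def deposit_splats_to_voxels(positions, features, voxel_size):
    """Mean-pool per-splat features into the sparse set of occupied voxels.

    Assumption: each splat contributes only to the voxel containing its center.
    """
    # Integer voxel coordinates of every splat center.
    coords = np.floor(positions / voxel_size).astype(np.int64)
    # Unique occupied voxels; `inverse` maps each splat to its voxel row.
    voxels, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(voxels), features.shape[1]), dtype=np.float64)
    np.add.at(sums, inverse, features)  # unbuffered scatter-add
    counts = np.bincount(inverse, minlength=len(voxels))
    return voxels, sums / counts[:, None], counts

def coarsen_voxels(voxels, feats, counts, factor):
    """Build one coarser level by count-weighted pooling of child voxels."""
    parents = np.floor_divide(voxels, factor)
    coarse, inverse = np.unique(parents, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)
    sums = np.zeros((len(coarse), feats.shape[1]), dtype=np.float64)
    np.add.at(sums, inverse, feats * counts[:, None])
    coarse_counts = np.bincount(inverse, weights=counts, minlength=len(coarse))
    return coarse, sums / coarse_counts[:, None], coarse_counts

# Toy input: two overlapping splats and one separate splat.
positions = np.array([[0.1, 0.1, 0.1], [0.2, 0.3, 0.1], [1.5, 0.1, 0.1]])
features = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
fine_vox, fine_feat, fine_cnt = deposit_splats_to_voxels(positions, features, 1.0)
coarse_vox, coarse_feat, coarse_cnt = coarsen_voxels(fine_vox, fine_feat, fine_cnt, 2)
```

In the toy input, the two overlapping splats land in the same fine voxel, so three primitives collapse into two occupied cells. This is the redundancy reduction the abstract refers to, in its simplest form.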

Paper Structure

This paper contains 45 sections, 15 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Method Overview. Our Fuse-and-Refine module takes as input Gaussian primitives produced by existing feed-forward models, either from the current frame or warped from previous reconstructions, and produces fused and refined primitives that improve scene reconstruction. The input primitives are first deposited into a high-resolution voxel grid with a splat-to-voxel transfer strategy, and the grid is then adaptively sparsified to construct a coarse-to-fine voxel hierarchy. A sparse voxel transformer is applied at the coarse level to capture global context, and new primitives are subsequently generated at the high-resolution level.
  • Figure 2: History-aware Streaming System. Our method generalizes to dynamic scenes at inference time despite being trained only on static scenes. The hybrid Splat-Voxel model first extracts input image features using a multi-view transformer, which outputs pixel-aligned Gaussian splats with associated features for each input image. The splat features are then deposited onto a coarse-to-fine voxel grid using the decoded positions, and a secondary sparse voxel transformer processes the grid features to output the final Gaussian parameters. To merge history, we compute the triangulated scene flow from the input views and perform keypoint-guided deformations. These deformed splats can be treated identically to the input-aligned splats and similarly deposited into the voxel grid to merge the previous state with the current state.
  • Figure 3: (Top) Per-frame reconstruction methods, which produce an independent scene reconstruction at each time step, are prone to flickering artifacts. (Bottom) In contrast, our history-aware novel view streaming model merges previous and current frame information, allowing us to better model occluded regions and improve temporal stability. Our method achieves state-of-the-art visual quality and temporal consistency, and runs at an interactive rate (15 fps with a 350ms delay$^*$) on two-view inputs of resolution $320 \times 240$.
  • Figure 4: Comparison of Multi-View and Voxel Transformers. Validation curves on the DL3DV dataset show that our 3D Sparse Voxel Transformer converges faster and achieves significantly better final performance when initialized with a pre-trained 2D Multi-View Transformer, compared to training with the 2D Multi-View Transformer alone. Our method also supports joint training of the Multi-View and Voxel Transformers, leading to further performance improvements as shown in Table \ref{tab:cmp_joint_twostage_train}.
  • Figure 5: Reconstruction Close-ups on Dynamic Scenes. We show zoomed-in novel view reconstructions of GS-LRM and our model on dynamic scene datasets. Our model better handles occlusion boundaries with sharper detail.
  • ...and 3 more figures
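
The history-merging step in the Figure 2 caption (deform the previous splats with keypoint-guided scene flow, then treat them like input-aligned splats) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's procedure: it assumes the 3D keypoints and their flow vectors are already triangulated, and it interpolates displacements with inverse-distance weights over the `k` nearest keypoints, which is a stand-in for whatever deformation model the authors use. The names `warp_splats` and `merge_history` are hypothetical.

```python
import numpy as np

def warp_splats(centers, keypoints, flows, k=2, eps=1e-8):
    """Move splat centers by a displacement interpolated from nearby keypoints.

    Assumption: inverse-distance weighting over the k nearest keypoints.
    """
    # Pairwise distances between splat centers (N) and keypoints (K).
    dists = np.linalg.norm(centers[:, None, :] - keypoints[None, :, :], axis=-1)
    k = min(k, keypoints.shape[0])
    nearest = np.argsort(dists, axis=1)[:, :k]              # (N, k) indices
    nearest_d = np.take_along_axis(dists, nearest, axis=1)  # (N, k) distances
    weights = 1.0 / (nearest_d + eps)
    weights /= weights.sum(axis=1, keepdims=True)           # rows sum to 1
    displacement = (weights[:, :, None] * flows[nearest]).sum(axis=1)
    return centers + displacement

def merge_history(prev_centers, curr_centers, keypoints, flows):
    """Warp the previous splats and stack them with the current ones."""
    warped = warp_splats(prev_centers, keypoints, flows)
    return np.concatenate([warped, curr_centers], axis=0)

# Toy input: a uniform flow shifts every previous splat by the same vector.
keypoints = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
flows = np.array([[0.5, 0.0, 0.0], [0.5, 0.0, 0.0]])
prev_centers = np.array([[1.0, 1.0, 1.0], [4.0, 0.0, 2.0]])
curr_centers = np.array([[0.0, 2.0, 0.0]])
merged = merge_history(prev_centers, curr_centers, keypoints, flows)
```

The merged set contains both the warped history and the current splats. In the pipeline the caption describes, it would then be deposited into the voxel grid so that overlapping content is fused rather than duplicated.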