StreamGS: Online Generalizable Gaussian Splatting Reconstruction for Unposed Image Streams

Yang LI, Jinglu Wang, Lei Chu, Xiao Li, Shiu-hong Kao, Ying-Cong Chen, Yan Lu

TL;DR

StreamGS tackles online, unposed 3D Gaussian Splatting by progressively constructing a 3DGS from image streams in a fully feed-forward manner. It combines an initial coarse two-view reconstruction with content-adaptive refinement and cross-frame feature aggregation, followed by an adaptive density control that merges matched Gaussians across frames. The approach achieves quality on par with optimization-based methods while being up to 150 times faster and exhibits strong generalizability on out-of-domain scenes. This work enables real-time, pose-free reconstruction and rendering from continuous video streams, advancing interactive 3D vision and AR/VR applications.

Abstract

The advent of 3D Gaussian Splatting (3DGS) has advanced 3D scene reconstruction and novel view synthesis. With growing interest in interactive applications that need immediate feedback, real-time online 3DGS reconstruction is in high demand. However, no existing method yet meets this demand, owing to three main challenges: the absence of predetermined camera parameters, the need for generalizable 3DGS optimization, and the necessity of reducing redundancy. We propose StreamGS, an online generalizable 3DGS reconstruction method for unposed image streams, which progressively transforms image streams into 3D Gaussian streams by predicting and aggregating per-frame Gaussians. Our method overcomes the limitation of the initial point reconstruction \cite{dust3r} in tackling out-of-domain (OOD) issues by introducing a content-adaptive refinement. The refinement enhances cross-frame consistency by establishing reliable pixel correspondences between adjacent frames. These correspondences further aid in merging redundant Gaussians through cross-frame feature aggregation. The density of Gaussians is thereby reduced, which enables online reconstruction by significantly lowering computational and memory costs. Extensive experiments on diverse datasets demonstrate that StreamGS achieves quality on par with optimization-based approaches while running 150 times faster, and exhibits superior generalizability in handling OOD scenes.

Paper Structure

This paper contains 27 sections, 8 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: The proposed StreamGS efficiently transforms image streams into Gaussian streams by progressively reconstructing and aggregating per-frame 3D Gaussians. We show our reconstructed 3DGS (visualized as points) alongside estimated camera poses (in blue), and synthesized novel views.
  • Figure 2: Method overview. Our StreamGS progressively reconstructs and aggregates 3D Gaussians from the unposed image stream. Given the adjacent image pair $(\mathbf{I}^{t-1}, \mathbf{I}^{t})$, we first perform the initial reconstruction, which predicts pixel-wise 3D points with their features and coarse camera poses using a pretrained coarse predictor. Since the coarse predictions may suffer from OOD issues, we refine both the camera poses and 3D positions by establishing new point-wise correspondences. We aggregate cross-frame image and 3D features by warping and merging to reduce redundancy. Finally, we decode the aggregated features into Gaussian primitives.
  • Figure 3: Qualitative comparison on novel view synthesis. We show results on both the source domain, RE10K \cite{re10k}, and other domains: ScanNet \cite{scannet}, DL3DV \cite{dl3dv}, and MVImgNet \cite{mvimgnet}. All generalizable methods are trained only on RE10K and tested on the other datasets. StreamGS outperforms other methods in several challenging scenarios, especially on out-of-domain data.
  • Figure 4: Visual comparison of Gaussian reconstruction and novel view synthesis from image streams on the ScanNet \cite{scannet} dataset. Unlike MVSplat \cite{mvsplat}, which struggles with view aggregation, our results show significantly better visual quality on OOD data. Note that our StreamGS and MVSplat are both trained on RE10K \cite{re10k} data, and MVSplat needs predetermined cameras.
  • Figure 5: Reconstruction speed measured in frames processed per second (FPS). The x-axis is log-scaled for better visualization.
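
The merging step described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation: StreamGS aggregates learned features and decodes them into Gaussian primitives, whereas the code below swaps in a plain running mean as the merge rule. The names `GaussianStream` and `aggregate_frame`, and the `matches` input (one stream index per pixel, or `None` when no correspondence was found), are hypothetical and assumed to come from an earlier matching stage. The sketch only shows why correspondences reduce redundancy: a matched pixel updates an existing entry instead of adding a new one.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Tuple

Vec = Tuple[float, ...]


@dataclass
class GaussianStream:
    """Hypothetical container for the running, merged set of points."""
    positions: List[Vec] = field(default_factory=list)
    features: List[Vec] = field(default_factory=list)
    counts: List[int] = field(default_factory=list)  # observations merged per entry


def _running_mean(old: Vec, new: Vec, n: int) -> Vec:
    # Mean of n earlier observations (already averaged in `old`) plus one new one.
    return tuple((o * n + v) / (n + 1) for o, v in zip(old, new))


def aggregate_frame(
    stream: GaussianStream,
    positions: Sequence[Vec],
    features: Sequence[Vec],
    matches: Sequence[Optional[int]],
) -> List[int]:
    """Fold one frame's pixel-wise predictions into the stream.

    matches[i] is the stream index that pixel i corresponds to, or None.
    Returns the stream index assigned to every pixel of this frame.
    Assumption: averaging stands in for the learned aggregation in the paper.
    """
    assigned = []
    for pos, feat, match in zip(positions, features, matches):
        if match is None:
            # Unmatched pixel: newly observed content, so add a new entry.
            stream.positions.append(tuple(pos))
            stream.features.append(tuple(feat))
            stream.counts.append(1)
            assigned.append(len(stream.positions) - 1)
        else:
            # Matched pixel: merge into the existing entry instead of duplicating it.
            n = stream.counts[match]
            stream.positions[match] = _running_mean(stream.positions[match], pos, n)
            stream.features[match] = _running_mean(stream.features[match], feat, n)
            stream.counts[match] = n + 1
            assigned.append(match)
    return assigned


# Two toy frames with two pixels each; the first pixel of frame 1 matches
# the second pixel of frame 0, so only three entries are kept instead of four.
stream = GaussianStream()
ids0 = aggregate_frame(
    stream, [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)], [(1.0,), (3.0,)], [None, None]
)
ids1 = aggregate_frame(
    stream, [(1.0, 0.0, 3.0), (2.0, 0.0, 1.0)], [(5.0,), (7.0,)], [ids0[1], None]
)
print(len(stream.positions), stream.counts)
```

In this toy run the stream keeps three entries for four predicted pixels, and the matched entry holds the average of its two observations. A real system would merge Gaussian parameters or latent features rather than raw tuples, and would obtain `matches` from the refined pixel correspondences between adjacent frames.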