S3R-GS: Streamlining the Pipeline for Large-Scale Street Scene Reconstruction

Guangting Zheng, Jiajun Deng, Xiaomeng Chu, Yu Yuan, Houqiang Li, Yanyong Zhang

TL;DR

This work tackles the scalability bottlenecks of large-scale street scene reconstruction with 3D Gaussian Splatting by introducing S3R-GS, a streamlined pipeline that eliminates unnecessary local-to-global transforms via instance-specific projection, reduces 3D-to-2D projections with temporal visibility, and renders distant content efficiently through adaptive LOD. It further enhances practicality by using BEV-semantic initialization and 2D box-based NeuralODE motion modeling to handle in-the-wild scenarios without 3D bounding boxes. The approach yields state-of-the-art rendering quality and substantial speedups across the Argoverse 2, KITTI, and nuScenes datasets, demonstrating strong scalability and applicability to real-world driving scenes. Overall, S3R-GS provides a practical, high-performance framework for dynamic street scene reconstruction with reduced annotation burden and improved robustness.

Abstract

Recently, 3D Gaussian Splatting (3DGS) has reshaped the field of photorealistic 3D reconstruction, achieving impressive rendering quality and speed. However, when applied to large-scale street scenes, existing methods suffer from rapidly escalating per-viewpoint reconstruction costs as scene size increases, leading to significant computational overhead. After revisiting the conventional pipeline, we identify three key factors accounting for this issue: unnecessary local-to-global transformations, excessive 3D-to-2D projections, and inefficient rendering of distant content. To address these challenges, we propose S3R-GS, a 3DGS framework that Streamlines the pipeline for large-scale Street Scene Reconstruction, effectively mitigating these limitations. Moreover, most existing street 3DGS methods rely on ground-truth 3D bounding boxes to separate dynamic and static components, but 3D bounding boxes are difficult to obtain, limiting real-world applicability. To address this, we propose an alternative solution with 2D boxes, which are easier to annotate or can be predicted by off-the-shelf vision foundation models. Together, these designs let S3R-GS adapt readily to large, in-the-wild scenarios. Extensive experiments demonstrate that S3R-GS enhances rendering quality and significantly accelerates reconstruction. Remarkably, when applied to videos from the challenging Argoverse 2 dataset, it achieves state-of-the-art PSNR and SSIM while reducing reconstruction time to below 50%, and even 20%, of that of competing methods.

Paper Structure

This paper contains 19 sections, 7 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison of Reconstruction Pipelines. (a) In the conventional pipeline, to render the view at timestep $t$, each object's Gaussians are sequentially transformed from their respective local coordinate system to the global coordinate system. Next, all Gaussians in the global coordinate system are projected onto the camera plane. Finally, the Gaussians within the view frustum are rendered using $\alpha$-blending, regardless of distance. (b) Our streamlined reconstruction pipeline first employs temporal visibility to identify the visible Gaussians at timestep $t$. To avoid unnecessary transformations, we use instance-specific projection matrices to directly project all Gaussians onto the camera plane. Subsequently, we apply the Adaptive LOD method to cull distant Gaussians whose 2D scales are smaller than the LOD threshold. We rasterize the remaining Gaussians with $\alpha$-blending. Finally, we update the temporal visibility using the Gaussian visibility obtained from the rendering process. (c) Our pipeline eliminates these redundancies, significantly accelerating the reconstruction process.
  • Figure 2: S3R-GS Framework. At the scene modeling stage, S3R-GS first leverages a BEV-semantic initialization augmentation method to supplement the points of tall structures in street scenes. Next, it uses tracked 2D boxes to distinguish between dynamic and static elements, integrating a NeuralODE model for precise, continuous object poses. These designs enable S3R-GS to generalize effectively to in-the-wild scenarios. During scene reconstruction, S3R-GS identifies visible Gaussians at each timestep and projects them onto the 2D image plane using instance-specific projection matrices. To efficiently render distant content, S3R-GS employs an adaptive strategy that (1) filters out distant 3D Gaussians with small projected 2D scales, (2) randomly culls Gaussians based on depth, (3) introduces noisy offsets to the remaining distant Gaussians, and (4) queries their colors from a distance-aware neural field. After that, the remaining Gaussians are rasterized using $\alpha$-blending. Our pipeline reduces computational redundancy, significantly lowering per-viewpoint reconstruction costs.
  • Figure 3: Qualitative Comparisons on the KITTI Dataset. StreetGaussian [yan2024street] fails to accurately model the details of vehicles and pedestrians. 4DGF [4dgf] struggles to model the details of large-scale scenes. In contrast, our method not only reconstructs large-scale scenes with full detail but also achieves high-quality reconstructions of vehicles and pedestrians.
  • Figure 4: Scalability Comparison of Per-viewpoint Reconstruction Times. As the scene scales up, the per-viewpoint reconstruction cost of 4DGF rises rapidly, whereas our method remains nearly constant.
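
The instance-specific projection and scale-based culling steps described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names, the yaw-plus-translation pose, the pinhole camera fixed at the global origin, and the `focal * scale / depth` size estimate are all assumptions made for illustration. The sketch only shows why folding an object's local-to-global pose into the projection matrix once per object gives the same pixels as transforming every point to global coordinates first, and how a pixel threshold could discard Gaussians that project too small.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]

def object_pose(yaw, tx, ty, tz):
    """Hypothetical 4x4 local-to-global pose: rotation about z, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def pinhole_projection(fx, fy, cx, cy):
    """3x4 pinhole matrix for a camera at the global origin looking along +z (assumed)."""
    return [[fx, 0.0, cx, 0.0],
            [0.0, fy, cy, 0.0],
            [0.0, 0.0, 1.0, 0.0]]

def to_pixel(h):
    """Perspective divide of a homogeneous image point (u*w, v*w, w)."""
    return (h[0] / h[2], h[1] / h[2])

def project_conventional(P, pose, local_points):
    """Two matrix products per point: local -> global, then global -> image."""
    return [to_pixel(matvec(P, matvec(pose, p + [1.0]))) for p in local_points]

def project_instance_specific(P, pose, local_points):
    """Fold the pose into the projection once per object, then one product per point."""
    P_obj = matmul(P, pose)  # 3x4 instance-specific projection matrix
    return [to_pixel(matvec(P_obj, p + [1.0])) for p in local_points]

def lod_keep(scale_3d, depth, focal, threshold_px):
    """Keep a Gaussian only if its approximate projected size reaches the threshold."""
    return focal * scale_3d / depth >= threshold_px

if __name__ == "__main__":
    P = pinhole_projection(1000.0, 1000.0, 640.0, 360.0)
    pose = object_pose(0.3, 2.0, 1.0, 20.0)      # object roughly 20 m ahead
    points = [[0.5, -0.2, 1.0], [-1.0, 0.4, 0.0]]  # Gaussian centers in object frame
    print(project_conventional(P, pose, points))
    print(project_instance_specific(P, pose, points))
    print(lod_keep(0.05, 10.0, 1000.0, 1.0), lod_keep(0.05, 200.0, 1000.0, 1.0))
```

Both projection functions return the same pixel coordinates up to floating-point rounding, because matrix multiplication is associative. The saving in this sketch comes from computing the combined matrix once per object rather than applying the pose to every Gaussian.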