EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis

Sheng Miao, Jiaxin Huang, Dongfeng Bai, Xu Yan, Hongyu Zhou, Yue Wang, Bingbing Liu, Andreas Geiger, Yiyi Liao

TL;DR

EVolSplat tackles the problem of slow, per-scene optimization in urban novel view synthesis by introducing a feed-forward, volume-based Gaussian splatting approach that operates in a unified global volume. It decouples foreground geometry and appearance from distant background via a generalizable hemisphere model, using a sparse 3D CNN to predict Gaussian primitives and an occlusion-aware image-based rendering module to recover high-frequency details; a depth-prior initialized global point cloud provides robust geometric priors. The method employs a recursive offset refinement for Gaussian centers, an entropy-regularized training loss, and a background model to enable real-time rendering with competitive photorealism on KITTI-360 and Waymo, often outperforming both feed-forward and some optimization-based baselines. This work advances practical urban NVS by delivering fast, memory-efficient, generalizable reconstructions suitable for autonomous driving and related applications, while acknowledging limitations in dynamic scenes and distant background fidelity.

Abstract

Novel view synthesis of urban scenes is essential for autonomous driving-related applications. Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner. Unlike existing feed-forward, pixel-aligned 3DGS methods, which often suffer from issues like multi-view inconsistencies and duplicated content, our approach predicts 3D Gaussians across multiple frames within a unified volume using a 3D convolutional network. This is achieved by initializing 3D Gaussians with noisy depth predictions, and then refining their geometric properties in 3D space and predicting color based on 2D textures. Our model also handles distant views and the sky with a flexible hemisphere background model. This enables us to perform fast, feed-forward reconstruction while achieving real-time rendering. Experimental evaluations on the KITTI-360 and Waymo datasets show that our method achieves state-of-the-art quality compared to existing feed-forward 3DGS- and NeRF-based methods.

Paper Structure

This paper contains 30 sections, 19 equations, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Illustration. Left: Existing feed-forward 3DGS methods (e.g., MVSplat) predict per-pixel Gaussians with local cost volumes. When accumulating Gaussians from multiple local volumes in global coordinates, we observe inconsistencies in the accumulated Gaussians (e.g., the car in the figure), leading to ghost artifacts in the rendering. In contrast, EVolSplat predicts 3DGS using a global volume, improving consistency and rendering quality. Right: Our method achieves real-time rendering while maintaining high NVS rendering quality on novel street scenes with lower memory consumption. The circle size indicates memory consumption during inference.
  • Figure 2: Method. EVolSplat learns to predict 3D Gaussians of urban scenes in a feed-forward manner. Given a set of posed images $\{I_n\}_{n=1}^N$, we first leverage off-the-shelf metric depth estimators to provide depth estimations $\{D_n\}_{n=1}^N$. The depth maps are unprojected and accumulated into a global point cloud $\mathbf{P}$, which is fed into a sparse 3D CNN for extracting a feature volume $\mathbf{F}$. We leverage the 3D context of $\mathbf{F}$ to predict the geometry attributes of 3D Gaussians, including their center $\boldsymbol{\mu}$, opacity $\boldsymbol{\alpha}$, and covariance $\boldsymbol{\Sigma}$. Furthermore, we project the 3D Gaussians to the nearest reference views to retrieve 2D context, including color windows $\{\mathbf{c}_k\}_{k=1}^K$ and visibility maps $\{\mathbf{v}_k\}_{k=1}^K$, to decode their color. To model far regions, we propose a generalizable hemisphere Gaussian model, where the geometry is fixed and the color is predicted in a similar manner as the foreground volume.
  • Figure 3: Occlusion Illustration. One Gaussian in 3D space may retrieve inaccurate color information from 2D reference images due to occlusions. EVolSplat uses geometric priors to reduce the impact of colors from views in which the Gaussian is not visible, which improves rendering quality.
  • Figure 4: Qualitative Comparison with generalizable baselines on the KITTI-360 dataset.
  • Figure 6: Comparison with Optimization-based Methods. We show PSNR and LPIPS on the test set at different training steps. Compared with test-time optimization baselines, our method with generalizable priors converges faster and achieves better LPIPS.
  • ...and 6 more figures
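
The Figure 2 caption describes unprojecting per-view depth maps and accumulating them into a global point cloud $\mathbf{P}$. A minimal NumPy sketch of that step follows, assuming a pinhole camera with a 3x3 intrinsic matrix and a 4x4 camera-to-world pose per view. The function names `unproject_depth` and `accumulate_point_cloud` and the optional `max_depth` cutoff are hypothetical and chosen for illustration; they are not taken from the paper or its code.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) metric depth map to world-space points of shape (H*W, 3).

    Assumes a pinhole camera: K is the 3x3 intrinsic matrix and
    cam_to_world is a 4x4 camera-to-world pose (both are assumptions).
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)
    pixels = pixels.reshape(-1, 3).astype(np.float64)
    # Back-project each pixel: X_cam = depth * K^{-1} [u, v, 1]^T.
    rays = pixels @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)
    # Rotate and translate camera-space points into world coordinates.
    return pts_cam @ cam_to_world[:3, :3].T + cam_to_world[:3, 3]


def accumulate_point_cloud(depths, intrinsics, poses, max_depth=None):
    """Concatenate the unprojected points of all views into one (M, 3) cloud.

    max_depth is a hypothetical cutoff that drops far-away pixels; it only
    illustrates one way distant regions could be excluded from the volume.
    """
    clouds = []
    for depth, K, pose in zip(depths, intrinsics, poses):
        pts = unproject_depth(depth, K, pose)
        flat = depth.reshape(-1)
        valid = flat > 0  # discard pixels without a usable depth
        if max_depth is not None:
            valid &= flat < max_depth
        clouds.append(pts[valid])
    return np.concatenate(clouds, axis=0)
```

In this sketch the resulting array plays the role of $\mathbf{P}$; feeding it to a sparse 3D CNN would additionally require voxelization, which is not shown here.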