EVolSplat: Efficient Volume-based Gaussian Splatting for Urban View Synthesis
Sheng Miao, Jiaxin Huang, Dongfeng Bai, Xu Yan, Hongyu Zhou, Yue Wang, Bingbing Liu, Andreas Geiger, Yiyi Liao
TL;DR
EVolSplat tackles the slow, per-scene optimization required for urban novel view synthesis (NVS) by introducing a feed-forward, volume-based Gaussian splatting approach that operates in a unified global volume. It decouples foreground geometry and appearance from the distant background, which is handled by a generalizable hemisphere model. For the foreground, a sparse 3D CNN predicts Gaussian primitives and an occlusion-aware image-based rendering module recovers high-frequency details, while a global point cloud initialized from depth priors provides robust geometric priors. The method employs recursive offset refinement for Gaussian centers, an entropy-regularized training loss, and a background model to enable real-time rendering with competitive photorealism on KITTI-360 and Waymo, often outperforming both feed-forward and some optimization-based baselines. This work advances practical urban NVS by delivering fast, memory-efficient, generalizable reconstructions suitable for autonomous driving and related applications, while acknowledging limitations in dynamic scenes and distant background fidelity.
Abstract
Novel view synthesis of urban scenes is essential for autonomous driving-related applications. Existing NeRF and 3DGS-based methods show promising results in achieving photorealistic renderings but require slow, per-scene optimization. We introduce EVolSplat, an efficient 3D Gaussian Splatting model for urban scenes that works in a feed-forward manner. Unlike existing feed-forward, pixel-aligned 3DGS methods, which often suffer from issues like multi-view inconsistencies and duplicated content, our approach predicts 3D Gaussians across multiple frames within a unified volume using a 3D convolutional network. This is achieved by initializing 3D Gaussians with noisy depth predictions, and then refining their geometric properties in 3D space and predicting color based on 2D textures. Our model also handles distant views and the sky with a flexible hemisphere background model. This enables us to perform fast, feed-forward reconstruction while achieving real-time rendering. Experimental evaluations on the KITTI-360 and Waymo datasets show that our method achieves state-of-the-art quality compared to existing feed-forward 3DGS- and NeRF-based methods.
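The initialization step described above (lifting per-frame depth predictions into one shared 3D volume) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `unproject_depth` and `voxelize`, the pinhole intrinsics `K`, the camera-to-world pose convention, and the fixed `voxel_size` are all assumptions made for illustration only.

```python
import numpy as np

def unproject_depth(depth, K, cam_to_world):
    """Back-project a z-depth map (H, W) into world-space points (H*W, 3).

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4
    camera-to-world pose (hypothetical conventions, not from the paper).
    """
    h, w = depth.shape
    # Pixel centers: u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(w) + 0.5, np.arange(h) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Camera-frame rays with unit z, scaled by the per-pixel depth.
    cam_pts = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    # Rigid transform into the shared world frame.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return cam_pts @ R.T + t

def voxelize(points, voxel_size):
    """Return the unique integer indices (N, 3) of occupied voxels."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    return np.unique(idx, axis=0)

# Usage sketch: accumulate several frames into one global point cloud,
# then quantize it so a 3D network could operate on a single volume.
# frames = [(depth_0, K_0, pose_0), (depth_1, K_1, pose_1), ...]
# cloud = np.concatenate([unproject_depth(d, K, T) for d, K, T in frames])
# occupied = voxelize(cloud, voxel_size=0.1)
```

Because all frames are merged before quantization, overlapping observations of the same surface fall into the same voxels, which is one plausible way a unified volume can avoid the duplicated content that pixel-aligned predictions produce.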
