SGD: Street View Synthesis with Gaussian Splatting and Diffusion Prior

Zhongrui Yu, Haoran Wang, Jinze Yang, Hanzhang Wang, Zeke Xie, Yunfeng Cai, Jiale Cao, Zhong Ji, Mingming Sun

TL;DR

The work tackles high-quality novel view synthesis for street scenes, where training views are sparse because they are captured from a moving vehicle. It introduces a two-stage approach: a diffusion model is first fine-tuned on driving data using adjacent frames and LiDAR depth, and this diffusion prior is then integrated into 3D Gaussian Splatting to regularize unseen views via pseudo-view guidance. The method demonstrates strong gains on KITTI and KITTI-360, particularly in sparse-view and novel-view settings, while preserving real-time rendering during inference. This approach advances autonomous driving simulation by enabling more versatile ego-vehicle viewpoint control with high rendering fidelity across broader viewpoints.

Abstract

Novel View Synthesis (NVS) for street scenes plays a critical role in autonomous driving simulation. The current mainstream technique for achieving it is neural rendering, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). Although thrilling progress has been made, current methods struggle, when handling street scenes, to maintain rendering quality at viewpoints that deviate significantly from the training viewpoints. This issue stems from the sparse training views captured by a fixed camera on a moving vehicle. To tackle this problem, we propose a novel approach that enhances the capacity of 3DGS by leveraging the prior from a Diffusion Model along with complementary multi-modal data. Specifically, we first fine-tune a Diffusion Model by adding images from adjacent frames as conditions, while exploiting depth data from LiDAR point clouds to supply additional spatial information. Then we apply the Diffusion Model to regularize the 3DGS at unseen views during training. Experimental results validate the effectiveness of our method compared with current state-of-the-art models, and demonstrate its advantage in rendering images from broader views.
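
The regularization step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not say how unseen views are sampled or how the loss terms are weighted, so the helper names (`sample_pseudo_pose`, `total_loss`), the perturbation ranges, the L1 loss, and the weight of 0.1 are hypothetical. The sketch perturbs a training camera pose to obtain a pseudo view and combines a loss on the training view with a weighted loss between the pseudo-view render and a guidance image, which in the paper's setting would come from the fine-tuned Diffusion Model.

```python
import numpy as np

def yaw_rotation(angle_rad):
    """3x3 rotation about the camera's y axis (assumed to be the vertical axis)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def sample_pseudo_pose(cam_to_world, rng, max_shift=1.0, max_yaw_deg=10.0):
    """Perturb a 4x4 camera-to-world training pose to get an unseen viewpoint.

    The lateral shift and yaw ranges are hypothetical values, not from the paper.
    """
    shift = rng.uniform(-max_shift, max_shift)
    yaw = np.deg2rad(rng.uniform(-max_yaw_deg, max_yaw_deg))
    local = np.eye(4)
    local[:3, :3] = yaw_rotation(yaw)
    local[0, 3] = shift  # offset along the camera's own x axis
    return cam_to_world @ local

def l1_loss(a, b):
    """Mean absolute difference between two images of equal shape."""
    return float(np.mean(np.abs(a - b)))

def total_loss(render_train, gt_train, render_pseudo, guide_pseudo, weight=0.1):
    """Training-view loss plus a weighted pseudo-view guidance loss.

    `guide_pseudo` stands in for the diffusion model's output at the pseudo
    view; `weight` is an assumed hyperparameter.
    """
    return l1_loss(render_train, gt_train) + weight * l1_loss(render_pseudo, guide_pseudo)
```

In a real pipeline the renders would come from a differentiable 3DGS rasterizer and the loss would be backpropagated to the Gaussian parameters; plain NumPy arrays are used here only to keep the structure of the objective visible.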

Paper Structure (29 sections, 8 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: (a) To enable free control of the ego-vehicle in autonomous driving simulation with novel view synthesis, we propose a method that leverages the prior from a Diffusion Model to provide 3DGS [3dgs] with augmented views during training. (b) Our method preserves photo-realistic rendering quality at viewpoints that are distant from the training views, while other approaches [barron2023zipnerf, 3dgs] produce severe artifacts.
  • Figure 2: Overview of our method. (a) Fine-tuning of the Diffusion Model [rombach2022sd] has two training stages. First, the U-Net is fine-tuned by injecting the patch-wise CLIP image features of the reference images, concatenated with the CLIP text features of a text prompt. Second, a ControlNet is trained with the depth of the target image as the control signal. (b) The fine-tuned Diffusion Model from (a) guides 3DGS training by providing regularization at pseudo views. For simplicity, the VAE encoder and decoder are omitted from the figure.
  • Figure 3: Qualitative comparisons of novel view rendering on the KITTI-360 [liao2022kitti] dataset. ZipNeRF [barron2023zipnerf] and 3DGS [3dgs] produce artifacts on the blue vehicle in (a) and blurry lane markings in (b), while our method preserves high rendering quality. Our method also fixes the hole in the road surface generated by 3DGS [3dgs].
  • Figure 4: Qualitative comparisons of novel view rendering on the KITTI [geiger2015kitti] dataset.
  • Figure 5: Qualitative ablation results on different conditions and different fine-tuning schemes of the Diffusion Model [rombach2022sd]. *The target view in (b) is a novel view, so its original image is left blank.
  • ...and 7 more figures