VEGS: View Extrapolation of Urban Scenes in 3D Gaussian Splatting using Learned Priors

Sungwon Hwang, Min-Jung Kim, Taewoong Kang, Jayeon Kang, Jaegul Choo

TL;DR

This work tackles Extrapolated View Synthesis (EVS) for urban scenes captured with forward-facing driving cameras. It introduces VEGS, a method that combines dense LiDAR-based initialization, 3D Gaussian Splatting with dynamic objects, and two learned priors (a surface normal estimator and a fine-tuned diffusion model) to improve rendering quality on views outside the training camera distribution. A covariance regularization guided by surface normals prevents cavities from appearing at new viewing angles, and diffusion score distillation injects scene-specific visual priors into renderings from augmented views. Together these yield improved EVS metrics and coherent scene editing. Results on the KITTI and KITTI-360 datasets demonstrate robust EVS performance, highlighting the method's potential for real-time, view-consistent urban scene rendering in applications such as autonomous driving and AR/VR visualization.

Abstract

Neural rendering-based urban scene reconstruction methods commonly rely on images collected from driving vehicles with cameras facing and moving forward. Although these methods can successfully synthesize views similar to the training camera trajectory, directing the novel view outside the training camera distribution does not guarantee on-par performance. In this paper, we tackle the Extrapolated View Synthesis (EVS) problem by evaluating the reconstructions on views such as looking left, right, or downwards with respect to the training camera distribution. To improve rendering quality for EVS, we initialize our model by constructing a dense LiDAR map, and propose to leverage prior scene knowledge from a surface normal estimator and a large-scale diffusion model. Qualitative and quantitative comparisons demonstrate the effectiveness of our methods on EVS. To the best of our knowledge, we are the first to address the EVS problem in urban scene reconstruction. Link to our project page: https://vegs3d.github.io/.
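
To make the notion of an extrapolated view concrete, the following is a minimal sketch of how left, right, and downward evaluation cameras could be derived from a forward-facing training camera. It is an illustration only, not the authors' evaluation code: the function names, the 30-degree angles, and the OpenCV-style camera convention (x right, y down, z forward) are all assumptions that do not come from the paper.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def yaw_right(deg):
    """Rotation about the camera y-axis; positive angles turn the view right
    (assumes x right, y down, z forward)."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def pitch_down(deg):
    """Rotation about the camera x-axis; positive angles tilt the view down
    (assumes y points down)."""
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]

def extrapolated_rotations(r_c2w, yaw_deg=30.0, pitch_deg=30.0):
    """Return camera-to-world rotations for hypothetical left, right, and
    downward views. The camera position is left unchanged; only the viewing
    direction is rotated in the camera's local frame."""
    return {
        "left": matmul3(r_c2w, yaw_right(-yaw_deg)),
        "right": matmul3(r_c2w, yaw_right(yaw_deg)),
        "down": matmul3(r_c2w, pitch_down(pitch_deg)),
    }

def forward_axis(r_c2w):
    """World-space viewing direction: the third column of the rotation."""
    return [r_c2w[i][2] for i in range(3)]

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
views = extrapolated_rotations(IDENTITY)
for name, rot in views.items():
    print(name, [round(v, 3) for v in forward_axis(rot)])
```

Rotating in the camera's local frame (right-multiplying the camera-to-world rotation) keeps the camera at the same position along the driving trajectory while pointing it away from the directions seen during training.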

Paper Structure (41 sections, 20 equations, 11 figures, 4 tables)

This paper contains 41 sections, 20 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: (a) Illustration of the Extrapolated View Synthesis (EVS) problem in urban scenes reconstructed with forward-facing cameras. In contrast to conventional test cameras, which are similar to the training camera poses, we evaluate view synthesis on cameras distant from the training camera distribution. (b) Qualitative comparison on EVS to baselines.
  • Figure 2: Our dynamic scene model combines camera, LiDAR, and bounding box estimations with 3D Gaussian Splatting [kerbl20233d]. Aside from reconstruction loss $\mathcal{L}_c$, we additionally supervise Gaussian covariances with surface normal priors for improved extrapolated view synthesis (EVS). We also make use of a large-scale diffusion model to distill its knowledge directly to renderings of view-augmented cameras.
  • Figure 3: (a) Working mechanism of $\mathcal{L}_{\text{cov}}=\mathcal{L}_{\text{axis}} + \mathcal{L}_{\text{scale}}$. $\mathcal{L}_{\text{axis}}$ aligns a covariance axis with the surface normal vector, and $\mathcal{L}_{\text{scale}}$ minimizes the scale along the covariance axis aligned with the surface normal. Together they prevent the Gaussian covariance from minimally satisfying a pixel view frustum, which causes cavities when viewed from another angle. (b) Visualizing $\mathcal{L}_{\text{axis}}$ for different alignments between the normal and the covariance axes. $\mathcal{L}_{\text{axis}}$ is minimized when an axis aligns with the normal. See supplements for detailed derivation.
  • Figure 4: Qualitative comparison on KITTI-360 [Liao2022PAMI] for extrapolated view synthesis. EVS-D and EVS-LR refer to extrapolated views facing downwards and left/right, respectively. Test Cam. refers to the conventional test camera sampled from a set of forward-facing cameras. For reference, we also show the training images, taken from another location, that maximally cover the view space of each EVS camera. Ours outperforms the baselines in terms of geometry and visual sanity.
  • Figure 5: Qualitative comparison on the KITTI [Geiger2012CVPR] dataset from a conventional test camera (top) and EVS-D (bottom).
  • ...and 6 more figures
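
The Figure 3 caption describes $\mathcal{L}_{\text{cov}}=\mathcal{L}_{\text{axis}} + \mathcal{L}_{\text{scale}}$ only in words, and the exact formulas are deferred to the paper's supplements. The sketch below is therefore a stand-in under stated assumptions, not the authors' formulation: it takes the axis term to be one minus the absolute cosine between the surface normal and the best-aligned covariance axis, and the scale term to be the Gaussian's scale along that same axis. The function and variable names are hypothetical.

```python
def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def covariance_losses(axes, scales, normal):
    """Assumed stand-in for the covariance regularizer on a single Gaussian.

    axes   : three unit vectors, the principal axes of the covariance
    scales : three positive scales, one per axis
    normal : unit surface normal estimated at the Gaussian's location
    Returns (l_axis, l_scale, l_cov).
    """
    # Absolute cosine, because an axis and its negation describe the same ellipsoid.
    alignment = [abs(dot(axis, normal)) for axis in axes]
    # Index of the axis that is currently closest to the normal.
    k = max(range(3), key=lambda i: alignment[i])
    # Assumed axis term: zero exactly when axis k is parallel to the normal.
    l_axis = 1.0 - alignment[k]
    # Assumed scale term: penalize thickness along the normal-aligned axis,
    # which pushes the Gaussian toward a flat disc lying on the surface.
    l_scale = scales[k]
    return l_axis, l_scale, l_axis + l_scale

axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A flat Gaussian whose thin axis already matches the normal: small loss.
aligned = covariance_losses(axes, [1.0, 1.0, 0.01], [0.0, 0.0, 1.0])

# The same Gaussian under a normal tilted 30 degrees away: larger axis term.
tilted = covariance_losses(axes, [1.0, 1.0, 0.01], [0.0, 0.5, 0.8660254037844386])

print("aligned:", aligned)
print("tilted: ", tilted)
```

Under these assumptions, a Gaussian that is thin along the surface normal incurs almost no penalty, while one that is elongated toward the training camera, the configuration the caption associates with cavities, is penalized by both terms.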