Dynamic 3D Gaussian Fields for Urban Areas

Tobias Fischer, Jonas Kulhanek, Samuel Rota Bulò, Lorenzo Porzi, Marc Pollefeys, Peter Kontschieder

TL;DR

4DGF addresses dynamic novel-view synthesis in large-scale urban environments with a hybrid representation: 3D Gaussians serve as a geometry scaffold, neural fields provide compact appearance and transient geometry, and a scene graph models global dynamics and local non-rigid motion. The method scales to tens of thousands of images, supports heterogeneous data, and delivers state-of-the-art view synthesis at interactive rendering speeds across multiple urban benchmarks. It improves PSNR by over 3 dB and renders more than 200x faster than previous approaches (up to 700x in some cases). These properties make it relevant to urban digital twins, AR/VR, and robotics simulation, which need realistic, scalable, and fast dynamic scene reconstruction and rendering.

Abstract

We present an efficient neural 3D scene representation for novel-view synthesis (NVS) in large-scale, dynamic urban areas. Existing works are not well suited for applications like mixed-reality or closed-loop simulation due to their limited visual quality and non-interactive rendering speeds. Recently, rasterization-based approaches have achieved high-quality NVS at impressive speeds. However, these methods are limited to small-scale, homogeneous data, i.e. they cannot handle severe appearance and geometry variations due to weather, season, and lighting and do not scale to larger, dynamic areas with thousands of images. We propose 4DGF, a neural scene representation that scales to large-scale dynamic urban areas, handles heterogeneous input data, and substantially improves rendering speeds. We use 3D Gaussians as an efficient geometry scaffold while relying on neural fields as a compact and flexible appearance model. We integrate scene dynamics via a scene graph at global scale while modeling articulated motions on a local level via deformations. This decomposed approach enables flexible scene composition suitable for real-world applications. In experiments, we surpass the state-of-the-art by over 3 dB in PSNR and more than 200 times in rendering speed.
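To put the reported margins in perspective, the following minimal sketch uses the standard PSNR definition to show what a 3 dB gain means in terms of mean squared error. The helper names (`psnr`, `mse_ratio_for_gain`) and the example error values are illustrative assumptions, not taken from the paper.

```python
import math


def psnr(mse: float, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error.

    `peak` is the maximum pixel value (1.0 for images scaled to [0, 1]).
    """
    if mse <= 0:
        raise ValueError("mse must be positive")
    return 10.0 * math.log10(peak ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which the MSE shrinks for a PSNR gain of `gain_db` dB."""
    return 10.0 ** (gain_db / 10.0)


# Hypothetical error values, chosen only to illustrate the relationship.
baseline_mse = 0.01
improved_mse = baseline_mse / 2.0

gain = psnr(improved_mse) - psnr(baseline_mse)
print(f"Halving the MSE changes PSNR by {gain:.2f} dB")
print(f"A 3 dB gain divides the MSE by {mse_ratio_for_gain(3.0):.2f}")
```

Because PSNR is logarithmic, a gain of roughly 3 dB corresponds to about half the mean squared error, so the reported improvement is substantial rather than incremental.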

Paper Structure (17 sections, 16 equations, 10 figures, 9 tables)

Figures (10)

  • Figure 1: Summary. Given a set of heterogeneous input sequences that capture a common geographic area in varying environmental conditions (e.g. weather, season, and lighting) with distinct dynamic objects (e.g. vehicles, pedestrians, and cyclists), we optimize a single dynamic scene representation that permits rendering of arbitrary viewpoints and scene configurations at interactive speeds.
  • Figure 2: Overview. To render an image of sequence $s$ at time $t$, we first evaluate the scene graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, which stores latent codes $\omega$ at its nodes $\mathcal{V}$ and coordinate transformations $[\mathbf{R} | \mathbf{t}]$ at its edges $\mathcal{E}$, i.e. the configuration of the dynamic objects and the overall scene. We then use the scene configuration to determine the active sets of 3D Gaussians $G$. The 3D Gaussians $G$ and the latent codes $\omega$ serve as conditioning signals to the neural fields $\phi$ and $\psi$, which output, for each 3D Gaussian $\mathfrak g_k \in G$, an appearance-conditioned color $\mathbf{c}^{s, t}_k$, an opacity correction term $\nu^{s, t}_k$ for static Gaussians modeling transient geometry, and a dynamic deformation $\delta^t_k$ for non-rigid dynamic 3D Gaussians modeling e.g. pedestrians. Finally, the retrieved information is used to compose a set of 3D Gaussians that represent the dynamic scene at $(s, t)$, from which we render the image.
  • Figure 3: Qualitative results on Argoverse 2 [wilson2023argoverse]. Our method produces significantly sharper renderings in both foreground dynamic and static background regions, with far fewer artifacts, e.g. in areas with transient geometry such as tree branches (left). Best viewed digitally.
  • Figure 4: Qualitative results on Waymo Open [sun2020scalability]. We show a sequence of evaluation views synthesized by our model (top-left to bottom-right). As the woman (marked with a red box) gets out of the car and walks away, we successfully model her articulated motion and changing body poses.
  • Figure 5: Qualitative comparison of adaptive density control (ADC) variants. We show a close-up of a car and observe over-smoothing with vanilla ADC, while our modified ADC leads to a sharper rendering result.
  • ...and 5 more figures
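
The composition step described in the Figure 2 caption can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the authors' implementation: the `Node` class, `compose_scene`, and `deformation_stub` are hypothetical names, only Gaussian means are modeled, the neural fields are replaced by a stub that returns a zero offset, and the deformation is assumed to act in the object's local frame before the rigid transform. Color and opacity correction are omitted.

```python
from dataclasses import dataclass, field

Vec3 = tuple[float, float, float]
Mat3 = tuple[Vec3, Vec3, Vec3]


@dataclass
class Node:
    """Hypothetical scene-graph node: a latent code plus Gaussian means in a local frame."""
    name: str
    latent: tuple[float, ...]
    means: list[Vec3]
    # Edge to the world frame per timestep: t -> (R, t_vec). A missing key means
    # the object is not part of the scene at that time.
    poses: dict[int, tuple[Mat3, Vec3]] = field(default_factory=dict)


def transform(R: Mat3, tvec: Vec3, p: Vec3) -> Vec3:
    """Apply the rigid transform [R | t] to a point p."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + tvec[i] for i in range(3))


def deformation_stub(mean: Vec3, latent: tuple[float, ...], t: int) -> Vec3:
    """Stand-in for the deformation output of a neural field; always a zero offset."""
    return (0.0, 0.0, 0.0)


def compose_scene(static: Node, dynamic: list[Node], t: int) -> list[Vec3]:
    """Collect the Gaussian means that are active at time t, in world coordinates."""
    world = list(static.means)  # the static node is assumed to live in the world frame
    for node in dynamic:
        if t not in node.poses:
            continue  # object is not active at time t
        R, tvec = node.poses[t]
        for m in node.means:
            d = deformation_stub(m, node.latent, t)
            local = (m[0] + d[0], m[1] + d[1], m[2] + d[2])
            world.append(transform(R, tvec, local))
    return world


# Usage with made-up values: one background Gaussian and one car observed only at t=0.
I3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
background = Node("static", (0.0,), [(0.0, 0.0, 0.0)])
car = Node("car", (0.5,), [(1.0, 0.0, 0.0)], poses={0: (I3, (10.0, 0.0, 0.0))})

scene_t0 = compose_scene(background, [car], t=0)
scene_t1 = compose_scene(background, [car], t=1)
print(len(scene_t0), len(scene_t1))
```

Keeping object Gaussians in local frames and attaching poses to graph edges is what makes the representation composable: moving, removing, or inserting an object only changes the graph, not the Gaussians themselves.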