Multi-Level Neural Scene Graphs for Dynamic Urban Environments

Tobias Fischer, Lorenzo Porzi, Samuel Rota Bulò, Marc Pollefeys, Peter Kontschieder

TL;DR

This work tackles radiance-field reconstruction for large-scale, dynamic urban environments captured from moving vehicles under varying conditions. It introduces a multi-level neural scene graph that separates static structure, per-sequence variations, and dynamic objects, with two radiance fields φ and ψ conditioned by per-node latents to capture appearance, geometry, and dynamics across time. An efficient composite ray sampling and rendering scheme, along with continuous-time pose modeling and hierarchical pose optimization, enables scalable training and accurate novel view synthesis. A new Argoverse 2-based benchmark is presented, and the method achieves state-of-the-art results on both this benchmark and standard datasets like KITTI and VKITTI2, demonstrating strong performance in dynamic urban settings and potential for city-scale digital twins.

Abstract

We estimate the radiance field of large-scale dynamic areas from multiple vehicle captures under varying environmental conditions. Previous works in this domain are either restricted to static environments, do not scale to more than a single short video, or struggle to separately represent dynamic object instances. To this end, we present a novel, decomposable radiance field approach for dynamic urban environments. We propose a multi-level neural scene graph representation that scales to thousands of images from dozens of sequences with hundreds of fast-moving objects. To enable efficient training and rendering of our representation, we develop a fast composite ray sampling and rendering scheme. To test our approach in urban driving scenarios, we introduce a new benchmark for novel view synthesis. We show that our approach outperforms prior art by a significant margin on both established benchmarks and our proposed one while being faster in training and rendering.

Paper Structure

This paper contains 14 sections, 14 equations, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Overview. We represent sequences captured from moving vehicles in a shared geographic area with a multi-level scene graph. Each dynamic object $v_o$ is associated with a sequence node $v_s^t$ and time $t$. The sequence nodes are registered in a common world frame at the root node $v_r$ through the vehicle poses $\mathbf{P}_s^t$, while the dynamic objects are localized w.r.t. the sequence node with pose $\xi_o^t$. Each camera $c$ is associated with an ego-vehicle position, i.e. node $v_s^t$, through the extrinsic calibration $\mathbf{T}_c$. The sequence and object nodes hold latent codes $\omega$ that condition the radiance field, synthesizing novel views in various conditions with distinct dynamic objects.
  • Figure 2: Ego-vehicle trajectories of our benchmark. We show the residential (left) and the downtown (right) areas, with trajectories superimposed on 2D maps obtained from OpenStreetMap [osm].
  • Figure 3: Sequence alignment visualization. The initial GPS-based alignment is imprecise, as evidenced by the duplicated structures in the overlaid LiDAR point clouds (left). After our ICP alignment, the area is well reconstructed (middle) according to its real geometry (right, from Argoverse 2 [wilson2023argoverse]).
  • Figure 4: Modifying car appearance with scene appearance. We exchange $\omega_s^t$ for different car instances. The car's appearance in a rendered, car-centric view (top) changes according to the environmental conditions visible in the sequence $s$ (bottom).
  • Figure 5: Composite ray sampling. If a ray intersects with an object $v_o$, we sample from both the proposal network $\sigma_\text{prop}$ and the radiance field $\psi$; otherwise, we sample from $\sigma_\text{prop}$ only. We condition each with the latents $\omega$ of the respective nodes. The PDF is a mixture of the densities of all nodes that the ray intersects. The transmittance $U$ drops at the first surface intersection (the tree), where further samples concentrate.
  • ...and 8 more figures
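
The pose chain described in the Figure 1 caption can be illustrated with a small sketch. It is not the authors' implementation: it assumes that each pose is a 4x4 rigid transform mapping child coordinates to parent coordinates (object to sequence node, sequence node to world, camera to ego vehicle), so that world-frame poses follow by matrix composition. The helper names (`se3`, `compose`, `transform_point`) and all numeric values are hypothetical.

```python
import math


def se3(yaw, tx, ty, tz):
    """Build a 4x4 rigid transform: rotation about z by `yaw` (radians), then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def compose(a, b):
    """Matrix product a @ b for 4x4 nested lists (apply b first, then a)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


def transform_point(m, p):
    """Apply the 4x4 transform m to a 3D point p."""
    x, y, z = p
    return tuple(m[i][0] * x + m[i][1] * y + m[i][2] * z + m[i][3] for i in range(3))


# Hypothetical values, chosen only for illustration.
P_s_t = se3(math.pi / 2, 10.0, 0.0, 0.0)  # ego-vehicle pose P_s^t in the world frame (root v_r)
xi_o_t = se3(0.0, 5.0, 0.0, 0.0)          # object pose xi_o^t relative to sequence node v_s^t
T_c = se3(0.0, 1.0, 0.0, 1.5)             # camera extrinsics T_c relative to the ego vehicle

# Assumed convention: child-to-parent transforms, composed from the root downwards.
object_to_world = compose(P_s_t, xi_o_t)
camera_to_world = compose(P_s_t, T_c)

object_origin_world = transform_point(object_to_world, (0.0, 0.0, 0.0))
camera_origin_world = transform_point(camera_to_world, (0.0, 0.0, 0.0))
```

Under the assumed convention, an object 5 m ahead of an ego vehicle that sits at x = 10 m and is rotated by 90° about the vertical axis ends up at roughly (10, 5, 0) in the world frame. If the poses were instead stored as parent-to-child transforms, the composition order would be reversed and each matrix inverted.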