VDNeRF: Vision-only Dynamic Neural Radiance Field for Urban Scenes

Zhengyu Zou, Jingfeng Li, Hao Li, Xiaolei Hou, Jinwen Hu, Jingkun Chen, Lechao Cheng, Dingwen Zhang

TL;DR

VDNeRF addresses the challenge of reconstructing dynamic urban scenes without camera poses by jointly learning camera trajectories and a spatiotemporal scene representation through two NeRFs for static and dynamic content. It introduces a flow-informed dynamic NeRF and a shadow-based fusion mechanism, all within a progressive sub-scene training framework that enables self-supervised static-dynamic decomposition. The approach achieves state-of-the-art results on NOTR and Pandaset for both novel view synthesis and pose estimation, demonstrating robust perception in pose-free urban scenarios. This work advances practical vision-based perception for autonomous driving and robotics by removing reliance on external pose data while delivering high-quality dynamic reconstructions.

Abstract

Neural Radiance Fields (NeRFs) implicitly model continuous three-dimensional scenes using a set of images with known camera poses, enabling the rendering of photorealistic novel views. However, existing NeRF-based methods encounter challenges in applications such as autonomous driving and robotic perception, primarily due to the difficulty of capturing accurate camera poses and limitations in handling large-scale dynamic environments. To address these issues, we propose Vision-only Dynamic NeRF (VDNeRF), a method that accurately recovers camera trajectories and learns spatiotemporal representations for dynamic urban scenes without requiring additional camera pose information or expensive sensor data. VDNeRF employs two separate NeRF models to jointly reconstruct the scene. The static NeRF model optimizes camera poses and static background, while the dynamic NeRF model incorporates the 3D scene flow to ensure accurate and consistent reconstruction of dynamic objects. To address the ambiguity between camera motion and independent object motion, we design an effective and powerful training framework to achieve robust camera pose estimation and self-supervised decomposition of static and dynamic elements in a scene. Extensive evaluations on mainstream urban driving datasets demonstrate that VDNeRF surpasses state-of-the-art NeRF-based pose-free methods in both camera pose estimation and dynamic novel view synthesis.

Paper Structure

This paper contains 17 sections, 13 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Novel view synthesis for dynamic urban scenes. Given only a set of images without known camera poses, VDNeRF can recover the camera trajectory and reconstruct the spatiotemporal scene. Compared with existing methods, VDNeRF demonstrates more robust camera pose estimation, higher-quality dynamic novel view synthesis, and more precise static-dynamic decomposition. We zoom in on the dynamic objects within the red box in the upper right corner to provide a more detailed comparison.
  • Figure 2: VDNeRF Overview. VDNeRF is composed of a static NeRF $F_{\Theta}^{s}$ and a dynamic NeRF $F_{\Theta}^{d}$ with a flow field. Each model is equipped with a hash grid and a base MLP. The static NeRF $F_{\Theta}^{s}$ takes the 3D location of the sampling points $\mathbf{x}=(x, y, z)$ as input, while the dynamic NeRF $F_{\Theta}^{d}$ additionally takes a timestep $t$ as input. The static NeRF $F_{\Theta}^{s}$ queries the feature $\mathcal{F}_s$ and density $\sigma_s$ from its base MLP. The dynamic NeRF $F_{\Theta}^{d}$ leverages the 3D forward flow $\mathbf{v}_f$ and backward flow $\mathbf{v}_b$ predicted by the flow field (the corresponding 2D optical flow is visualized in the figure) to aggregate the dynamic feature $\hat{\mathcal{F}_d^t}$ and dynamic density $\hat{\sigma_d^t}$ across multiple timesteps, and queries the shadow weight $\rho \in [0,1]$ from the shadow MLP. These features, along with the viewing direction $\mathbf{d}=(\theta, \phi)$, are passed to the color MLP to obtain the color $\mathbf{c}=(r, g, b)$. Volume rendering is applied to all sampling points along the camera ray to render the static and dynamic pixel colors $\hat{C}_s(\mathbf{r})$ and $\hat{C}_d(\mathbf{r})$. Finally, the static and dynamic renderings are blended into a single spatiotemporal scene representation using the shadow weights. During training, VDNeRF partitions the scene into multiple overlapping sub-scenes. For each sub-scene, it starts with a small set of images and progressively optimizes the static NeRF $F_{\Theta}^{s}$ and the camera poses $P_i$ with the help of motion masks. Once accurate camera poses are established, they are fixed, and the dynamic NeRF $F_{\Theta}^{d}$ is activated to reconstruct dynamic objects, achieving self-supervised static-dynamic decomposition.
  • Figure 3: Histograms of shadow weight distributions in different scenes. The left column shows the synthesized RGB images, while the right column presents histograms of the shadow weight distributions corresponding to the pixels. The y-axis of each histogram is plotted on a logarithmic scale to better visualize the distribution. We also annotate each histogram with the exact counts and corresponding percentages. Please consider zooming in to view the detailed numbers.
  • Figure 4: Motion mask from RoDynRF. The motion masks generated by RoDynRF include stationary vehicles and pedestrians on the roadside. We do not use these masks to supervise static-dynamic decomposition; we use them only to exclude dynamic elements, which makes camera pose estimation more accurate and robust.
  • Figure 5: Qualitative results of dynamic novel view synthesis on the NOTR and Pandaset datasets. We visualize the images rendered from novel viewpoints. Compared to other methods, VDNeRF accurately reconstructs dynamic objects and synthesizes more detailed and photorealistic images.
  • ...and 6 more figures
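
The rendering step described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It uses the standard NeRF volume rendering quadrature to composite per-sample colors along one ray, once for the static branch and once for the dynamic branch. The function names `volume_render` and `blend_pixel` are hypothetical, and the per-pixel convex blend is an assumption chosen for illustration only: the caption says the blend uses shadow weights but does not give the formula, so this should not be read as the paper's equation.

```python
import numpy as np

def volume_render(sigma, color, deltas):
    """Standard NeRF quadrature for a single ray.

    sigma:  (N,) densities at the N samples
    color:  (N, 3) RGB at the N samples
    deltas: (N,) distances between adjacent samples
    Returns the rendered (3,) pixel color.
    """
    sigma = np.asarray(sigma, dtype=float)
    color = np.asarray(color, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    # Opacity contributed by each sample.
    alpha = 1.0 - np.exp(-sigma * deltas)
    # Transmittance: fraction of light reaching sample i unoccluded.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alpha[:-1])))
    weights = trans * alpha
    return weights @ color

def blend_pixel(c_static, c_dynamic, rho):
    """Hypothetical convex blend of static and dynamic pixel colors.

    ASSUMPTION: a simple linear mix controlled by rho in [0, 1];
    the source does not specify the actual blending formula.
    """
    rho = float(np.clip(rho, 0.0, 1.0))
    return (1.0 - rho) * np.asarray(c_static, dtype=float) \
        + rho * np.asarray(c_dynamic, dtype=float)

# Usage: render both branches for one ray, then blend them.
deltas = np.ones(3)
colors = np.eye(3)  # red, green, blue samples
c_s = volume_render([0.0, 50.0, 1.0], colors, deltas)  # static branch
c_d = volume_render([5.0, 0.0, 0.0], colors, deltas)   # dynamic branch
pixel = blend_pixel(c_s, c_d, rho=0.25)
```

In this sketch the two branches are rendered independently and mixed per pixel, which keeps the example short; a method that composites static and dynamic densities per sample along the ray would need a different formulation.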