ProSGNeRF: Progressive Dynamic Neural Scene Graph with Frequency Modulated Foundation Model in Urban Scenes

Tianchen Deng, Yanbo Wang, Yejia Liu, Chenpeng Su, Jingchuan Wang, Danwei Wang, Shao-Yuan Lo, Weidong Chen

TL;DR

ProSGNeRF addresses the challenge of reconstructing and rendering large-scale urban scenes with multiple dynamic objects under significant ego-motion. It introduces a Progressive Neural Scene Graph that decouples dynamic objects, background, and far-field, augmented by a frequency-modulated foundation model and DINOv2 priors to handle sparse observations. The approach achieves state-of-the-art view synthesis, supports object manipulation and scene roaming, and scales to city-scale sequences through local, overlapping scene graphs and robust geometry regularization. This yields practical impact for autonomous driving, city-scale simulation, and AR/VR visualization in dynamic urban environments.

Abstract

Implicit neural representations have demonstrated promising results in 3D reconstruction across various scenes. However, existing approaches either struggle to model fast-moving objects or cannot handle large-scale camera ego-motion in urban environments, which leads to low-quality synthesized views of large-scale urban scenes. In this paper, we aim to jointly solve the problems caused by large-scale scenes and fast-moving vehicles, a setting that is both more practical and more challenging. To this end, we propose a progressive scene graph network architecture that learns local scene representations of dynamic objects and global urban scenes. The progressive learning architecture dynamically allocates a new local scene graph trained on frames within a temporal window, with the window size determined automatically, allowing the representation to scale to arbitrarily large scenes. In addition, we observe that the training views of dynamic objects are relatively sparse because of their rapid movement, which significantly degrades reconstruction accuracy for those objects. We therefore use a foundation model network to encode the latent code. Specifically, we leverage the generalization capability of the visual foundation model DINOv2 to extract appearance and shape codes, and train the network on a large-scale urban scene object dataset to strengthen its prior for sparse-view dynamic inputs. In parallel, we introduce a frequency-modulated module that regularizes the frequency spectrum of objects, addressing the challenge of sparse image inputs from a frequency-domain perspective. Experimental results demonstrate that our method achieves state-of-the-art view synthesis accuracy, object manipulation, and scene roaming ability in various scenes.
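The progressive allocation described in the abstract can be illustrated with a small sketch. The abstract says only that the window size is determined automatically; the rule used here (close a window once the camera has travelled a fixed path length) is an assumption for illustration, not the paper's criterion. The function name `allocate_windows` and the parameters `max_travel` and `overlap` are likewise hypothetical.

```python
import math


def allocate_windows(positions, max_travel, overlap):
    """Split frame indices into overlapping temporal windows.

    positions:  list of (x, y, z) camera centers, one per frame.
    max_travel: ASSUMED criterion: maximum camera path length per window.
    overlap:    number of trailing frames shared with the next window.
    Returns a list of inclusive (first_frame, last_frame) index pairs.
    """
    windows = []
    n = len(positions)
    start = 0
    while start < n:
        end = start
        travelled = 0.0
        # Grow the window until adding the next frame would exceed max_travel.
        while end + 1 < n:
            step = math.dist(positions[end], positions[end + 1])
            if travelled + step > max_travel:
                break
            travelled += step
            end += 1
        windows.append((start, end))
        if end == n - 1:
            break
        # Re-use the last `overlap` frames, but always advance by at least one.
        start = max(end - overlap + 1, start + 1)
    return windows


# Ten frames spaced one unit apart along the x-axis.
frames = [(float(i), 0.0, 0.0) for i in range(10)]
windows = allocate_windows(frames, max_travel=4.0, overlap=2)
print(windows)
```

Each window would correspond to one local scene graph trained only on its own frames; the shared frames between consecutive windows play the role of the overlapping regions that keep adjacent local graphs consistent.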

Paper Structure

This paper contains 18 sections, 15 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: Urban scene reconstruction and editing with ProSGNeRF. We show our view synthesis at different time steps (65, 262) and scene decomposition results. Our approach significantly improves the view synthesis performance in real-world urban scenes containing multiple dynamic objects and large-scale camera ego-motion. We highlight and enlarge objects in the first column of images, providing the corresponding object PSNR in the top-left corner. Scene PSNR is provided in the second column.
  • Figure 2: The isometric view of the proposed method, ProSGNeRF. We employ a 2D segmentation network, SAM, to preprocess the training data and generate accurate masks for dynamic objects. We propose a progressive neural scene graph architecture that dynamically allocates local neural scene graphs (boxes). The entire scene is decomposed into three parts: background, dynamic objects, and far-field. We design separate networks for background and objects and introduce a far-field loss for regularization. Nodes $l_i$ represent individual dynamic objects. $F_{bkg}$ models the static background scene and $F_{obj}$ models movable foreground objects in local object-centric coordinate frames.
  • Figure 3: The isometric view of the proposed progressive neural scene graph. This scene representation uses a progressive scheme that dynamically allocates a local neural scene graph (box) based on the camera pose. Adjacent local scene graphs have overlapping regions to maintain global consistency. The leaf nodes $l$ are visualized as boxes with their local Cartesian coordinate axes. We also visualize the transformations and scaling between root and leaf coordinate frames using arrows annotated with transformation and scale matrices. The overall representation model is denoted as $F_{\theta}$.
  • Figure 4: Ray-Box Intersection. We use an AABB ray-box sampling strategy. Each box is defined as an axis-aligned bounding box (AABB) with minimum bound $[-1,-1,-1]$ and maximum bound $[1,1,1]$.
  • Figure 5: Qualitative results on reconstruction and novel scene arrangements of a scene from the KITTI dataset for NeRF, NSG, SUDS, MARS, PVG, and our method. From left to right, the images correspond to different timesteps captured in the dynamic scene. The PSNR value of each scene is shown in the bottom-right corner. NeRF and SUDS are limited in their ability to properly represent the dynamic parts of the scene. In contrast, our neural scene graph method achieves high-quality view synthesis results for both reconstruction and novel scene synthesis, regardless of scene dynamics.
  • ...and 6 more figures
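
The ray-box intersection in Figure 4, combined with the root-to-leaf transformation and scaling in Figure 3, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it uses the standard slab method against the box $[-1,-1,-1]$ to $[1,1,1]$, and it assumes each object box is described by a center, a yaw angle about the z-axis, and per-axis half-extents. The function names and that parameterization are hypothetical.

```python
import math


def world_to_local(vec, center, yaw, half_extents, is_direction=False):
    """Map a world-space point (or direction) into an object's unit-box frame.

    ASSUMED parameterization: translate by -center, rotate by -yaw about z,
    then divide by the box half-extents so the box becomes [-1, 1]^3.
    Directions skip the translation and are not renormalized, so ray
    parameters t keep their world-space meaning.
    """
    if is_direction:
        dx, dy, dz = vec
    else:
        dx, dy, dz = (v - c for v, c in zip(vec, center))
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    x = cos_y * dx + sin_y * dy    # inverse (transpose) of the yaw rotation
    y = -sin_y * dx + cos_y * dy
    sx, sy, sz = half_extents
    return (x / sx, y / sy, dz / sz)


def ray_unit_box(origin, direction, eps=1e-12):
    """Slab test against the AABB with bounds [-1,-1,-1] and [1,1,1].

    Returns (t_near, t_far) with t_near clamped to 0, or None on a miss.
    """
    t_near, t_far = float("-inf"), float("inf")
    for o, d in zip(origin, direction):
        if abs(d) < eps:
            # Ray is parallel to this slab: it misses if it starts outside.
            if o < -1.0 or o > 1.0:
                return None
            continue
        t0, t1 = (-1.0 - o) / d, (1.0 - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
    if t_near > t_far or t_far < 0.0:
        return None
    return (max(t_near, 0.0), t_far)


# A box centered at x = 10 that is 4 units long along x, hit by a ray along +x.
center, yaw, half = (10.0, 0.0, 0.0), 0.0, (2.0, 1.0, 1.0)
o_local = world_to_local((0.0, 0.0, 0.0), center, yaw, half)
d_local = world_to_local((1.0, 0.0, 0.0), center, yaw, half, is_direction=True)
hit = ray_unit_box(o_local, d_local)
print(hit)
```

Because every object is normalized to the same unit box, a single intersection routine serves all leaf nodes. Samples for an object network such as $F_{obj}$ would then be drawn only between `t_near` and `t_far`.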