RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes

Thang-Anh-Quan Nguyen, Luis Roldão, Nathan Piasco, Moussab Bennehar, Dzmitry Tsishkou

TL;DR

RoDUS addresses the challenge of disentangling static and dynamic elements in large-scale urban scenes for NeRF-based rendering without extensive motion cues. It introduces a two-pathway NeRF with separate static and dynamic radiance fields, a 4D hash grid, and a semantic radiance field guided by a foreground-only mask to promote accurate decomposition. The training leverages a robust IRLS-based loss, sky and road regularization, and a bootstrapping strategy to stabilize optimization, achieving superior static background reconstruction and dynamic-object segmentation on KITTI-360 and Pandaset. The combination of robust initialization, semantic guidance, and targeted regularization yields improved decomposition quality and multi-view consistency, with implications for autonomous driving and urban scene understanding.
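To make the two-pathway idea concrete, the sketch below composites a single ray from a static and a dynamic field. It assumes the density-weighted blend that is common in dynamic NeRF work; this summary does not state RoDUS's exact rendering equation, and the function name `composite_two_fields` and its arguments are illustrative only.

```python
import numpy as np

def composite_two_fields(sigma_s, rgb_s, sigma_d, rgb_d, deltas):
    """Blend static and dynamic samples on one ray, then alpha-composite.

    sigma_s, sigma_d: (N,) densities from the static / dynamic field.
    rgb_s, rgb_d:     (N, 3) colours from the static / dynamic field.
    deltas:           (N,) distances between consecutive samples.
    Assumed blend (not confirmed by the source): densities add up and
    colours are averaged with the densities as weights.
    """
    sigma = sigma_s + sigma_d
    safe_sigma = np.maximum(sigma, 1e-10)  # avoid division by zero in empty space
    rgb = (sigma_s[:, None] * rgb_s + sigma_d[:, None] * rgb_d) / safe_sigma[:, None]

    # Standard volume rendering weights: opacity times accumulated transmittance.
    alpha = 1.0 - np.exp(-sigma * deltas)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = transmittance * alpha
    return (weights[:, None] * rgb).sum(axis=0), weights

# Toy ray: a blue dynamic object sits in front of a green static surface.
sigma_s = np.array([0.0, 50.0])
sigma_d = np.array([50.0, 0.0])
rgb_s = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
rgb_d = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
deltas = np.array([1.0, 1.0])

full_color, _ = composite_two_fields(sigma_s, rgb_s, sigma_d, rgb_d, deltas)
static_color, _ = composite_two_fields(sigma_s, rgb_s, np.zeros(2), rgb_d, deltas)
```

Rendering with the dynamic density set to zero, as in `static_color`, is how a static-only background image would be obtained under this assumption.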

Abstract

The task of separating dynamic objects from static environments using NeRFs has been widely studied in recent years. However, capturing large-scale scenes still poses a challenge due to their complex geometric structures and unconstrained dynamics. Without the help of 3D motion cues, previous methods often require simplified setups with slow camera motion and only one or a few dynamic actors, leading to suboptimal solutions in most urban setups. To overcome such limitations, we present RoDUS, a pipeline for decomposing static and dynamic elements in urban scenes, with thoughtfully separated NeRF models for moving and non-moving components. Our approach utilizes a robust kernel-based initialization coupled with 4D semantic information to selectively guide the learning process. This strategy enables accurate capturing of the dynamics in the scene, resulting in reduced floating artifacts in the reconstructed background, all by using self-supervision. Notably, experimental evaluations on the KITTI-360 and Pandaset datasets demonstrate the effectiveness of our method in decomposing challenging urban scenes into precise static and dynamic components.
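As a rough illustration of a robust kernel-based initialization, the sketch below computes binary IRLS (iteratively reweighted least squares) weights with a trimmed kernel: pixels whose photometric residual exceeds a quantile threshold are treated as outliers (likely moving objects) and dropped from the static loss. The quantile rule and the `keep_quantile` parameter are assumptions made for this example; the abstract does not specify the kernel RoDUS actually uses.

```python
import numpy as np

def trimmed_irls_weights(residuals, keep_quantile=0.8):
    """Binary weights: 1 for residuals at or below the quantile, 0 above it.

    keep_quantile is a hypothetical parameter chosen for illustration.
    """
    residuals = np.asarray(residuals, dtype=np.float64)
    threshold = np.quantile(residuals, keep_quantile)
    return (residuals <= threshold).astype(np.float64)

def robust_photometric_loss(rendered_static, target, keep_quantile=0.8):
    """Mean squared error over the pixels kept by the trimmed kernel."""
    per_pixel = np.mean((rendered_static - target) ** 2, axis=-1)
    weights = trimmed_irls_weights(per_pixel, keep_quantile)
    # In IRLS the weights are recomputed from the current residuals at
    # every iteration and treated as constants when taking gradients.
    return float(np.sum(weights * per_pixel) / max(np.sum(weights), 1.0))

# Five pixels; the fourth has a large residual, e.g. a passing car.
residuals = np.array([0.01, 0.02, 0.01, 0.90, 0.02])
weights = trimmed_irls_weights(residuals)
```

In this toy case only the high-residual pixel receives zero weight, so it no longer pulls the static field toward explaining the moving object.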

Paper Structure

This paper contains 27 sections, 11 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: RoDUS is a neural scene representation designed to decompose 4D dynamic scenes into two elements: moving foreground and static background. This decomposition remains consistent across all photometric, geometric, and semantic aspects.
  • Figure 2: RoDUS's architecture. Our model comprises two separate branches that take as input sampled positions $\mathbf{x}$, viewing direction $\mathbf{d}$, and their timestamp $t$, and generate outputs for every query coordinate. Each branch is represented by a separate hash grid and the MLP-based neural function, which predicts colors, densities, and semantics. The rendered static RGB is used to calculate IRLS map $\omega(\epsilon)$ during the robust initialization step (\ref{sec:loss}), while dynamic semantic outputs are passed through a "foreground-only mask" to prevent over-explaining background regions (\ref{sec:semantic}).
  • Figure 2: Ablation study. Impact of our design choices on the static NVS task.
  • Figure 3: (a) We apply a "foreground-only mask" to the dynamic semantic head. In the forward pass (black arrows), the mask prevents the dynamic branch from outputting pixels that do not belong to foreground classes. In the backward pass (orange arrows), it restricts the dynamic field from learning background pixels, which may primarily come from noisy annotations. (b) Predictions generated by a 2D segmentation model are noisy and inconsistent between views. (c) Since the dynamic field includes a temporal dimension, it ends up learning these conflicting labels to satisfy the overall loss. (d) Our proposed mask is used to tackle this problem.
  • Figure 4: (b) Using a trimmed kernel effectively removes all moving cars from the scene. (c, d, e) As the sky and road regions are badly reconstructed, applying class constraints can aid the learning process of the static model.
  • ...and 11 more figures
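
The foreground-only mask described in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes the mask acts on the class channels of the dynamic semantic output; the caption does not say whether the mask is defined per class or per pixel. The class list, `FG_CLASS_MASK`, and both function names are hypothetical.

```python
import numpy as np

# Hypothetical label set: only "car" and "person" count as foreground.
CLASSES = ["road", "building", "car", "person"]
FG_CLASS_MASK = np.array([0.0, 0.0, 1.0, 1.0])

def foreground_only_forward(dyn_semantics, fg_class_mask=FG_CLASS_MASK):
    """Forward pass: zero every background channel of the dynamic semantic output."""
    return dyn_semantics * fg_class_mask

def foreground_only_backward(grad_output, fg_class_mask=FG_CLASS_MASK):
    """Backward pass: the derivative of an element-wise product with a constant
    mask is the mask itself, so background channels receive zero gradient."""
    return grad_output * fg_class_mask

# One pixel whose dynamic head spreads probability over all four classes.
dyn_semantics = np.array([[0.5, 0.2, 0.2, 0.1]])
masked = foreground_only_forward(dyn_semantics)
grads = foreground_only_backward(np.ones_like(dyn_semantics))
```

Because background channels are zeroed in both directions, noisy background labels from the 2D segmentation model cannot be absorbed by the time-dependent dynamic field in this sketch.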