Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering

Yanpeng Zhao, Yiwei Hao, Siyu Gao, Yunbo Wang, Xiaokang Yang

TL;DR

DynaVol-S addresses unsupervised 3D dynamic scene decomposition from monocular video by introducing object-centric 4D voxel grids and a canonical-space deformation mechanism within a differentiable NeRF renderer. It couples per-object occupancy with global semantics through a semantic volume slot attention module, enabling explicit object geometries and improved decomposition while supporting scene editing via voxel manipulation and trajectory control. The approach uses a three-stage training pipeline (warmup, voxel initialization, and joint optimization) to jointly optimize geometry, appearance, and semantics, achieving state-of-the-art performance in novel view synthesis and unsupervised decomposition on both synthetic and real-world data. This work advances 3D scene understanding by enabling explicit object-level control and editing, offering practical benefits for downstream tasks in vision and simulation.

Abstract

Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects.
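
To make the idea of per-object occupancy probabilities inside a compositional renderer more concrete, the following is a minimal NumPy sketch of compositing several per-object density fields along a single camera ray. It is an illustration under stated assumptions, not the DynaVol-S implementation: the function name `composite_ray`, the density-normalized definition of the occupancy probability, and the toy inputs are all hypothetical, and the standard NeRF quadrature is used in place of whatever exact formulation the paper adopts.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas, eps=1e-10):
    """Composite N per-object fields sampled at S points along one ray.

    sigmas: (N, S) non-negative per-object densities (assumed input)
    colors: (N, S, 3) per-object RGB values at each sample
    deltas: (S,) distances between consecutive samples
    """
    sigma = sigmas.sum(axis=0)                        # total density, (S,)
    # Assumed definition: occupancy probability of object n at a sample is
    # its share of the total density there.
    probs = sigmas / (sigma + eps)                    # (N, S)
    color = (probs[..., None] * colors).sum(axis=0)   # mixed color, (S, 3)

    # Standard volume-rendering quadrature.
    alpha = 1.0 - np.exp(-sigma * deltas)             # opacity per sample
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha                           # contribution per sample

    pixel = (weights[:, None] * color).sum(axis=0)    # rendered RGB
    object_share = (probs * weights).sum(axis=1)      # per-object contribution
    return pixel, weights, probs, object_share

# Toy ray: a red object at sample 1 occludes a blue object at sample 3.
sigmas = np.array([[0.0, 50.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 50.0]])
colors = np.zeros((2, 4, 3))
colors[0, :, 0] = 1.0   # object 0 is red
colors[1, :, 2] = 1.0   # object 1 is blue
deltas = np.ones(4)

pixel, weights, probs, object_share = composite_ray(sigmas, colors, deltas)
segment_id = int(np.argmax(object_share))  # object that dominates this pixel
```

Taking the arg-max of `object_share` per pixel is one simple way such a representation could yield a 2D segmentation map without supervision; in this toy ray the front (red) object dominates because it blocks nearly all transmittance before the ray reaches the blue one.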

Paper Structure

This paper contains 21 sections, 12 equations, 20 figures, 12 tables, 1 algorithm.

Figures (20)

  • Figure 1: DynaVol-S explores unsupervised object-centric decomposition in 3D dynamic scenes within an inverse rendering framework. Unlike previous canonical-space neural rendering techniques, such as TiNeuVox (Fang et al., 2022), our approach integrates voxelized object-centric representations, which yields an explicit understanding of the object geometries and physical interactions in dynamic scenes and facilitates downstream tasks such as scene editing.
  • Figure 2: DynaVol-S consists of three groups of network components: the bi-directional dynamics modules ($f_\psi$, $f_\xi^\prime$), the volume slot attention module, and the neural rendering modules based on 3D and 4D voxels, respectively. The training scheme involves a warmup stage (top), an object-centric voxel grid initialization stage whose pseudocode is given in Alg. 1, and a multi-grid joint optimization stage (bottom). For clarity, we denote $\mathcal{V}^{\text{Color}}$ and $\mathcal{V}^{\text{Opac}}$ collectively as $\mathcal{V}^{\text{C/O}}$; the symbols and their descriptions are listed in the paper's notation table.
  • Figure 3: Novel view synthesis results. On the synthetic 3ObjRand examples (left), D-NeRF fails. On the real-world scenes (right), DynaVol-S significantly outperforms the prior art, TiNeuVox, by learning object-centric features.
  • Figure 4: Visualization of scene decomposition results for each object in the synthetic 6ObjFall scene.
  • Figure 5: Real-world scene decomposition results for Chicken and Torchocolates.
  • ...and 15 more figures