OSN: Infinite Representations of Dynamic 3D Scenes from Monocular Videos

Ziyang Song, Jinxi Li, Bo Yang

TL;DR

OSN addresses the ill-posed problem of reconstructing dynamic 3D scenes from a single monocular video by proposing an object scale-invariant representation paired with an Object Scale Network that learns per-object scale ranges. Through scaled composite rendering and a soft Z-buffer-based supervision, OSN jointly optimizes per-object representations and scales, enabling sampling of infinitely many faithful scene configurations from the same video. Empirical results on synthetic and real datasets show OSN outperforms single-solution baselines in dynamic novel-view synthesis and depth fidelity, while also revealing the learned scale ranges through qualitative and quantitative analyses. This approach broadens the understanding of monocular dynamic reconstruction and enables generation of multiple plausible 3D scene hypotheses from a single observation.

Abstract

It has long been challenging to recover the underlying dynamic 3D scene representations from a monocular RGB video. Existing works formulate this problem as finding a single most plausible solution by adding various constraints such as depth priors and strong geometry constraints, ignoring the fact that there could be infinitely many 3D scene representations corresponding to a single dynamic video. In this paper, we aim to learn all plausible 3D scene configurations that match the input video, instead of just inferring a specific one. To achieve this ambitious goal, we introduce a new framework, called OSN. The key to our approach is a simple yet innovative object scale network together with a joint optimization module to learn an accurate scale range for every dynamic 3D object. This allows us to sample as many faithful 3D scene configurations as possible. Extensive experiments show that our method surpasses all baselines and achieves superior accuracy in dynamic novel view synthesis on multiple synthetic and real-world datasets. Most notably, our method demonstrates a clear advantage in learning fine-grained 3D scene geometry. Our code and data are available at https://github.com/vLAR-group/OSN
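
The claim that many 3D configurations can match one video rests on the standard scale ambiguity of perspective projection. The sketch below illustrates only that ambiguity for a single object in the camera frame. It is not the OSN representation or its optimization, and the intrinsics (`focal`, `cx`, `cy`), the point set `obj`, and all function names are hypothetical values chosen for illustration.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def project(points: List[Point3D], focal: float = 500.0,
            cx: float = 320.0, cy: float = 240.0) -> List[Point2D]:
    """Pinhole projection of camera-frame points (z > 0) to pixel coordinates."""
    return [(focal * x / z + cx, focal * y / z + cy) for x, y, z in points]


def scale_about_camera(points: List[Point3D], s: float) -> List[Point3D]:
    """Scale an object about the camera center: it becomes s times larger
    and s times farther away at the same time."""
    return [(s * x, s * y, s * z) for x, y, z in points]


def max_pixel_difference(a: List[Point2D], b: List[Point2D]) -> float:
    """Largest per-coordinate gap between two sets of projected points."""
    return max(max(abs(u1 - u2), abs(v1 - v2))
               for (u1, v1), (u2, v2) in zip(a, b))


# Hypothetical object: four corner points a few units in front of the camera.
obj: List[Point3D] = [(-0.5, -0.5, 4.0), (0.5, -0.5, 4.0),
                      (0.5, 0.5, 5.0), (-0.5, 0.5, 5.0)]

reference = project(obj)
for s in (0.5, 1.0, 2.0):
    diff = max_pixel_difference(reference, project(scale_about_camera(obj, s)))
    print(f"scale={s}: max pixel difference = {diff:.2e}")
```

Every scale factor yields the same pixels, so a single view cannot distinguish a small, near object from a large, far one. With several objects in a scene, not every combination of scales need remain plausible, which is presumably why the abstract speaks of learning a scale range per object rather than allowing arbitrary scales.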

Paper Structure (25 sections, 19 equations, 22 figures, 10 tables)

Figures (22)

  • Figure 1: An illustration of multiple correct 3D scene configurations that match the same dynamic monocular video.
  • Figure 2: An illustration of our framework. Given a dynamic video as input, our Object Scale-invariant Representation module (the blue block) and the Object Scale Network (the orange block) aim to represent all faithful 3D scene representations, allowing infinite sampling of different 3D scenes (the rightmost block) after they are jointly optimized. Circles highlight the differences between the two scenes.
  • Figure 3: The yellow block shows that the input video will first be preprocessed into per-object information. After that, the shape and appearance of each dynamic object will be separately represented by a scale-invariant TensoRF model as shown by the light blue block.
  • Figure 4: An illustration of our object scale network.
  • Figure 5: Qualitative results of dynamic novel view RGB/depth synthesis on the Dynamic Indoor Scene and Oxford Multimotion Datasets.
  • ...and 17 more figures