OSN: Infinite Representations of Dynamic 3D Scenes from Monocular Videos
Ziyang Song, Jinxi Li, Bo Yang
TL;DR
OSN addresses the ill-posed problem of reconstructing dynamic 3D scenes from a single monocular video by pairing an object scale-invariant representation with an Object Scale Network that learns a scale range for each object. Using scaled composite rendering and soft Z-buffer-based supervision, OSN jointly optimizes per-object representations and scales, so that infinitely many faithful scene configurations can be sampled from the same video. Experiments on synthetic and real datasets show that OSN outperforms single-solution baselines in dynamic novel-view synthesis and depth fidelity, and qualitative and quantitative analyses examine the learned scale ranges. The approach thus yields multiple plausible 3D scene hypotheses from a single observation rather than a single reconstruction.
Abstract
It has long been challenging to recover the underlying dynamic 3D scene representations from a monocular RGB video. Existing works formulate this problem as finding a single most plausible solution by adding various constraints such as depth priors and strong geometry constraints, ignoring the fact that there could be infinitely many 3D scene representations corresponding to a single dynamic video. In this paper, we aim to learn all plausible 3D scene configurations that match the input video, instead of just inferring a specific one. To achieve this ambitious goal, we introduce a new framework, called OSN. The key to our approach is a simple yet innovative object scale network together with a joint optimization module to learn an accurate scale range for every dynamic 3D object. This allows us to sample as many faithful 3D scene configurations as possible. Extensive experiments show that our method surpasses all baselines and achieves superior accuracy in dynamic novel view synthesis on multiple synthetic and real-world datasets. Most notably, our method demonstrates a clear advantage in learning fine-grained 3D scene geometry. Our code and data are available at https://github.com/vLAR-group/OSN
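One source of the ambiguity described above can be illustrated with a toy pinhole camera: scaling an object about the camera center changes its depth and size but leaves its 2D projection untouched, so a single image cannot distinguish the scaled copies. The sketch below is only an illustration of that geometric fact, not the OSN implementation; the helper names `project` and `scale_about_camera`, the focal length, and the example vertices are all assumptions made for this example.

```python
# Illustrative sketch only: a pinhole camera at the origin looking down +z.
# `project` and `scale_about_camera` are hypothetical helpers, not OSN code.

def project(point, focal=1.0):
    """Perspective-project a 3D point (x, y, z) to 2D image coordinates."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

def scale_about_camera(points, s):
    """Scale an object's points by s about the camera center (the origin)."""
    return [(s * x, s * y, s * z) for (x, y, z) in points]

# A small "object": three vertices in front of the camera (made-up values).
obj = [(0.5, 0.25, 2.0), (1.0, -0.5, 4.0), (-0.75, 0.5, 2.5)]

# Every scale factor yields the same image, but a different depth.
for s in (0.5, 1.0, 2.0, 8.0):
    scaled = scale_about_camera(obj, s)
    for p, q in zip(obj, scaled):
        u0, v0 = project(p)
        u1, v1 = project(q)
        assert abs(u0 - u1) < 1e-12 and abs(v0 - v1) < 1e-12
        assert q[2] == s * p[2]  # depth scales with s
```

In a scene with several independently moving objects, each object admits its own such factor. As the TL;DR notes, OSN learns a range of admissible scales per object rather than committing to one value.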
