Predicting 3D representations for Dynamic Scenes
Di Qi, Tong Yang, Beining Wang, Xiangyu Zhang, Wenqiang Zhang
TL;DR
This paper addresses the challenge of 4D physical world modeling by predicting explicit 3D representations at future times from monocular video. It introduces an ego-centric unbounded triplane to represent dynamic scenes and a 4D-aware transformer to update this representation using sequences of monocular frames, trained end-to-end with a NeRF-based decoder and a temporal 3D constraint. The approach demonstrates strong 3D future-prediction capabilities on NVIDIA Dynamic Scenes and generalizes well to unseen domains like DAVIS, while revealing emergent geometry and semantic learning from self-supervised training. The combination of a compact, view-centered 3D representation and cross-time feature aggregation enables robust 4D scene understanding and suggests a path toward next-3D prediction for spatial intelligence.
Abstract
We present a novel framework for dynamic radiance field prediction from monocular video streams. Unlike previous methods, which primarily focus on predicting future frames, our method goes a step further by generating explicit 3D representations of the dynamic scene. The framework builds on two core designs. First, we adopt an ego-centric unbounded triplane to explicitly represent the dynamic physical world. Second, we develop a 4D-aware transformer that aggregates features from monocular videos to update the triplane. Coupling these two designs enables us to train the proposed model on large-scale monocular videos in a self-supervised manner. Our model achieves top results in dynamic radiance field prediction on NVIDIA Dynamic Scenes, demonstrating strong performance on 4D physical world modeling. In addition, our model shows superior generalization to unseen scenarios. Notably, we find that capabilities for geometry and semantic learning emerge from our approach.
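To make the idea of an ego-centric unbounded triplane more concrete, the following is a minimal sketch of how such a representation could be queried. It is not the authors' implementation. It assumes the camera sits at the origin, that unbounded space is squashed into a bounded region with a mip-NeRF 360 style contraction, and that features from the three planes are summed. The names `contract`, `bilinear_sample`, and `query_triplane` are hypothetical, and the abstract does not state which contraction or aggregation the paper uses.

```python
import numpy as np

def contract(points):
    """Map unbounded 3D points into a ball of radius 2.

    Assumption: a mip-NeRF 360 style contraction. Points inside the unit
    ball are unchanged; points outside are squashed toward radius 2.
    """
    norm = np.linalg.norm(points, axis=-1, keepdims=True)
    safe = np.maximum(norm, 1e-9)  # avoid division by zero at the origin
    squashed = (2.0 - 1.0 / safe) * (points / safe)
    return np.where(norm <= 1.0, points, squashed)

def bilinear_sample(plane, uv):
    """Bilinearly sample a feature plane.

    plane: (R, R, C) feature grid with R >= 2.
    uv:    (N, 2) coordinates in [-1, 1].
    Returns an (N, C) array of interpolated features.
    """
    res = plane.shape[0]
    grid = np.clip((uv + 1.0) * 0.5 * (res - 1), 0.0, res - 1)
    lo = np.clip(np.floor(grid).astype(int), 0, res - 2)  # lower corner index
    frac = grid - lo
    i0, j0 = lo[:, 0], lo[:, 1]
    fi, fj = frac[:, 0:1], frac[:, 1:2]
    return (plane[i0, j0] * (1 - fi) * (1 - fj)
            + plane[i0 + 1, j0] * fi * (1 - fj)
            + plane[i0, j0 + 1] * (1 - fi) * fj
            + plane[i0 + 1, j0 + 1] * fi * fj)

def query_triplane(planes, points):
    """Look up features for camera-centered 3D points.

    planes: dict with keys "xy", "xz", "yz", each an (R, R, C) array.
    points: (N, 3) points in the camera (ego-centric) frame.
    Assumption: the three plane features are aggregated by summation.
    """
    p = contract(points) / 2.0  # rescale the radius-2 ball into [-1, 1]
    return (bilinear_sample(planes["xy"], p[:, [0, 1]])
            + bilinear_sample(planes["xz"], p[:, [0, 2]])
            + bilinear_sample(planes["yz"], p[:, [1, 2]]))
```

In a full pipeline, the returned features would be passed to a NeRF-style decoder that outputs density and color for volume rendering. The contraction is what lets a fixed-resolution grid cover distant background, at the cost of coarser resolution far from the camera.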
