Predicting 3D representations for Dynamic Scenes

Di Qi, Tong Yang, Beining Wang, Xiangyu Zhang, Wenqiang Zhang

TL;DR

This paper addresses the challenge of 4D physical world modeling by predicting explicit 3D representations at future times from monocular video. It introduces an ego-centric unbounded triplane to represent dynamic scenes and a 4D-aware transformer to update this representation using sequences of monocular frames, trained end-to-end with a NeRF-based decoder and a temporal 3D constraint. The approach demonstrates strong 3D future-prediction capabilities on NVIDIA Dynamic Scenes and generalizes well to unseen domains like DAVIS, while revealing emergent geometry and semantic learning from self-supervised training. The combination of a compact, view-centered 3D representation and cross-time feature aggregation enables robust 4D scene understanding and suggests a path toward next-3D prediction for spatial intelligence.

Abstract

We present a novel framework for dynamic radiance field prediction given monocular video streams. Unlike previous methods that primarily focus on predicting future frames, our method goes a step further by generating explicit 3D representations of the dynamic scene. The framework builds on two core designs. First, we adopt an ego-centric unbounded triplane to explicitly represent the dynamic physical world. Second, we develop a 4D-aware transformer to aggregate features from monocular videos to update the triplane. Coupling these two designs enables us to train the proposed model with large-scale monocular videos in a self-supervised manner. Our model achieves top results in dynamic radiance field prediction on NVIDIA Dynamic Scenes, demonstrating its strong performance on 4D physical world modeling. In addition, our model generalizes well to unseen scenarios. Notably, we find that capabilities for geometry and semantic learning emerge in our approach.

Paper Structure

This paper contains 51 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of our method. The process starts with an image encoder that extracts 2D image features from the video frames, serving as a prior. These features are then processed by a 4D-aware transformer, which updates a learnable triplane representation. Next, an upsampler refines and enlarges the triplane. During training, a NeRF decoder generates target views via volumetric rendering, and a photometric loss is applied to optimize the model. The FFN module in the transformer is hidden in the figure.
  • Figure 2: Temporal-aware View-Attention and Axis-Attention modules in the Transformer. (a) Temporal-aware View-Attention Module: At the target time $t=S+1$, 3D virtual points are uniformly sampled within the triplane. A given point $\mathbf{x}_{i,j,k}$ is projected along the three axes onto the triplane features $\mathbf{T}_{xy}, \mathbf{T}_{yz}$, and $\mathbf{T}_{xz}$ to obtain the corresponding 3D query feature $\mathbf{q}_{i,j,k}$. Simultaneously, $\mathbf{x}_{i,j,k}$ is mapped onto the image feature maps to obtain epipolar features $\{\mathbf{f}_t\}_{t=1}^{S}$ from the source video frames. The temporal-aware view-attention module then integrates these epipolar features across the time steps $t = 1, \dots, S$, producing an updated 3D query feature $\hat{\mathbf{q}}_{i,j,k}$. (b) Axis-Attention Module: A triplane feature $\mathbf{p}_{i,j}$ at pixel $(i, j)$ in plane $\mathbf{T}_{xy}$ is associated with the point features along the z-axis, $\{\hat{\mathbf{q}}_{i,j,k}\}_{k=1}^L$. The axis-attention module aggregates these point features to generate a refined triplane feature $\hat{\mathbf{p}}_{i, j}$.
  • Figure 3: Qualitative Comparison on the NVIDIA Dynamic Scenes Dataset. Our method significantly outperforms GNT and PGDVS$^{\dagger}$ on both dynamic and static content. For dynamic objects (first two columns), our approach delivers precise motion and avoids the motion blur seen with PGDVS$^{\dagger}$. For static content (last two columns), our method renders a clear background, whereas PGDVS$^{\dagger}$ and GNT produce blurred backgrounds due to limited depth priors.
  • Figure 4: Qualitative Comparison on the DAVIS Dataset. Our model produces high-quality novel views on the DAVIS dataset, including indoor, outdoor, dynamic, and static settings. Notably, the blackswan (last column) is not present in our training data, showing the robust generalization capabilities of our method.
  • Figure 5: Reconstructed depth maps on NVIDIA Dynamic Scenes. Red indicates closer distances, while blue denotes farther distances.
  • ...and 3 more figures
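
The two operations described in the Figure 2 caption, projecting a 3D point onto the three planes to form a query feature and aggregating point features along one axis into a plane feature, can be illustrated with a small NumPy sketch. This is not the authors' implementation. Several details are assumptions made only for illustration: points are taken to lie in $[-1, 1]^3$ (the unbounded contraction is omitted), the three sampled plane features are combined by summation, and the attention is single-head dot-product attention without learned projections. The function names `bilinear_sample`, `triplane_query`, and `axis_attention` are hypothetical.

```python
import numpy as np


def bilinear_sample(plane, a, b):
    """Bilinearly interpolate a (R, R, C) feature plane at continuous
    grid coordinates (a, b), where plane[a, b] is the indexing order."""
    n_a, n_b, _ = plane.shape
    a = float(np.clip(a, 0.0, n_a - 1.0))
    b = float(np.clip(b, 0.0, n_b - 1.0))
    a0, b0 = int(np.floor(a)), int(np.floor(b))
    a1, b1 = min(a0 + 1, n_a - 1), min(b0 + 1, n_b - 1)
    da, db = a - a0, b - b0
    low = (1 - db) * plane[a0, b0] + db * plane[a0, b1]
    high = (1 - db) * plane[a1, b0] + db * plane[a1, b1]
    return (1 - da) * low + da * high


def triplane_query(T_xy, T_yz, T_xz, x):
    """Project point x = (x, y, z) in [-1, 1]^3 along each axis onto the
    three planes and combine the sampled features.
    Assumption: features are summed; the paper may combine them differently."""
    R = T_xy.shape[0]
    gx, gy, gz = (np.asarray(x, dtype=float) + 1.0) * 0.5 * (R - 1)
    f_xy = bilinear_sample(T_xy, gx, gy)  # drop z
    f_yz = bilinear_sample(T_yz, gy, gz)  # drop x
    f_xz = bilinear_sample(T_xz, gx, gz)  # drop y
    return f_xy + f_yz + f_xz


def softmax(z):
    z = z - np.max(z)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def axis_attention(p, q_hat):
    """Aggregate L point features q_hat (L, C) along one axis into a single
    refined plane feature, using the plane feature p (C,) as the query.
    Assumption: plain scaled dot-product attention, no learned weights."""
    scores = q_hat @ p / np.sqrt(p.shape[0])
    weights = softmax(scores)
    return weights @ q_hat


# Toy usage: random planes with resolution R = 4 and C = 8 channels.
rng = np.random.default_rng(0)
R, C, L = 4, 8, 5
T_xy, T_yz, T_xz = (rng.normal(size=(R, R, C)) for _ in range(3))
q = triplane_query(T_xy, T_yz, T_xz, x=(0.2, -0.5, 0.9))
p_hat = axis_attention(T_xy[1, 2], rng.normal(size=(L, C)))
```

In the toy usage, `q` plays the role of $\mathbf{q}_{i,j,k}$ and `p_hat` the role of $\hat{\mathbf{p}}_{i,j}$. The temporal-aware view-attention step that turns $\mathbf{q}_{i,j,k}$ into $\hat{\mathbf{q}}_{i,j,k}$ using epipolar image features is left out, since it depends on camera parameters not given here.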