4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere

Yihang Luo, Shangchen Zhou, Yushi Lan, Xingang Pan, Chen Change Loy

TL;DR

4RC tackles dynamic 3D scene understanding from monocular video by learning a unified 4D representation that jointly encodes geometry and motion. It employs an encode-once, query-anywhere paradigm: a ViT-based encoder produces a compact 4D latent $\mathcal{F}$ from the full video, and a conditional decoder retrieves base geometry and time-dependent motion through a factorized output $P_i^{t_i\rightarrow\tau}=P_i^{t_i}+\Delta P_i^{t_i\rightarrow\tau}$ for arbitrary source-target pairs. The method demonstrates state-of-the-art performance across 4D reconstruction tasks (including dense and sparse motion tracking, camera pose estimation, and multi-view reconstruction) while maintaining efficiency and flexibility, with a streaming variant S-4RC for online operation. These results highlight the practical potential of unified, queryable 4D representations for robotics, AR/VR, and content creation, and point to future work in scaling data and handling more chaotic dynamics.
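
The factorized output described above, base geometry plus time-dependent relative motion, can be illustrated with a toy example. This is a minimal sketch and not the authors' implementation: the function name `compose_points`, the list-of-tuples point format, and all numeric values are assumptions made for illustration only.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def compose_points(base: List[Point], delta: List[Point]) -> List[Point]:
    """Add a per-point displacement to base geometry.

    Mirrors P_i^{t_i -> tau} = P_i^{t_i} + Delta P_i^{t_i -> tau},
    applied independently to each 3D point.
    """
    if len(base) != len(delta):
        raise ValueError("base geometry and motion must have the same length")
    return [
        (bx + dx, by + dy, bz + dz)
        for (bx, by, bz), (dx, dy, dz) in zip(base, delta)
    ]


# Toy base geometry of source frame i at its own timestamp t_i (made-up values).
base_geometry = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
# Toy relative motion to a target timestamp tau: the first point is static,
# the second point moves (made-up values).
relative_motion = [(0.0, 0.0, 0.0), (0.5, 0.0, -0.25)]

points_at_tau = compose_points(base_geometry, relative_motion)
print(points_at_tau)  # [(0.0, 0.0, 2.0), (1.5, 0.0, 1.75)]
```

In this sketch a zero displacement returns the base geometry unchanged, which is what one would expect when the target timestamp $\tau$ equals the source timestamp $t_i$.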

Abstract

We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
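
The encode-once, query-anywhere-and-anytime pattern from the abstract can be sketched as an interface: build a latent a single time, then answer many (source frame, target timestamp) queries from it. The sketch below is only a toy stand-in under stated assumptions. The class `ToyQueryable4D`, its methods, and the constant-velocity "latent" are hypothetical; the actual 4RC latent is learned by a transformer and is not a velocity table.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


class ToyQueryable4D:
    """Hypothetical stand-in for an encode-once, query-many 4D model."""

    def __init__(self, points_at_t0: List[Point], velocities: List[Point],
                 timestamps: List[float]):
        # "Encode once": store everything later queries need. Here the
        # latent is an assumed constant-velocity description of the scene.
        self.p0 = points_at_t0
        self.v = velocities
        self.timestamps = timestamps

    def base_geometry(self, src: int) -> List[Point]:
        """Points of source frame `src` at that frame's own timestamp."""
        t = self.timestamps[src]
        return [(px + vx * t, py + vy * t, pz + vz * t)
                for (px, py, pz), (vx, vy, vz) in zip(self.p0, self.v)]

    def relative_motion(self, src: int, tau: float) -> List[Point]:
        """Displacement of each point from the source timestamp to `tau`."""
        dt = tau - self.timestamps[src]
        return [(vx * dt, vy * dt, vz * dt) for (vx, vy, vz) in self.v]

    def query(self, src: int, tau: float) -> List[Point]:
        """Base geometry plus relative motion for an arbitrary (src, tau)."""
        base = self.base_geometry(src)
        delta = self.relative_motion(src, tau)
        return [(bx + dx, by + dy, bz + dz)
                for (bx, by, bz), (dx, dy, dz) in zip(base, delta)]


# One "encoding" of a toy scene with a single point moving along x.
model = ToyQueryable4D(
    points_at_t0=[(0.0, 0.0, 2.0)],
    velocities=[(0.5, 0.0, 0.0)],
    timestamps=[0.0, 1.0, 2.0],
)

# Many queries against the same stored latent, with no re-encoding.
print(model.query(src=1, tau=1.0))  # [(0.5, 0.0, 2.0)]
print(model.query(src=1, tau=2.0))  # [(1.0, 0.0, 2.0)]
print(model.query(src=2, tau=0.5))  # [(0.25, 0.0, 2.0)]
```

The point of the design is that the expensive step happens once in the constructor, while each query is cheap and may use a target timestamp that does not coincide with any input frame, as in the last call.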

Paper Structure (24 sections, 8 equations, 7 figures, 8 tables)

Figures (7)

  • Figure 1: 4RC (pronounced "ARC") enables unified and complete 4D Reconstruction via Conditional querying from monocular videos in a single feed-forward pass. It jointly recovers camera poses and dense per-frame geometry, while supporting flexible querying of dense 3D motion from arbitrary source frames to any target timestamp.
  • Figure 2: Overall architecture of 4RC. Video frames are patchified and augmented with camera and time tokens, then jointly encoded by a single transformer into a compact 4D latent representation $\mathcal{F}$, from which a conditional decoder with disentangled geometry and motion heads enables flexible querying of 3D geometry and motion for arbitrary source views at arbitrary target timestamps.
  • Figure 3: Qualitative comparison of dynamic tracking on DAVIS [davis_CVPR_2016]. We visualize the dynamic reconstruction results, including the geometry at the first and last frames, as well as the dynamic object trajectories rendered as rainbow-colored paths from the first view. As shown in the top example, our method successfully handles occlusion when the motorcycle becomes temporarily invisible. In contrast, the two-view method St4RTrack lacks global temporal context and therefore predicts an incorrect trajectory. In the second and third examples, our method accurately reconstructs complex and large-scale motions while preserving high-quality geometry, whereas other methods produce inconsistent motion trajectories and degraded geometry.
  • Figure 4: Visualization of in-the-wild examples. 4RC demonstrates accurate geometry reconstruction and motion modeling in both static and dynamic scenes.
  • Figure 5: Qualitative ablation visualizations. The first row shows the effectiveness of cross-attention in the motion head: without it, although the model outputs rough trajectories, it fails to capture fine details such as the motion of the girl's legs and hands when she is at the peak of a jump. The second row illustrates that outputting motion as point clouds can lead to inconsistent trajectories as it requires re-predicting base geometry for each time step.
  • ...and 2 more figures