4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere
Yihang Luo, Shangchen Zhou, Yushi Lan, Xingang Pan, Chen Change Loy
TL;DR
4RC tackles dynamic 3D scene understanding from monocular video by learning a unified 4D representation that jointly encodes geometry and motion. It employs an encode-once, query-anywhere paradigm: a ViT-based encoder produces a compact 4D latent $\mathcal{F}$ from the full video, and a conditional decoder retrieves base geometry and time-dependent motion through a factorized output $P_i^{t_i\rightarrow\tau}=P_i^{t_i}+\Delta P_i^{t_i\rightarrow\tau}$ for arbitrary source-target pairs. The method demonstrates state-of-the-art performance across 4D reconstruction tasks, including dense and sparse motion tracking, camera pose estimation, and multi-view reconstruction, while maintaining efficiency and flexibility, with a streaming variant S-4RC for online operation. These results highlight the practical potential of unified, queryable 4D representations for robotics, AR/VR, and content creation, and point to future work in scaling data and handling more chaotic dynamics.
Abstract
We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.
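As a rough illustration of the factorized form, the sketch below composes a queried point set as base geometry plus a time-dependent displacement, following $P_i^{t_i\rightarrow\tau}=P_i^{t_i}+\Delta P_i^{t_i\rightarrow\tau}$. It is a toy under stated assumptions, not the 4RC decoder: `toy_displacement` is a hypothetical constant-velocity stand-in for the learned motion output, the point values are made up, and the property that the displacement vanishes when $\tau=t_i$ is an assumed reading of the formula rather than something the abstract states.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def toy_displacement(velocity: Point, t_src: float, t_tgt: float,
                     n_points: int) -> List[Point]:
    """Hypothetical stand-in for the decoder's motion output.

    Assumes every point moves with the same constant velocity, so the
    displacement from t_src to t_tgt is velocity * (t_tgt - t_src).
    """
    dt = t_tgt - t_src
    return [(velocity[0] * dt, velocity[1] * dt, velocity[2] * dt)
            for _ in range(n_points)]


def compose(base: List[Point], delta: List[Point]) -> List[Point]:
    """Apply P^{t_i -> tau} = P^{t_i} + Delta P^{t_i -> tau} point by point."""
    if len(base) != len(delta):
        raise ValueError("base geometry and displacement must have the same length")
    return [(b[0] + d[0], b[1] + d[1], b[2] + d[2]) for b, d in zip(base, delta)]


# Made-up base geometry for one source frame observed at t_i = 2.0.
base_geometry: List[Point] = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
velocity: Point = (0.5, 0.0, 0.0)

# Query the same frame at two target timestamps without recomputing the base.
same_time = compose(base_geometry, toy_displacement(velocity, 2.0, 2.0, 2))
later = compose(base_geometry, toy_displacement(velocity, 2.0, 4.0, 2))

print(same_time)
print(later)
```

In this toy setup, the query at $\tau=t_i$ returns the base geometry unchanged, while the query at a later timestamp shifts each point by the accumulated displacement. The base term is computed once per source frame and only the displacement depends on the target time.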
