Reconstructing 4D Spatial Intelligence: A Survey

Yukang Cao, Jiahao Lu, Zhisheng Huang, Zhuowen Shen, Chengfeng Zhao, Fangzhou Hong, Zhaoxi Chen, Xin Li, Wenping Wang, Yuan Liu, Ziwei Liu

TL;DR

This survey organizes 4D spatial intelligence from video into five progressive levels, spanning low-level cues to physics-constrained dynamics. It synthesizes advances in depth, pose, and 3D tracking (Level 1), scene representations and large-scale reconstructions (Level 2), dynamic 4D scenes (Level 3), interactions among scene components (Level 4), and physically grounded modeling (Level 5). By mapping representative methods, datasets, and architectural trends (e.g., NeRF, 3D Gaussian Splatting, SMPL-based modeling, and differentiable physics), the paper highlights current capabilities and critical gaps. The authors also articulate challenges and future directions to push toward richer, physically plausible 4D worlds for applications in AR/VR, embodied AI, and robotics, and provide an up-to-date project resource for ongoing developments.

Abstract

Reconstructing 4D spatial intelligence from visual observations has long been a central yet challenging task in computer vision, with broad real-world applications. These range from entertainment domains like movies, where the focus is often on reconstructing fundamental visual elements, to embodied AI, which emphasizes interaction modeling and physical realism. Fueled by rapid advances in 3D representations and deep learning architectures, the field has evolved quickly, outpacing the scope of previous surveys. Additionally, existing surveys rarely offer a comprehensive analysis of the hierarchical structure of 4D scene reconstruction. To address this gap, we present a new perspective that organizes existing methods into five progressive levels of 4D spatial intelligence: (1) Level 1 -- reconstruction of low-level 3D attributes (e.g., depth, pose, and point maps); (2) Level 2 -- reconstruction of 3D scene components (e.g., objects, humans, structures); (3) Level 3 -- reconstruction of 4D dynamic scenes; (4) Level 4 -- modeling of interactions among scene components; and (5) Level 5 -- incorporation of physical laws and constraints. We conclude the survey by discussing the key challenges at each level and highlighting promising directions for advancing toward even richer levels of 4D spatial intelligence. To track ongoing developments, we maintain an up-to-date project page: https://github.com/yukangcao/Awesome-4D-Spatial-Intelligence.

Paper Structure

This paper contains 28 sections, 5 equations, 8 figures.

Figures (8)

  • Figure 1: Classification of 4D spatial intelligence by level. Specifically, in this survey, we categorize the methods of reconstructing 4D spatial intelligence from video into five levels: (1) low-level 3D cues, (2) 3D scene components, (3) 4D dynamic scenes, (4) modeling of interactions among scene components, and (5) incorporation of physical laws and constraints.
  • Figure 2: The paradigms of methods for reconstructing low-level cues from video input. (I) Recent video-based depth reconstruction methods leverage diffusion models to obtain depth maps; (II) methods for reconstructing camera pose from video input typically employ a neural network to infer the camera pose from encoded image features; (III) 3D tracking methods use point trackers and transformers to achieve 3D tracking from video input; (IV) recent methods, such as VGGT, apply DINO to extract features and then train transformer-based DPT heads to infer unified 3D attributes. "Enc.", "Dec.", "Spt. Grid", "Qry. Points", and "Cam." denote "Encoder", "Decoder", "Supporting Grid", "Query Points", and "Camera Head", respectively.
  • Figure 3: The paradigms of methods for reconstructing 3D scene components from video input. 3D reconstruction methods for small-scale and large-scale scenes often share similar architectures, differing primarily in the spatial extent they handle. As shown in the left panel (image source: MipNeRF360 [mipnerf3602022]), small-scale scenes correspond to the unaffected domain, while large-scale scenes additionally incorporate a contracted domain. Examples illustrating both scene types are provided in the right panel.
  • Figure 4: The paradigms of methods for reconstructing dynamic scenes from video input. Methods in this domain typically adopt one of two strategies for temporal modeling: (I) explicitly incorporating time as an additional input to extend a static 3D representation, or (II) reconstructing a canonical 3D space and learning its deformation over time. "Def." denotes "Deformation module".
  • Figure 5: Illustrations of methods for reconstructing 4D dynamic humans from video input. Human-centric dynamic modeling approaches are generally categorized by their representations: (I) methods that use the SMPL parametric model as their representation to derive human pose and shape parameters (image source: Neural Body Fitting [omran2018neural]), (II) methods that similarly use SMPL but focus on prediction from egocentric videos (image source: EgoAllo [yi2025egoallo]), and (III) appearance-rich non-parametric methods that are capable of reconstructing textured topologies, such as garments and accessories, from video data (image source: Neural Body [peng2021neuralbody]).
  • ...and 3 more figures
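
The distinction in Figure 3 between the unaffected domain and the contracted domain can be made concrete with a small example. The sketch below assumes the scene contraction commonly associated with MipNeRF360: points inside the unit ball are left unchanged, and points outside it are squashed into a ball of radius 2. The function name `contract` is illustrative and does not come from the survey, and this is not presented as the implementation used by any method it covers.

```python
import math


def contract(point):
    """Contract a 3D point so that unbounded space fits inside a ball of radius 2.

    Assumed formulation (MipNeRF360-style):
        contract(x) = x                              if ||x|| <= 1
        contract(x) = (2 - 1/||x||) * (x / ||x||)    otherwise
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        # Unaffected domain: small-scale content is left as-is.
        return tuple(float(c) for c in point)
    # Contracted domain: distant content is compressed toward radius 2.
    scale = (2.0 - 1.0 / norm) / norm
    return tuple(scale * c for c in point)


if __name__ == "__main__":
    for p in [(0.5, 0.0, 0.0), (2.0, 0.0, 0.0), (1000.0, 0.0, 0.0)]:
        print(p, "->", contract(p))
```

Under this assumption, a bounded object near the origin is represented at full resolution, while arbitrarily distant background is mapped to a thin shell just inside radius 2, which is one way a single architecture can serve both small-scale and large-scale scenes.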