SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

Weihong Pan, Xiaoyu Zhang, Zhuang Zhang, Zhichao Ye, Nan Wang, Haomin Liu, Guofeng Zhang

Abstract

High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. The reliance on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches.

Paper Structure

This paper contains 29 sections, 11 equations, 19 figures, 12 tables, 1 algorithm.

Figures (19)

  • Figure 1: Novel view rendering comparison. With as few as 2-3 cameras, our approach reconstructs high-quality dynamic scenes with spatio-temporal consistency and photorealistic quality. Please refer to our project page for additional dynamic results.
  • Figure 2: Spatio-temporal inconsistency. Real cameras (grey) capture consistent content of a multi-view dynamic scene, while generative results (orange) include additional observations at different poses and times. Inconsistencies across poses at the same time are referred to as spatial inconsistencies, and inconsistencies across time at the same pose are referred to as temporal inconsistencies.
  • Figure 3: Method overview. Given a generated frame at temporal index $t$ and pose index $s$, each 4D Gaussian at $c=(x,y,z)$ is projected onto the planes of the Spatio-Temporal Distortion Field to obtain deformation features, which are then decoded by a small MLP to produce the deformation values. We use separate photometric losses for real and generated frames, and additionally introduce regularization terms on pose, feature plane, and spatial smoothness to enhance optimization stability. A minimal code sketch of this plane-based lookup is given after the figure list.
  • Figure 4: Qualitative comparisons of different methods on the Technicolor [sabater2017dataset], Neural 3D Video [li2022neural], and NVIDIA Dynamic Scenes [yoon2020novel] datasets. We conduct comparisons with representative dynamic scene reconstruction methods: MonoFusion [wang2025monofusion], 4DGS [wu20244d], 4D-Rotor [duan20244d], and Realtime4DGS [yang2023real]. MonoFusion$^*$ is our reproduced version. Our method significantly outperforms other baselines, producing visually reliable results with sharper details. Please zoom in for more details. Additional qualitative comparisons are included in the supplementary material.
  • Figure 5: Spatio-Temporal Consistency. Rendering results (top) and space-time slices (bottom), constructed by concatenating the red pixel locations across all time steps, demonstrate that direct reconstruction from diffusion observations leads to severe blur and temporal instability (e.g., the moving hand at the bottom right).
  • ...and 14 more figures
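
To make the mechanism in Figure 3 concrete, the following is a minimal, self-contained sketch of a plane-based distortion-field lookup. The layout is an assumption on our part: it uses a K-Planes-style factorization with six 2D feature planes pairing each spatial coordinate (x, y, z) with the temporal index t and the pose index s, fuses the sampled features by element-wise product, and decodes them with a small MLP into a 3D deformation per Gaussian. The class name, plane set, resolutions, feature dimensions, and output parameterization are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalDistortionField(nn.Module):
    """Hypothetical plane-based distortion field (names and layout assumed)."""

    def __init__(self, feat_dim=16, res=64, n_time=30, n_pose=8):
        super().__init__()
        # One 2D feature plane per coordinate pair: each spatial axis
        # crossed with the temporal index t and the pose index s.
        self.pairs = [("x", "t"), ("y", "t"), ("z", "t"),
                      ("x", "s"), ("y", "s"), ("z", "s")]
        self.planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(1, feat_dim, res, res))
             for _ in self.pairs])
        # Small MLP decoding the fused plane features into a per-Gaussian
        # deformation (simplified here to a 3D position offset).
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 3))
        self.n_time, self.n_pose = n_time, n_pose

    def forward(self, centers, t, s):
        """centers: (N, 3) Gaussian centers in [-1, 1]; t, s: integer indices."""
        ones = centers.new_ones(centers.shape[0])
        coords = {
            "x": centers[:, 0], "y": centers[:, 1], "z": centers[:, 2],
            "t": ones * (2.0 * t / max(self.n_time - 1, 1) - 1.0),
            "s": ones * (2.0 * s / max(self.n_pose - 1, 1) - 1.0),
        }
        fused = None
        for (a, b), plane in zip(self.pairs, self.planes):
            # Bilinear lookup of per-Gaussian features on each 2D plane.
            grid = torch.stack([coords[a], coords[b]], dim=-1).view(1, -1, 1, 2)
            feat = F.grid_sample(plane, grid, align_corners=True)  # (1, C, N, 1)
            feat = feat[0, :, :, 0].t()                            # (N, C)
            fused = feat if fused is None else fused * feat        # Hadamard fusion (assumed)
        return self.decoder(fused)  # (N, 3) deformation values


# Example: deform 1000 Gaussian centers observed at time index 5, pose index 2.
field = SpatioTemporalDistortionField()
delta = field(torch.rand(1000, 3) * 2 - 1, t=5, s=2)  # -> (1000, 3)
```

The separate photometric losses for real and generated frames, and the pose, feature-plane, and spatial-smoothness regularizers mentioned in the Figure 3 caption, are omitted from this sketch.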