ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors
Zhenxiao Liang, Ning Zhang, Youbao Tang, Ruei-Sung Lin, Qixing Huang, Peng Chang, Jing Xiao
TL;DR
ShapeGaussian is a template-free pipeline for high-fidelity 4D human reconstruction from monocular videos that fuses vision priors with a dynamic Gaussian Splatting representation. It initializes a coarse geometry from data-driven priors and refines it with a neural deformation field, using multiple reference frames to handle fast motion and keypoints that are not visible in every frame. The method employs shape-aware initialization, depth alignment, and synchronized density control across frames, achieving robust, photorealistic reconstructions that outperform template-based approaches on benchmark datasets. This approach reduces artifacts caused by pose estimation errors and extends the reliability of monocular 4D reconstruction to more diverse, casual videos with complex deformations.
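The summary above names depth alignment as one ingredient but does not say how it is performed. As a hedged illustration only, and not the authors' stated procedure, a common way to align a relative monocular depth map to a reference depth is a least-squares fit of one scale and one shift. The function name `align_depth` and its parameters are hypothetical.

```python
import numpy as np

def align_depth(pred, ref, mask=None):
    """Fit scale s and shift t minimizing ||s * pred + t - ref||^2.

    Assumption: a global scale-and-shift model relates the two depth maps.
    This is a generic technique, not necessarily the one used by ShapeGaussian.
    """
    pred = np.asarray(pred, dtype=np.float64)
    ref = np.asarray(ref, dtype=np.float64)
    if mask is None:
        # Use every pixel where both depth maps hold finite values.
        mask = np.isfinite(pred) & np.isfinite(ref)
    x = pred[mask]
    y = ref[mask]
    # Design matrix [x, 1] so the unknowns are (s, t).
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t, s, t
```

In practice the mask would typically be restricted to pixels with trustworthy reference depth, and a robust loss may be preferable when the reference contains outliers.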
Abstract
We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods such as 4DGS, which lack robust vision priors, struggle to capture highly deformable human motion without multi-view cues. Template-based approaches such as HUGS, which rely primarily on SMPL, can produce photorealistic results but are highly susceptible to errors in human pose estimation, which often lead to unrealistic artifacts. In contrast, ShapeGaussian integrates template-free vision priors to achieve scene reconstructions that are both high-fidelity and robust. Our method follows a two-step pipeline. First, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry with a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate the artifacts that erroneous pose estimation causes in template-based methods, and we employ multiple reference frames to resolve the invisibility of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.
