ShapeGaussian: High-Fidelity 4D Human Reconstruction in Monocular Videos via Vision Priors

Zhenxiao Liang, Ning Zhang, Youbao Tang, Ruei-Sung Lin, Qixing Huang, Peng Chang, Jing Xiao

TL;DR

ShapeGaussian introduces a template-free pipeline for high-fidelity 4D human reconstruction from monocular videos by fusing vision priors with a dynamic Gaussian Splatting representation. It initializes a coarse geometry from data-driven priors and refines it with a neural deformation field, using multiple reference frames to handle invisibility and fast motion. The method employs shape-aware initialization, depth alignment, and synchronized density control across frames, achieving robust, photorealistic reconstructions that outperform template-based approaches on benchmark datasets. It reduces artifacts caused by pose estimation errors and makes monocular 4D reconstruction more reliable on diverse, casual videos with complex deformations.

Abstract

We introduce ShapeGaussian, a high-fidelity, template-free method for 4D human reconstruction from casual monocular videos. Generic reconstruction methods lacking robust vision priors, such as 4DGS, struggle to capture high-deformation human motion without multi-view cues. While template-based approaches, primarily relying on SMPL, such as HUGS, can produce photorealistic results, they are highly susceptible to errors in human pose estimation, often leading to unrealistic artifacts. In contrast, ShapeGaussian effectively integrates template-free vision priors to achieve both high-fidelity and robust scene reconstructions. Our method follows a two-step pipeline: first, we learn a coarse, deformable geometry using pretrained models that estimate data-driven priors, providing a foundation for reconstruction. Then, we refine this geometry using a neural deformation model to capture fine-grained dynamic details. By leveraging 2D vision priors, we mitigate artifacts from erroneous pose estimation in template-based methods and employ multiple reference frames to resolve the invisibility issue of 2D keypoints in a template-free manner. Extensive experiments demonstrate that ShapeGaussian surpasses template-based methods in reconstruction accuracy, achieving superior visual quality and robustness across diverse human motions in casual monocular videos.

Paper Structure

This paper contains 20 sections, 15 equations, 7 figures, and 4 tables.

Figures (7)

  • Figure 1: We propose ShapeGaussian to achieve high-fidelity dynamic scene reconstruction from monocular, human-centric videos in a template-free manner. Generic reconstruction methods without robust vision priors, such as 4DGS [wu_4d_2023], struggle to capture high-deformation human motion without multi-view cues. Although template-based approaches (primarily built on SMPL [loper_smpl_2015]) such as HUGS [kocabas_hugs_2023] can produce photorealistic results, they are susceptible to errors in human pose estimation, often leading to unrealistic artifacts that compromise applicability. In contrast, our method delivers both high-fidelity and high-quality scene reconstructions by effectively incorporating template-free vision priors.
  • Figure 2: Overview of our proposed method. The deformation network and the 3D Gaussians in the reference frames are first initialized, as described in the paper's initialization subsection. A joint optimization then refines both the deformation network and the 3D Gaussians, with synchronized density control across all frames. Yellow boxes denote input elements, orange boxes indicate intermediate results, and green boxes represent the actions being taken.
  • Figure 3: Visualization of depth alignment.
  • Figure 4: Comparison of keypoint initializations. Erroneous keypoint initialization is the primary factor degrading the reconstruction performance of Shape-of-Motion [wang_shape_2024] on the NeuMan dataset. Note that, for our method, we visualize only the keypoints in the other frame that remain visible in the reference frame.
  • Figure 5: Qualitative comparison of baselines and our method on a real dataset of casual monocular videos.
  • ...and 2 more figures
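
Figure 3 refers to depth alignment, but this excerpt does not say how the alignment is computed. One widely used generic approach for bringing a relative monocular depth prediction into agreement with reference depths is a least-squares fit of a single scale and shift. The sketch below illustrates that generic technique only; it is an assumption and not necessarily the procedure used in the paper. The function names `fit_scale_shift` and `align_depth` and the sample values are hypothetical.

```python
def fit_scale_shift(pred, target):
    """Return (s, t) minimizing sum((s * p + t - z) ** 2) over paired samples.

    pred:   relative depth values from a monocular estimator (assumed input)
    target: reference depth values at the same pixels (assumed input)
    """
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need at least two paired depth samples")
    mean_p = sum(pred) / n
    mean_z = sum(target) / n
    # Closed-form solution of 1D linear regression (normal equations).
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("predicted depths are constant; scale is undefined")
    cov = sum((p - mean_p) * (z - mean_z) for p, z in zip(pred, target))
    s = cov / var_p
    t = mean_z - s * mean_p
    return s, t


def align_depth(pred, s, t):
    """Apply the fitted scale and shift to every predicted depth value."""
    return [s * p + t for p in pred]


# Hypothetical example: the reference depths are an exact affine map of the
# relative depths, so the fit should recover that map.
relative = [0.1, 0.4, 0.7, 1.0]
reference = [2.0 * p + 0.5 for p in relative]
scale, shift = fit_scale_shift(relative, reference)
aligned = align_depth(relative, scale, shift)
```

In practice the fit would be restricted to pixels where both depth sources are valid, and a robust loss may be preferable when the reference depths contain outliers.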