PoseGaussian: Pose-Driven Novel View Synthesis for Robust 3D Human Reconstruction

Ju Shen, Chen Chen, Tam V. Nguyen, Vijayan K. Asari

TL;DR

PoseGaussian addresses the challenge of robust, real-time novel-view synthesis of dynamic humans by integrating pose as a structural prior and a temporal cue into a Gaussian Splatting pipeline. The method fuses pose heatmaps with image features for depth inference and uses a Temporal Pose Stabilizer to maintain temporal coherence, resulting in state-of-the-art perceptual and structural quality while delivering around 100 FPS. Key contributions include pose-guided depth fusion, temporal pose stabilization, and a pose-conditioned loss that aligns fused features with pose encodings, enabling robust generalization across datasets.
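
To make the phrase "fuses pose heatmaps with image features" concrete, here is a minimal sketch under stated assumptions. It renders one Gaussian heatmap per 2D keypoint and stacks the heatmaps with a feature map along the channel axis. The function names, the `sigma` value, and fusion by plain concatenation are illustrative assumptions; the summary does not specify how PoseGaussian performs the fusion.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=2.0):
    """Render one Gaussian heatmap per 2D keypoint given as (x, y) pixels."""
    ys, xs = np.mgrid[0:height, 0:width]  # pixel grids, each of shape (H, W)
    heatmaps = np.empty((len(keypoints), height, width), dtype=np.float32)
    for k, (x, y) in enumerate(keypoints):
        sq_dist = (xs - x) ** 2 + (ys - y) ** 2
        heatmaps[k] = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return heatmaps

def fuse_by_concatenation(image_features, heatmaps):
    """Hypothetical fusion: stack channels to get shape (C + K, H, W)."""
    return np.concatenate([image_features, heatmaps], axis=0)

# Toy inputs: two keypoints and an 8-channel placeholder feature map.
keypoints = [(4, 3), (10, 12)]
image_features = np.zeros((8, 16, 16), dtype=np.float32)

heatmaps = keypoints_to_heatmaps(keypoints, 16, 16)
fused = fuse_by_concatenation(image_features, heatmaps)
print(fused.shape)
```

Each heatmap peaks at its keypoint location, so a downstream depth network receives an explicit spatial hint about where joints lie. A learned fusion layer could replace the concatenation in this sketch.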

Abstract

We propose PoseGaussian, a pose-guided Gaussian Splatting framework for high-fidelity human novel view synthesis. Human body pose serves a dual purpose in our design: as a structural prior, it is fused with a color encoder to refine depth estimation; as a temporal cue, it is processed by a dedicated pose encoder to enhance temporal consistency across frames. These components are integrated into a fully differentiable, end-to-end trainable pipeline. Unlike prior works that use pose only as a condition or for warping, PoseGaussian embeds pose signals into both geometric and temporal stages to improve robustness and generalization. It is specifically designed to address challenges inherent in dynamic human scenes, such as articulated motion and severe self-occlusion. Notably, our framework achieves real-time rendering at 100 FPS, maintaining the efficiency of standard Gaussian Splatting pipelines. We validate our approach on ZJU-MoCap, THuman2.0, and in-house datasets, demonstrating state-of-the-art performance in perceptual quality and structural accuracy (PSNR 30.86, SSIM 0.979, LPIPS 0.028).
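
For readers less familiar with the reported metrics, the following sketch shows how PSNR is conventionally computed between a reference image and a rendered image. It assumes pixel values normalized to [0, 1] (`max_value=1.0`); the paper's exact evaluation protocol (masking, cropping, averaging across views) is not stated here, so this illustrates the standard definition rather than the authors' evaluation code.

```python
import numpy as np

def psnr(reference, rendered, max_value=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    reference = np.asarray(reference, dtype=np.float64)
    rendered = np.asarray(rendered, dtype=np.float64)
    mse = np.mean((reference - rendered) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy example: a uniform error of 0.1 gives an MSE of 0.01.
reference = np.zeros((4, 4))
rendered = np.full((4, 4), 0.1)
value = psnr(reference, rendered)
print(round(value, 2))
```

Under the same [0, 1] normalization, a PSNR of 30.86 dB on a single image would correspond to a mean squared error of roughly 8.2e-4.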

Paper Structure (10 sections, 9 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: PoseGaussian Visualization and Comparison. Left: reconstructed pose-guided motion sequence reprojected into the original scene. Middle: visual comparison with GauHuman [hu2024gauhuman]. Right: performance chart comparing selected methods on three metrics (LPIPS, SSIM, PSNR). Compared methods: GauHuman [hu2024gauhuman], AS [zhou2024animatable], InstantNVR [geng2023learning], HumanNeRF [Zhao_2022_CVPR], NB [peng2023implicit], AN [peng2021animatable], NHP [kwon2021neural], InstantAvatar [jiang2023instantavatar], GP-NeRF [chen2022gpnerf].
  • Figure 2: The PoseGaussian pipeline. Top: The overall workflow, illustrating the process from input color images to the predicted Gaussian parameter maps, specifically the rotation $\mathcal{M}_r(x)$, scale $\mathcal{M}_s(x)$, and opacity $\mathcal{M}_\alpha(x)$. Bottom: A detailed view of the Temporal Pose Stabilizer (TPS) module, along with visual annotations clarifying the roles of various modules and connections.
  • Figure 3: Challenging cases in pose synthesis. Top: occlusion scenario revealing difficult views of hidden regions (e.g., back, inner arms). Bottom: fast motion scenario comparing recent NeRF- and Gaussian-based methods (see Fig. 1 for references).
  • Figure 4: (Left) Impact of pose information. (Right) Impact of pose encoder configurations.