
Deformable 3D Gaussian Splatting for Animatable Human Avatars

HyunJun Jung, Nikolas Brasch, Jifei Song, Eduardo Perez-Pellitero, Yiren Zhou, Zhihao Li, Nassir Navab, Benjamin Busam

TL;DR

ParDy-Human builds animatable human avatars from RGB input using an explicit, deformable 3D Gaussian Splatting framework. SMPL-driven per-vertex deformations move the canonical Gaussians, a Deformation Refinement Module then captures garment motion, and background separation allows training without masks. The approach renders at full resolution from few input views on consumer hardware and reports strong qualitative and quantitative results on ZJU-MoCap and THUman4.0, outperforming state-of-the-art baselines in perceptual quality while remaining efficient.

Abstract

Recent advances in neural radiance fields enable novel view synthesis of photo-realistic images in dynamic settings, which can be applied to scenarios with human animation. The implicit backbones commonly used to build accurate models, however, require many input views and additional annotations such as human masks, UV maps and depth maps. In this work, we propose ParDy-Human (Parameterized Dynamic Human Avatar), a fully explicit approach to construct a digital avatar from as little as a single monocular sequence. ParDy-Human introduces parameter-driven dynamics into 3D Gaussian Splatting, where 3D Gaussians are deformed by a human pose model to animate the avatar. Our method is composed of two parts: a first module that deforms canonical 3D Gaussians according to SMPL vertices, and a subsequent module that takes our designed joint encodings and predicts per-Gaussian deformations to handle dynamics beyond SMPL vertex deformations. Images are then synthesized by a rasterizer. ParDy-Human constitutes an explicit model for realistic dynamic human avatars that requires significantly fewer training views and images. Our avatar learning is free of additional annotations such as masks and can be trained with variable backgrounds, while inferring full-resolution images efficiently even on consumer hardware. We provide experimental evidence that ParDy-Human outperforms state-of-the-art methods on the ZJU-MoCap and THUman4.0 datasets both quantitatively and visually.
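The two-part deformation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that each canonical Gaussian follows a single SMPL vertex (`parent_idx`), that a rigid canonical-to-posed transform is already available for every vertex (`vert_rot`, `vert_trans`), and it stands in for the learned refinement with an arbitrary callable (`residual_fn`). All of these names are hypothetical, and only the Gaussian centres are moved; covariances and colours are left out.

```python
import numpy as np

def deform_gaussians(means, parent_idx, vert_rot, vert_trans, residual_fn=None):
    """Move canonical Gaussian centres into the posed space (illustrative only).

    means:       (G, 3)    canonical Gaussian centres
    parent_idx:  (G,)      SMPL vertex each Gaussian follows (assumed assignment)
    vert_rot:    (V, 3, 3) per-vertex rotation, canonical -> posed (assumed given)
    vert_trans:  (V, 3)    per-vertex translation, canonical -> posed (assumed given)
    residual_fn: optional callable mapping (G, 3) positions to (G, 3) offsets;
                 a placeholder for a learned refinement module
    """
    R = vert_rot[parent_idx]                         # (G, 3, 3)
    t = vert_trans[parent_idx]                       # (G, 3)
    # Stage 1: coarse motion inherited from the parent vertex.
    posed = np.einsum("gij,gj->gi", R, means) + t
    # Stage 2: per-Gaussian residual on top of the coarse motion.
    if residual_fn is not None:
        posed = posed + residual_fn(posed)
    return posed

# Toy data: two vertices, one translated along z, one rotated 90 degrees about z.
rot_z90 = np.array([[0.0, -1.0, 0.0],
                    [1.0,  0.0, 0.0],
                    [0.0,  0.0, 1.0]])
vert_rot = np.stack([np.eye(3), rot_z90])
vert_trans = np.array([[0.0, 0.0, 1.0],
                       [0.0, 0.0, 0.0]])
means = np.array([[1.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0]])
parent_idx = np.array([0, 1])

coarse = deform_gaussians(means, parent_idx, vert_rot, vert_trans)
refined = deform_gaussians(means, parent_idx, vert_rot, vert_trans,
                           residual_fn=lambda p: np.full_like(p, 0.1))
```

Keeping the coarse step separate from the residual makes it possible to inspect how much of the final motion comes from the body model alone and how much is added by the refinement.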

Paper Structure

This paper contains 27 sections, 8 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: ParDy-Human constitutes an explicit dynamic human avatar that can be re-posed via SMPL [loper2015smpl] parameters. It builds on a deformable version of 3D Gaussian Splatting [kerbl20233d]. Unlike existing implicit methods, ParDy-Human can be trained with significantly fewer camera views and fewer human poses. Although it needs no ground-truth masks for training, it generalizes well to novel human poses, as shown in the reposed results above on individuals from the ZJU-MoCap [peng2021neural] and THUman4.0 [zheng2022structured] datasets.
  • Figure 2: Overview of Avatar Generation Framework. (a) ParDy-Human starts by initializing Gaussians on a sphere for the background and on a canonical SMPL [loper2015smpl] mesh for the human. (b) The Gaussians are updated during training. (c) For inference, the background Gaussians are removed, leaving only the avatar. (d) The canonical human Gaussians are deformed according to the SMPL vertex deformations and a learned residual refinement. (e) The deformed Gaussians are rasterized to synthesize an output image under a given pose.
  • Figure 3: Training Pipeline Overview. ParDy-Human is a fully explicit animatable human representation based on 3D Gaussians [kerbl20233d]. Information from images $I_j$, $1\leq j \leq t$, of the $n$-th camera $\text{Cam}_n$ is integrated into the avatar using the camera pose $T_{Cam}(j)$, the human shape parameters $\beta$, and the pose parameters $\theta_{j}$ (left). Correspondences between Gaussians of the canonical and posed human are established by a Per Vertex Deformation Module (centre to left, black arrows). Residual corrections of the Gaussians are performed by a Deformation Refinement Module (DRM) (centre to right, black arrows) before image synthesis through rasterization (right). The rendered output is then compared to the ground-truth input images to calculate gradients and update both the DRM and the human avatar (orange arrows).
  • Figure 4: Inference Pipeline Overview. At inference time, we first filter out the background Gaussians (left) and then deform the canonical human avatar. A coarse deformation using SMPL [loper2015smpl] parameters is applied first, followed by the DRM correction (centre to right). The output is an animated human without background (right).
  • Figure 5: Issues in Datasets. Some scenes in the ZJU dataset [peng2021neural] suffer from inaccurate extrinsic calibration: the SMPL meshes rendered onto these images are not consistent across cameras (a). The masks in the THUman dataset [zheng2022structured], on the other hand, suffer from over- and under-segmentation artifacts depending on the lighting conditions (b).
  • ...and 8 more figures
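
The initialization and background-removal steps named in the Figure 2 and Figure 4 captions can be sketched as follows. This is a hedged illustration rather than the paper's code. It assumes a boolean per-Gaussian label (`is_human`) is what separates avatar from background, it uses random points as a placeholder for the canonical SMPL mesh vertices, and the sphere radius and point counts are arbitrary. The helper `sample_sphere` and all variable names are hypothetical.

```python
import numpy as np

def sample_sphere(n, radius, rng):
    """Draw n points uniformly on a sphere of the given radius."""
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)  # project onto the unit sphere
    return radius * v

rng = np.random.default_rng(0)

# Placeholder for canonical SMPL mesh vertices (a real mesh would be loaded instead).
human_pts = rng.uniform(-1.0, 1.0, size=(100, 3))
# Background Gaussians start on a sphere surrounding the subject (radius is arbitrary).
bg_pts = sample_sphere(500, radius=10.0, rng=rng)

# One shared set of Gaussian centres, with a label marking which ones form the avatar.
means = np.concatenate([human_pts, bg_pts], axis=0)
is_human = np.concatenate([np.ones(len(human_pts), dtype=bool),
                           np.zeros(len(bg_pts), dtype=bool)])

# At inference, drop the background and keep only the avatar Gaussians.
avatar_means = means[is_human]
```

Storing the label alongside the Gaussians keeps both groups in one array during training, so removing the background at inference reduces to a single boolean mask.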