Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars

Yang Liu, Xiang Huang, Minghan Qin, Qinwei Lin, Haoqian Wang

TL;DR

This paper introduces Animatable 3D Gaussian, a dynamic human representation that extends static 3D Gaussians to moving subjects and multi-human scenes by binding the Gaussians to a canonical skeleton and deforming them to posed space via linear blend skinning. A multi-head hash encoder captures pose-dependent shape and appearance, and a time-dependent ambient occlusion module models dynamic shadows, enabling high-quality novel view and novel pose synthesis. Compared with InstantAvatar, the method trains in about 1/60 of the time, uses about 1/4 of the GPU memory, and renders about 7x faster, and it scales to multi-human scenes (e.g., a ten-person scene trained in around 25 seconds). Experiments on monocular and multi-view datasets, including a new GalaBasketball dataset, show higher reconstruction quality than InstantAvatar, robustness to dynamic illumination, and applicability to real-time rendering in complex scenes.

Abstract

Neural radiance fields are capable of reconstructing high-quality drivable human avatars but are expensive to train and render and are not suitable for multi-human scenes with complex shadows. To reduce these costs, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space and deforming the 3D Gaussians to posed space according to the input poses. We introduce a multi-head hash encoder for pose-dependent shape and appearance and a time-dependent ambient occlusion module to achieve high-quality reconstructions in scenes containing complex motions and dynamic shadows. On both novel view synthesis and novel pose synthesis tasks, our method achieves higher reconstruction quality than InstantAvatar with less training time (1/60), less GPU memory (1/4), and faster rendering speed (7x). Our method can be easily extended to multi-human scenes and achieves comparable novel view synthesis results on a scene with ten people in only 25 seconds of training.
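
The deformation of skinned 3D Gaussians from canonical to posed space can be illustrated with standard linear blend skinning. The sketch below is not the authors' implementation. It assumes each bone carries a 3x3 rotation and a translation, blends them with the Gaussian's skinning weights, applies the blended transform to the center and rotation, and applies the inverse blended rotation to a posed-space view direction. All function and variable names (`blend`, `deform_gaussian`, `view_dir_to_canonical`, `x_c`, `R_c`, `d_t`) are hypothetical.

```python
# Illustrative sketch only: names and conventions are assumptions,
# not taken from the authors' code.

def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]

def mat_mul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def blend(weights, bone_rots, bone_trans):
    """Blend per-bone rotations and translations with skinning weights."""
    S = [[sum(w * R[i][j] for w, R in zip(weights, bone_rots))
          for j in range(3)] for i in range(3)]
    T = [sum(w * t[i] for w, t in zip(weights, bone_trans)) for i in range(3)]
    return S, T

def inverse3(m):
    """Invert a 3x3 matrix via the adjugate (assumes it is non-singular)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]

def deform_gaussian(x_c, R_c, weights, bone_rots, bone_trans):
    """Map a canonical center x_c and rotation R_c to posed space."""
    S, T = blend(weights, bone_rots, bone_trans)
    x_t = [p + t for p, t in zip(mat_vec(S, x_c), T)]
    R_t = mat_mul(S, R_c)
    return x_t, R_t

def view_dir_to_canonical(d_t, weights, bone_rots, bone_trans):
    """Map a posed-space view direction back with the inverse blended rotation."""
    S, _ = blend(weights, bone_rots, bone_trans)
    return mat_vec(inverse3(S), d_t)
```

For example, a Gaussian fully bound to a bone that rotates 90 degrees about the z-axis and translates by one unit along x has its center `[1, 0, 0]` moved to `[1, 1, 0]`, and a posed-space view direction `[0, 1, 0]` maps back to `[1, 0, 0]` in canonical space. Evaluating view-dependent color in canonical space this way is one plausible reason for inverting the view direction rather than rotating the spherical harmonics.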

Paper Structure (14 sections, 19 equations, 8 figures, 4 tables)

This paper contains 14 sections, 19 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Overview. The proposed animatable 3D Gaussian representation consists of a set of skinned 3D Gaussians and a corresponding canonical skeleton. Each skinned 3D Gaussian contains a center $x_0$, rotation $R$, scale $S$, opacity $\alpha_0$, and skinning weights $\mathbf{w}$. First, we sample spherical harmonic coefficients $SH$, vertex displacement $\delta_x$, and ambient occlusion $ao$ from the multi-head hash-encoded parameter field according to the center $x_0$, where the multilayer perceptron for $ao$ requires an additional frequency-encoded time $\gamma(t)$ as input. Next, we concatenate the sampled parameters, the original parameters, and a shifted center $x_0'$ in canonical space. Finally, we deform the 3D Gaussians to the posed space according to the input pose $S_t, T_t$ and render them to the image using 3D Gaussian rasterization [3D-GS].
  • Figure 2: 3D Gaussian Deformation. The rotation $R_c$ of a 3D Gaussian in canonical space is deformed into the posed space using the linear blend skinning equation for $R$, while the view direction $d_t$ undergoes the inverse transformation given by the inverse linear blend skinning equation for $d$.
  • Figure 3: Qualitative Results on the PeopleSnapshot [alldieck2018detailed] Dataset. We show the image quality of our method and InstantAvatar [instantavatar] at 5s and 30s of training time. Compared to InstantAvatar, our method achieves higher reconstruction quality and a significant reduction in artifacts.
  • Figure 4: Ablation Study of Hash-Encoded Vertex Displacement. Without hash-encoded vertex displacement (w/o hash-vd), the centers of 3D Gaussians may diverge during the optimization process, while our hash-encoded vertex displacement (w/ hash-vd) converges to the ground truth shape.
  • Figure 5: Novel View Synthesis Results on Single-Human Scenes of the GalaBasketball Dataset. We show the novel view synthesis quality of our method (with hash-encoded spherical harmonic coefficients) and InstantAvatar [instantavatar].
  • ...and 3 more figures
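
The Figure 1 caption states that the ambient occlusion MLP takes a frequency-encoded time $\gamma(t)$ as an extra input. The exact form of $\gamma$ and the number of frequencies are not given in this summary, so the sketch below assumes the common sinusoidal encoding with frequencies $2^k \pi$. The names `frequency_encode`, `ao_mlp_input`, and `num_freqs` are hypothetical, and the hash-encoded features are stand-in values rather than the output of a real hash grid.

```python
import math

def frequency_encode(t, num_freqs=4):
    """Return [sin(2^k*pi*t), cos(2^k*pi*t)] for k = 0..num_freqs-1.

    Assumed sinusoidal form; the paper's exact gamma(t) may differ.
    """
    features = []
    for k in range(num_freqs):
        angle = (2.0 ** k) * math.pi * t
        features.extend([math.sin(angle), math.cos(angle)])
    return features

def ao_mlp_input(hash_features, t, num_freqs=4):
    """Concatenate (hypothetical) hash-encoded features with gamma(t)."""
    return list(hash_features) + frequency_encode(t, num_freqs)
```

With `num_freqs=4`, each time value expands to 8 features, so a 2-dimensional hash feature would yield a 10-dimensional input to the ambient occlusion MLP under these assumptions.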