Animatable 3D Gaussian: Fast and High-Quality Reconstruction of Multiple Human Avatars
Yang Liu, Xiang Huang, Minghan Qin, Qinwei Lin, Haoqian Wang
TL;DR
This paper introduces Animatable 3D Gaussian, a dynamic human representation that extends static 3D Gaussians to moving, multi-person scenes by binding Gaussians to a skeleton in canonical space and deforming them to posed space via linear blend skinning. A multi-head hash encoder captures pose-dependent shape and appearance, and a time-dependent ambient occlusion module models dynamic shadows, enabling high-quality novel view and novel pose synthesis with fast training and rendering. Compared with InstantAvatar, the approach needs about 1/60 of the training time and about 1/4 of the GPU memory, and renders about 7x faster; it also scales to multi-human scenes (e.g., a ten-person scene trained in around 25 seconds). Experiments on monocular and multi-view datasets, including a new GalaBasketball dataset, show higher reconstruction quality, robustness to dynamic illumination, and applicability to real-time rendering in complex scenes.
Abstract
Neural radiance fields can reconstruct high-quality drivable human avatars, but they are expensive to train and render and are not suitable for multi-human scenes with complex shadows. To reduce this training and rendering cost, we propose Animatable 3D Gaussian, which learns human avatars from input images and poses. We extend 3D Gaussians to dynamic human scenes by modeling a set of skinned 3D Gaussians and a corresponding skeleton in canonical space and deforming the 3D Gaussians to posed space according to the input poses. We introduce a multi-head hash encoder for pose-dependent shape and appearance and a time-dependent ambient occlusion module to achieve high-quality reconstructions in scenes containing complex motions and dynamic shadows. On both novel view synthesis and novel pose synthesis tasks, our method achieves higher reconstruction quality than InstantAvatar with less training time (1/60), less GPU memory (1/4), and faster rendering (7x). Our method extends easily to multi-human scenes and achieves comparable novel view synthesis results on a scene with ten people in only 25 seconds of training.
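The canonical-to-posed deformation described above can be illustrated with a minimal NumPy sketch of linear blend skinning applied to Gaussian centers. This is not the authors' implementation: the function name lbs_deform_centers, the array layouts, and the assumption that each bone is given as a 4x4 canonical-to-posed transform are hypothetical choices made for illustration, and the sketch covers only the centers, not rotations, covariances, the hash encoder, or ambient occlusion.

```python
import numpy as np


def lbs_deform_centers(centers, weights, bone_transforms):
    """Move canonical-space Gaussian centers to posed space via linear blend skinning.

    All names and shapes here are illustrative assumptions, not the paper's API.
      centers:         (N, 3) Gaussian centers in canonical space
      weights:         (N, B) skinning weights; each row is assumed to sum to 1
      bone_transforms: (B, 4, 4) canonical-to-posed transform for each bone
    Returns an (N, 3) array of posed-space centers.
    """
    # Blend the per-bone transforms into one 4x4 matrix per Gaussian: (N, 4, 4).
    blended = np.einsum("nb,bij->nij", weights, bone_transforms)
    # Lift centers to homogeneous coordinates: (N, 4).
    ones = np.ones((centers.shape[0], 1))
    homogeneous = np.concatenate([centers, ones], axis=1)
    # Apply each Gaussian's blended transform and drop the homogeneous coordinate.
    posed = np.einsum("nij,nj->ni", blended, homogeneous)
    return posed[:, :3]


# Toy example: bone 0 stays in place, bone 1 translates by +2 along x.
identity = np.eye(4)
shift_x = np.eye(4)
shift_x[0, 3] = 2.0
bones = np.stack([identity, shift_x])

centers = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
weights = np.array([[0.5, 0.5],   # influenced equally by both bones
                    [1.0, 0.0]])  # bound only to the static bone
posed_centers = lbs_deform_centers(centers, weights, bones)
```

In this toy setup the first center moves halfway along the translation because its weights are split evenly between the two bones, while the second center, bound only to the static bone, stays where it was.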
