Expressive Whole-Body 3D Gaussian Avatar
Gyeongsik Moon, Takaaki Shiratori, Shunsuke Saito
TL;DR
ExAvatar addresses the challenge of creating expressive whole-body 3D avatars from short monocular videos by marrying SMPL-X with 3D Gaussian Splatting in a novel hybrid mesh-Gaussian representation. Per-vertex Gaussians inherit the SMPL-X topology and are driven by the SMPL-X facial expression space. ID-dependent joint and face offsets improve co-registration, and connectivity-based regularizers mitigate artifacts. The approach uses two MLPs to regress Gaussian attributes from per-vertex triplane features, with pose-conditioned and pose-free pathways, and renders via 3DGS with carefully designed losses, including a dedicated face loss. Experiments on NeuMan and X-Humans demonstrate state-of-the-art realism in faces and hands and robustness to limited pose diversity, enabling expressive, monocular-video avatars suitable for real-time applications. Limitations include interior-mouth geometry and dynamic clothing, with future work proposed on priors-based facial canonicalization and relightability.
Abstract
Facial expressions and hand motions are necessary to express our emotions and interact with the world. Nevertheless, most 3D human avatars modeled from a casually captured video only support body motions, without facial expressions and hand motions. In this work, we present ExAvatar, an expressive whole-body 3D human avatar learned from a short monocular video. We design ExAvatar as a combination of the whole-body parametric mesh model (SMPL-X) and 3D Gaussian Splatting (3DGS). The main challenges are 1) a limited diversity of facial expressions and poses in the video and 2) the absence of 3D observations, such as 3D scans and RGBD images. The limited diversity in the video makes animations with novel facial expressions and poses non-trivial. In addition, the absence of 3D observations could cause significant ambiguity in human parts that are not observed in the video, which can result in noticeable artifacts under novel motions. To address these challenges, we introduce our hybrid representation of the mesh and 3D Gaussians. Our hybrid representation treats each 3D Gaussian as a vertex on the surface, with pre-defined connectivity information (i.e., triangle faces) between the Gaussians following the mesh topology of SMPL-X. This makes ExAvatar animatable with novel facial expressions, as it is driven by the facial expression space of SMPL-X. In addition, by using connectivity-based regularizers, we significantly reduce artifacts in novel facial expressions and poses.
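To make the idea of a connectivity-based regularizer concrete, the sketch below shows one generic form such a term could take: a uniform Laplacian penalty that pulls each per-vertex quantity (for example, a 3D offset attached to a Gaussian) toward the mean of its mesh neighbors, where neighbors are read off the triangle faces. This is an illustration under stated assumptions, not the regularizers used by ExAvatar, whose exact form is not given in this abstract. The function names `edges_from_faces` and `laplacian_loss` and the toy mesh are hypothetical.

```python
import numpy as np


def edges_from_faces(faces):
    """Return the unique undirected edges, shape (E, 2), of triangle faces (F, 3)."""
    faces = np.asarray(faces, dtype=np.int64)
    # Each triangle contributes three edges: (v0, v1), (v1, v2), (v2, v0).
    edges = np.concatenate(
        [faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], axis=0
    )
    # Sort the endpoints so (a, b) and (b, a) collapse to one undirected edge.
    edges = np.sort(edges, axis=1)
    return np.unique(edges, axis=0)


def laplacian_loss(values, faces):
    """Mean squared distance between each vertex value and its neighbor mean.

    values: (N, D) per-vertex quantities, e.g. hypothetical Gaussian offsets.
    faces:  (F, 3) triangle indices defining the connectivity.
    """
    values = np.asarray(values, dtype=np.float64)
    edges = edges_from_faces(faces)
    neighbor_sum = np.zeros_like(values)
    degree = np.zeros(values.shape[0])
    # Accumulate neighbor values in both directions of every edge.
    for src, dst in ((0, 1), (1, 0)):
        np.add.at(neighbor_sum, edges[:, src], values[edges[:, dst]])
        np.add.at(degree, edges[:, src], 1.0)
    # Isolated vertices (degree 0) keep a zero neighbor mean.
    neighbor_mean = neighbor_sum / np.maximum(degree, 1.0)[:, None]
    return float(np.mean(np.sum((values - neighbor_mean) ** 2, axis=1)))


# Toy mesh (an assumption for illustration): two triangles sharing an edge.
faces = np.array([[0, 1, 2], [0, 2, 3]])
smooth = np.ones((4, 3))          # identical offsets on every vertex
spiky = smooth.copy()
spiky[1] += 5.0                   # one vertex deviates from its neighbors
```

With identical values on every vertex the penalty is zero, and it grows when a single vertex departs from its neighbors. That is the behavior one would want from a term meant to suppress isolated artifacts in regions the video never observed.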
