DEGAS: Detailed Expressions on Full-Body Gaussian Avatars
Zhijing Shao, Duotun Wang, Qing-Yao Tian, Yao-Dong Yang, Hengyu Meng, Zeyu Cai, Bo Dong, Yu Zhang, Kang Zhang, Zeyu Wang
TL;DR
DEGAS targets animatable full-body avatars with rich facial expressions. It is a 3D Gaussian Splatting (3DGS)-based method that trains a conditional variational autoencoder on multi-view video of a subject to map body motion and a 2D expression latent to Gaussian maps in UV space. A pretrained 2D expression encoder (from DPE) drives the facial expressions, bridging 2D talking-face technology and 3D full-body avatars; pose-dependent Gaussian maps are produced by a pose-conditioned decoder and rendered with linear blend skinning (LBS) deformations. Evaluated on ActorsHQ and the new DREAMS-Avatar dataset, the method shows better rendering quality and expression fidelity than state-of-the-art methods, and it extends to audio-driven full-body talking avatars using SadTalker. The results demonstrate photorealistic, real-time-capable full-body avatars suited to interactive AI agents and telepresence; ethical considerations and public data release are also discussed.
Abstract
Although neural rendering has made significant advances in creating lifelike, animatable full-body and head avatars, incorporating detailed expressions into full-body avatars remains largely unexplored. We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions. Trained on multi-view videos of a given subject, our method learns a conditional variational autoencoder that takes both body motion and facial expression as driving signals to generate Gaussian maps in the UV layout. To drive the facial expressions, instead of the 3D Morphable Models (3DMMs) commonly used in 3D head avatars, we propose to adopt an expression latent space trained solely on 2D portrait images, bridging the gap between 2D talking faces and 3D avatars. Leveraging the rendering capability of 3DGS and the rich expressiveness of the expression latent space, the learned avatars can be reenacted to produce photorealistic renderings with subtle and accurate facial expressions. Experiments on an existing dataset and our newly proposed dataset of full-body talking avatars demonstrate the efficacy of our method. We also propose an audio-driven extension of our method with the help of 2D talking faces, opening new possibilities for interactive AI agents.
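
The TL;DR notes that the generated Gaussians are rendered with LBS deformations. As a rough illustration of that step only, the following is a minimal NumPy sketch of linear blend skinning applied to Gaussian centers. It is not the authors' implementation: the function name `lbs_transform`, the array shapes, and the toy two-joint skeleton are all assumptions made for this example.

```python
import numpy as np

def lbs_transform(means, weights, joint_transforms):
    """Pose Gaussian centers with linear blend skinning (illustrative sketch).

    means:            (N, 3) canonical-space Gaussian centers (assumed layout)
    weights:          (N, J) skinning weights, each row summing to 1
    joint_transforms: (J, 4, 4) rigid transform of each joint
    Returns the posed centers (N, 3) and the blended 3x3 linear parts (N, 3, 3).
    """
    # Blend the per-joint 4x4 transforms with the skinning weights: (N, 4, 4)
    blended = np.einsum("nj,jab->nab", weights, joint_transforms)
    # Lift centers to homogeneous coordinates and apply the blended transform
    homo = np.concatenate([means, np.ones((means.shape[0], 1))], axis=1)
    posed = np.einsum("nab,nb->na", blended, homo)[:, :3]
    return posed, blended[:, :3, :3]

# Toy setup (hypothetical): joint 0 is the identity, joint 1 translates by +1 in x.
joint_transforms = np.stack([np.eye(4), np.eye(4)])
joint_transforms[1, 0, 3] = 1.0

means = np.zeros((3, 3))
weights = np.array([[1.0, 0.0],   # bound fully to joint 0
                    [0.0, 1.0],   # bound fully to joint 1
                    [0.5, 0.5]])  # shared equally between both joints

posed, linear = lbs_transform(means, weights, joint_transforms)
```

In this toy setup the third center moves halfway between the two joints' transforms, which is the blending behavior LBS is named for. In a full 3DGS pipeline the blended linear part would typically also be used to reorient each Gaussian's covariance; that step is omitted here.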
