DEGAS: Detailed Expressions on Full-Body Gaussian Avatars

Zhijing Shao; Duotun Wang; Qing-Yao Tian; Yao-Dong Yang; Hengyu Meng; Zeyu Cai; Bo Dong; Yu Zhang; Kang Zhang; Zeyu Wang

TL;DR

DEGAS tackles animatable full-body avatars with rich facial expressions using a 3D Gaussian Splatting (3DGS) model. Trained on multi-view video of a subject, it learns a conditional variational autoencoder that maps body motion and a 2D expression latent to UV-space Gaussian maps. A pretrained 2D expression encoder (from DPE) drives the facial expressions, bridging 2D talking-face technology and 3D full-body avatars, while a pose-conditioned decoder generates pose-dependent Gaussian maps that are deformed with linear blend skinning (LBS) before rendering. The approach is evaluated on ActorsHQ and a new DREAMS-Avatar dataset, where it is reported to outperform state-of-the-art methods in rendering quality and expression fidelity, and it is extended to audio-driven full-body talking avatars using SadTalker. The results point to photorealistic, real-time-capable full-body avatars suitable for interactive AI agents and telepresence; ethical considerations and a public data release are also discussed.
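The LBS deformation mentioned above can be illustrated with a minimal NumPy sketch. It assumes that per-Gaussian skinning weights and per-joint 4x4 rigid transforms are already available. The function name `lbs_deform` and the array shapes are illustrative assumptions, not taken from the paper, and only the Gaussian centers are transformed here (rotations and covariances are left out).

```python
import numpy as np


def lbs_deform(points, weights, joint_transforms):
    """Linear blend skinning of canonical points (hypothetical helper).

    points:           (N, 3) canonical Gaussian centers
    weights:          (N, J) skinning weights, each row summing to 1
    joint_transforms: (J, 4, 4) rigid transforms from canonical to posed space
    """
    # Blend the per-joint transforms for every point: (N, 4, 4).
    blended = np.einsum("nj,jab->nab", weights, joint_transforms)
    # Lift the points to homogeneous coordinates: (N, 4).
    ones = np.ones((points.shape[0], 1), dtype=points.dtype)
    homo = np.concatenate([points, ones], axis=1)
    # Apply each point's blended transform and drop the homogeneous row.
    posed = np.einsum("nab,nb->na", blended, homo)
    return posed[:, :3]
```

A point whose weights are split evenly between an identity joint and a joint translated by 2 units along x ends up translated by 1 unit, which is the blending behavior that gives LBS its name.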

Abstract

Although neural rendering has made significant advances in creating lifelike, animatable full-body and head avatars, incorporating detailed expressions into full-body avatars remains largely unexplored. We present DEGAS, the first 3D Gaussian Splatting (3DGS)-based modeling method for full-body avatars with rich facial expressions. Trained on multiview videos of a given subject, our method learns a conditional variational autoencoder that takes both the body motion and facial expression as driving signals to generate Gaussian maps in the UV layout. To drive the facial expressions, instead of the commonly used 3D Morphable Models (3DMMs) in 3D head avatars, we propose to adopt the expression latent space trained solely on 2D portrait images, bridging the gap between 2D talking faces and 3D avatars. Leveraging the rendering capability of 3DGS and the rich expressiveness of the expression latent space, the learned avatars can be reenacted to reproduce photorealistic rendering images with subtle and accurate facial expressions. Experiments on an existing dataset and our newly proposed dataset of full-body talking avatars demonstrate the efficacy of our method. We also propose an audio-driven extension of our method with the help of 2D talking faces, opening new possibilities for interactive AI agents.
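To make "Gaussian maps in the UV layout" more concrete, the sketch below shows one way such a map could be unpacked into per-Gaussian parameters, with each valid texel holding one Gaussian. The 14-channel layout (3 position offset, 4 rotation quaternion, 3 log-scale, 1 opacity logit, 3 RGB color), the validity mask, and the name `unpack_gaussian_map` are assumptions made for illustration; this summary does not state the paper's actual channel layout. The activations (exponential for scale, sigmoid for opacity, normalized quaternion) follow the common 3DGS convention.

```python
import numpy as np


def unpack_gaussian_map(gmap, valid_mask):
    """Turn a UV-space Gaussian map into per-Gaussian parameters.

    gmap:       (H, W, 14) raw decoder output (assumed channel layout)
    valid_mask: (H, W) bool, True where a texel lies on the body surface
    """
    texels = gmap[valid_mask]  # (N, 14), one row per valid texel
    quat = texels[:, 3:7]
    # Normalize quaternions; the clip guards against division by zero.
    norm = np.linalg.norm(quat, axis=1, keepdims=True).clip(min=1e-8)
    return {
        "offset": texels[:, 0:3],                             # position offset
        "rotation": quat / norm,                              # unit quaternion
        "scale": np.exp(texels[:, 7:10]),                     # positive scales
        "opacity": 1.0 / (1.0 + np.exp(-texels[:, 10:11])),   # sigmoid
        "color": texels[:, 11:14],                            # RGB (assumed)
    }
```

Storing Gaussians in a 2D map lets a convolutional decoder predict all of them in one pass, and the mask simply discards texels that fall outside the UV islands.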

Paper Structure (23 sections, 10 equations, 13 figures, 6 tables)

Introduction
Related Work
3D Avatar Representations
Talking Face Video Generation
Choice of Driving Signal
Methodology
Preliminaries: 3D Gaussian Splatting
Driving Signal
Conditional Variational Autoencoder
Gaussian Maps and LBS
Loss and Training
Experiments
Comparison on Full-body Avatars
Comparison on DREAMS-Avatar
Discussion w.r.t. AnimatableGaussians
...and 8 more sections

Figures (13)

Figure 1: Photorealistic rendering of full-body avatars using our method on (a) ActorsHQ dataset and (b) our proposed DREAMS-Avatar dataset, (c) with rich facial expressions.
Figure 2: The pipeline of our method. DEGAS takes the face signal from the pretrained expression encoder of DPE [DPE:CVPR:2023] and injects it into the body signal from SMPL-X [SMPL-X:CVPR:2019]. The pose-dependent Gaussian maps generated by the convolutional decoder are applied to the pose-independent maps for 3DGS [3DGS:SIGGRAPH:2023] rendering.
Figure 3: Tiling and masking of the pose $\boldsymbol\theta$ embedding. Joint angles are filled into the areas of the UV layout that each joint can affect through skinning. Colors visualize the skinning weights, which are converted to binary masks.
Figure 4: Training with faces from multiple views. During training, we extract expressions from multiple views and use a randomly averaged latent code as the face signal.
Figure 5: The DREAMS-Avatar dataset. Six subjects are captured by 32 cameras (12 MP each) while performing large body poses and rich facial expressions, e.g., smiling, laughing, angry, surprised, as well as a talking or singing sequence.
...and 8 more figures
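
The input-preparation steps described in the captions of Figures 3 and 4 can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the names `tile_pose_to_uv` and `random_average_latents`, the array shapes, the `threshold` parameter, and the reading of "randomly averaged" as a random convex combination (Dirichlet-sampled weights) are all hypothetical.

```python
import numpy as np


def tile_pose_to_uv(joint_angles, skin_weight_maps, threshold=0.0):
    """Tile joint angles over the UV regions each joint influences.

    joint_angles:     (J, 3) axis-angle rotation per joint
    skin_weight_maps: (J, H, W) skinning weight of each joint at each texel
    returns:          (J * 3, H, W) masked pose embedding
    """
    # Convert skinning weights to binary masks, as in the Figure 3 caption.
    masks = (skin_weight_maps > threshold).astype(joint_angles.dtype)
    # Broadcast each joint's angle over its masked area: (J, 3, H, W).
    tiled = joint_angles[:, :, None, None] * masks[:, None, :, :]
    num_joints, _, height, width = tiled.shape
    return tiled.reshape(num_joints * 3, height, width)


def random_average_latents(view_latents, rng):
    """Random convex combination of per-view expression latents.

    view_latents: (V, D) one expression latent per camera view
    rng:          numpy.random.Generator
    """
    # Dirichlet samples are non-negative and sum to 1 (assumed scheme).
    weights = rng.dirichlet(np.ones(view_latents.shape[0]))
    return weights @ view_latents
```

Masking keeps each joint's angle away from texels it cannot move, and averaging latents from several views with random weights is one plausible way to avoid tying the face signal to any single camera.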
