Expressive Whole-Body 3D Gaussian Avatar

Gyeongsik Moon, Takaaki Shiratori, Shunsuke Saito

TL;DR

ExAvatar addresses the challenge of creating expressive whole-body 3D avatars from short monocular videos by combining SMPL-X with 3D Gaussian Splatting (3DGS) in a hybrid mesh-Gaussian representation. Per-vertex Gaussians inherit the SMPL-X topology and are driven by the SMPL-X facial expression space. ID-dependent joint and face offsets improve co-registration of the hands, face, and body, while connectivity-based regularizers mitigate artifacts. The approach uses two MLPs to regress Gaussian attributes from per-vertex triplane features, with pose-conditioned and pose-free pathways, and renders via 3DGS with carefully designed losses, including a dedicated face loss. Experiments on NeuMan and X-Humans demonstrate state-of-the-art realism in faces and hands and robustness to limited pose diversity, enabling expressive, monocular-video avatars suitable for real-time applications. Limitations include interior-mouth geometry and dynamic clothing, with future work proposed on priors-based facial canonicalization and relightability.

Abstract

Facial expressions and hand motions are necessary to express our emotions and interact with the world. Nevertheless, most 3D human avatars modeled from a casually captured video only support body motions, without facial expressions or hand motions. In this work, we present ExAvatar, an expressive whole-body 3D human avatar learned from a short monocular video. We design ExAvatar as a combination of the whole-body parametric mesh model (SMPL-X) and 3D Gaussian Splatting (3DGS). The main challenges are 1) a limited diversity of facial expressions and poses in the video and 2) the absence of 3D observations, such as 3D scans and RGBD images. The limited diversity in the video makes animations with novel facial expressions and poses non-trivial. In addition, the absence of 3D observations could cause significant ambiguity in human parts that are not observed in the video, which can result in noticeable artifacts under novel motions. To address these challenges, we introduce our hybrid representation of the mesh and 3D Gaussians. Our hybrid representation treats each 3D Gaussian as a vertex on the surface, with pre-defined connectivity information (i.e., triangle faces) between them following the mesh topology of SMPL-X. This makes our ExAvatar animatable with novel facial expressions, because it is driven by the facial expression space of SMPL-X. In addition, by using connectivity-based regularizers, we significantly reduce artifacts in novel facial expressions and poses.
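To make the idea of a connectivity-based regularizer concrete, the following is a minimal sketch of one common choice, a uniform Laplacian penalty computed over the triangle faces shared by the Gaussians. It is an illustration under assumptions, not the authors' formulation: the function names (`build_neighbors`, `uniform_laplacian`, `laplacian_regularizer`) and the choice of a uniform (unweighted) Laplacian compared against a reference mesh are hypothetical.

```python
import numpy as np

def build_neighbors(num_vertices, faces):
    """Collect the one-ring neighbors of every vertex from triangle faces."""
    neighbors = [set() for _ in range(num_vertices)]
    for a, b, c in faces:
        neighbors[a].update((b, c))
        neighbors[b].update((a, c))
        neighbors[c].update((a, b))
    return [sorted(n) for n in neighbors]

def uniform_laplacian(positions, neighbors):
    """Offset of each vertex from the mean of its one-ring neighbors."""
    lap = np.zeros_like(positions)
    for i, nbrs in enumerate(neighbors):
        if nbrs:  # isolated vertices keep a zero Laplacian
            lap[i] = positions[i] - positions[nbrs].mean(axis=0)
    return lap

def laplacian_regularizer(positions, reference, faces):
    """Assumed form: mean squared difference between the Laplacians of the
    current Gaussian positions and those of a reference mesh."""
    positions = np.asarray(positions, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    neighbors = build_neighbors(len(positions), faces)
    diff = uniform_laplacian(positions, neighbors) - uniform_laplacian(reference, neighbors)
    return float((diff ** 2).sum(axis=1).mean())

# Toy example: a tetrahedron standing in for the (much larger) SMPL-X mesh.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
reference = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
moved = reference.copy()
moved[3] += np.array([0.0, 0.0, 0.5])  # pull one vertex away from the surface
loss_same = laplacian_regularizer(reference, reference, faces)
loss_moved = laplacian_regularizer(moved, reference, faces)
```

Because the penalty depends only on each point's position relative to its mesh neighbors, a rigid translation of the whole surface costs nothing, while a single point drifting away from its neighbors is penalized. This is the kind of constraint that fixed connectivity makes possible and that unstructured Gaussians lack.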

Paper Structure

This paper contains 22 sections, 9 equations, 13 figures, 7 tables.

Figures (13)

  • Figure 1: From (a) a monocular video of a single person, we create (b) our ExAvatar, an expressive whole-body 3D avatar that is animatable with novel facial expression codes, hand poses, and body poses of SMPL-X.
  • Figure 2: The effectiveness of our joint offset $\Delta \mathbf{J}$ and face offset $\Delta \mathbf{V}_\text{face}$. They are necessary for the accurate registration of hands and face, which results in accurate co-registration of the whole body.
  • Figure 3: Without the face offset $\Delta \mathbf{V}_\text{face}$, the final 3D geometry of the avatar becomes totally inauthentic and inaccurate. For each setting, normals of 3D Gaussian points and colors are used for the rendering.
  • Figure 4: The architecture of our ExAvatar. From the canonical mesh $\bar{\mathbf{V}}$, triplane $\mathbf{T}$, per-vertex normal, and 3D pose $\theta$, we build a 3D animatable avatar. Then, with driving signals, 3D pose $\theta$ and facial expression code $\psi$ of SMPL-X (Pavlakos et al., 2019), we animate the avatar and render it to the screen space with 3DGS (Kerbl et al., 2023). For the normal rendering, we calculate the normal vectors using the positions of 3D Gaussian points and mesh topology of SMPL-X.
  • Figure 5: Our animated expressive whole-body avatars, made from monocular videos of the NeuMan dataset (Jiang et al., 2022). Avatars in each row are animated with the same facial expression code $\psi$ and 3D pose $\theta$ of SMPL-X.
  • ...and 8 more figures
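
The Figure 4 caption notes that normals are calculated from the positions of the 3D Gaussian points and the SMPL-X mesh topology. A minimal sketch of how such per-vertex normals can be obtained from positions and triangle faces is given below. The function name `vertex_normals` and the area-weighted averaging scheme are assumptions made for illustration; the source does not state which weighting the authors use.

```python
import numpy as np

def vertex_normals(positions, faces):
    """Per-vertex unit normals from point positions and triangle faces.

    Assumed scheme: each face contributes its (area-weighted) normal to its
    three corner vertices, and the accumulated sum is normalized.
    """
    positions = np.asarray(positions, dtype=np.float64)
    faces = np.asarray(faces, dtype=np.int64)
    v0 = positions[faces[:, 0]]
    v1 = positions[faces[:, 1]]
    v2 = positions[faces[:, 2]]
    # The cross product's length is twice the triangle area, so larger
    # triangles naturally contribute more to the vertex normal.
    face_normals = np.cross(v1 - v0, v2 - v0)
    normals = np.zeros_like(positions)
    for k in range(3):
        # np.add.at accumulates correctly when a vertex index repeats.
        np.add.at(normals, faces[:, k], face_normals)
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    return normals / np.clip(lengths, 1e-12, None)

# Toy example: a unit square in the xy-plane, split into two triangles
# with counter-clockwise winding, so every normal points along +z.
square = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
square_faces = [[0, 1, 2], [0, 2, 3]]
normals = vertex_normals(square, square_faces)
```

Since the Gaussians share the mesh's fixed triangle list, the same face array can be reused for every frame; only the positions change as the avatar is animated.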