3D Gaussian Blendshapes for Head Avatar Animation

Shengjie Ma, Yanlin Weng, Tianjia Shao, Kun Zhou

TL;DR

This paper introduces a 3D Gaussian blendshape representation for head avatars that combines the semantics of classical mesh blendshapes with a Gaussian splatting renderer. The avatar consists of a neutral base model B0, a set of expression blendshapes ΔBk, and a mouth interior model Bm, all represented as 3D Gaussians. An arbitrary expression is produced by linearly blending these with expression coefficients ψ and then applying linear blend skinning with Θ, which yields real-time, high-fidelity animation from monocular video. A key contribution is enforcing semantic consistency between each Gaussian blendshape and its mesh counterpart through an intermediate variable that links Gaussian differences to mesh deformations; this improves generalization to novel expressions. The method outperforms state-of-the-art NeRF- and point-based approaches on quality metrics and runs at up to 370 fps with tens of thousands of Gaussians. It offers a practical solution for photorealistic telepresence and AR/VR applications, while acknowledging limitations in side-view generalization and the potential for misuse.

Abstract

We introduce 3D Gaussian blendshapes for modeling photorealistic head avatars. Taking a monocular video as input, we learn a base head model of neutral expression, along with a group of expression blendshapes, each of which corresponds to a basis expression in classical parametric face models. Both the neutral model and expression blendshapes are represented as 3D Gaussians, which contain a few properties to depict the avatar appearance. The avatar model of an arbitrary expression can be effectively generated by combining the neutral model and expression blendshapes through linear blending of Gaussians with the expression coefficients. High-fidelity head avatar animations can be synthesized in real time using Gaussian splatting. Compared to state-of-the-art methods, our Gaussian blendshape representation better captures high-frequency details exhibited in input video, and achieves superior rendering performance.
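
The linear blending step described in the abstract can be illustrated with a minimal NumPy sketch. It assumes each Gaussian attribute is stored as an array of shape (N, D) for the neutral model and (K, N, D) for the K blendshape differences, and that the blended model is the neutral model plus the coefficient-weighted sum of differences. The function name `blend_gaussians` and the attribute names `position` and `opacity` are hypothetical, chosen for illustration only; the paper's actual attribute set, its handling of rotations, and the subsequent linear blend skinning step are not reproduced here.

```python
import numpy as np


def blend_gaussians(base, deltas, psi):
    """Blend Gaussian attributes linearly: base + sum_k psi[k] * deltas[k].

    base:   dict of attribute name -> array of shape (N, D)   (neutral model)
    deltas: dict of attribute name -> array of shape (K, N, D) (blendshape differences)
    psi:    array of shape (K,) holding the expression coefficients
    """
    psi = np.asarray(psi, dtype=float)
    blended = {}
    for name, neutral in base.items():
        # Contract the K axis of psi with the K axis of the differences -> (N, D).
        offset = np.tensordot(psi, deltas[name], axes=1)
        blended[name] = neutral + offset
    return blended


# Toy example: N = 2 Gaussians, K = 2 blendshapes (all values are made up).
base = {
    "position": np.zeros((2, 3)),
    "opacity": np.ones((2, 1)),
}
deltas = {
    "position": np.stack([np.ones((2, 3)), 2.0 * np.ones((2, 3))]),
    "opacity": np.zeros((2, 2, 1)),
}
psi = np.array([0.5, 0.25])

avatar = blend_gaussians(base, deltas, psi)
```

Because the blend is a single weighted sum per attribute, its cost grows linearly with the number of Gaussians and blendshapes, which is consistent with the real-time rates quoted above.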

Paper Structure

This paper contains 21 sections, 6 equations, 12 figures, and 3 tables.

Figures (12)

  • Figure 1: Overview of our method. Taking a monocular video as input, our method learns a Gaussian blendshape representation of a head avatar, which consists of a neutral model $B_0$, a group of expression blendshapes $\{B_1,B_2,...,B_K\}$, and the mouth interior model $B_m$, all represented as 3D Gaussians. Avatar models of arbitrary expressions and poses can be generated by linear blending with expression coefficients $\{\psi'_k\}$ and linear blend skinning with joint and pose parameters $\Theta'$, from which we render high-fidelity images in real time using Gaussian splatting.
  • Figure 2: The impact of blendshape consistency on the optimization of expression blendshapes. The first row shows the displacement magnitude between $M_k$ and $M_0$. The second and third rows show the magnitude of the optimized $\Delta B_k$ with and without blendshape consistency, respectively.
  • Figure 3: Ablation study on blendshape consistency. Optimization without blendshape consistency leads to visible artifacts, such as dirty colors and glitches, in both interior and boundary areas. Enforcing blendshape consistency only on Gaussian positions also leads to poor results.
  • Figure 4: Qualitative comparisons with NeRFBlendShape [Gao et al. 2022]. Our method more faithfully captures fine facial details (e.g., wrinkles around the eyes and nose), and better recovers the eyeball movement. YouTube video ID is -yHgE9W699w for Hillary Clinton.
  • Figure 5: Qualitative comparisons for novel view extrapolation. Our method produces better results with fine details under novel views.
  • ...and 7 more figures