GaussianHeads: End-to-End Learning of Drivable Gaussian Head Avatars from Coarse-to-fine Representations

Kartik Teotia, Hyeongwoo Kim, Pablo Garrido, Marc Habermann, Mohamed Elgharib, Christian Theobalt

TL;DR

This work proposes a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real time, using a hierarchical representation of head models that can capture the complex dynamics of facial expressions and head movements.

Abstract

Real-time rendering of human head avatars is a cornerstone of many computer graphics applications, such as augmented reality, video games, and films. Recent approaches address this challenge with computationally efficient geometry primitives in a carefully calibrated multi-view setup. Although they produce photorealistic head renderings, they often fail to represent complex motion changes, such as the mouth interior and strongly varying head poses. We propose a new method to generate highly dynamic and deformable human head avatars from multi-view imagery in real time. At the core of our method is a hierarchical representation of head models that captures the complex dynamics of facial expressions and head movements. First, with rich facial features extracted from raw input frames, we learn to deform the coarse facial geometry of the template mesh. We then initialize 3D Gaussians on the deformed surface and refine their positions in a fine step. We train this coarse-to-fine facial avatar model along with the head pose as a learnable parameter in an end-to-end framework. This enables not only controllable facial animation via video inputs, but also high-fidelity novel view synthesis of challenging facial expressions, such as tongue deformations and fine-grained teeth structure under large motion changes. Moreover, it encourages the learned head avatar to generalize towards new facial expressions and head poses at inference time. We demonstrate the performance of our method with comparisons against related methods on different datasets, spanning challenging facial expression sequences across multiple identities. We also demonstrate a potential application of our approach: cross-identity facial performance transfer.

Paper Structure (29 sections, 11 equations, 15 figures, 6 tables)

Figures (15)

  • Figure 1: Method Overview. Multi-view driving images are provided as input, and an image encoder extracts the animation code ($\mathbf{Z}_{\text{exp}}$) and global head pose parameters ($\mathbf{R}_{\text{rigid}},\,\mathbf{T}_{\text{rigid}}$). The animation code is fed into a vertex deformer network to generate per-vertex displacements ($\delta_v$) relative to a canonical template mesh in a rest pose with vertices $v_{t}$, resulting in an expression-dependent deformed mesh with vertices $v_{d}$. This is then globally transformed to a posed mesh with vertices $v_{p}$ using the head pose parameters ($\mathbf{R}_{\text{rigid}},\,\mathbf{T}_{\text{rigid}}$). 3D Gaussians with positions ($\mathbf{p}_{\text{in}}$), scales ($\mathbf{s}_{\text{in}}$), and rotations ($\mathbf{r}_{\text{in}}$) are initialized on the globally transformed deformed mesh. The deltas decoder predicts deformation for position ($\delta_{p}$), rotation ($\delta_{r}$), and scale ($\delta_{s}$) to refine the initialized 3D Gaussians. We employ two decoders to predict per-Gaussian RGB color and opacity aligned to the UV map of the template. 3D Gaussian Splatting projects the deformed Gaussians into the image plane, resulting in the rendered image. The pipeline is trained end-to-end using multi-view RGB images, expression tracking data, and rigid head pose tracking data. During testing, only a feedforward pass of the input is required to drive the global rigid head pose and facial expressions.
  • Figure 2: Coarse-to-fine Learning. We refine the FLAME shape-fitted mesh by an initial registration step (b), which results in the template mesh with vertices $v_t$ used in our framework. (c) The template mesh is deformed based on the input and posed to the world space using the global transformation parameters. We initialize the 3D Gaussians on the rigidly transformed deformed mesh with vertices $v_p$ and refine their properties through our Coarse-to-Fine framework, resulting in dense, head surface-aligned 3D Gaussians (d). The 3D Gaussians are splatted, resulting in the render (e), which is supervised with the ground-truth image (f).
  • Figure 3: Qualitative Results. Novel view synthesis at virtual camera paths for different subjects. Our method excels at representing fine details, such as facial hair and fine scalp hair strands.
  • Figure 4: Qualitative Results. Expression synthesis from a novel camera viewpoint. Our approach can synthesize challenging expressions and motions, such as the tongue sticking out.
  • Figure 5: Ablation Study. Left to right: Without coarse deformations, without fine deformations, without rigid landmark alignment loss $\mathbf{L}_{\text{lmk}}$, ours, and Ground Truth. The highlighted regions show expression alignment in the mouth region. We observe that our design choices result in high-quality details in the mouth interior. Note: Best viewed if zoomed in or in the supplementary video.
  • ...and 10 more figures
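
The geometric part of the pipeline in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names `coarse_pose` and `refine_positions` are hypothetical. The sketch assumes that the displacement $\delta_v$ is added to the template vertices $v_t$, that the rigid transform is applied as $v_p = \mathbf{R}_{\text{rigid}} v_d + \mathbf{T}_{\text{rigid}}$, and that the position delta $\delta_p$ is a plain additive offset on $\mathbf{p}_{\text{in}}$. The caption does not state how the deltas are composed, so these are illustrative choices. Rotation and scale refinement, the color and opacity decoders, and splatting are omitted.

```python
import numpy as np


def coarse_pose(v_t, delta_v, R_rigid, T_rigid):
    """Coarse step: deform the template mesh, then pose it rigidly.

    v_t      : (N, 3) template vertices in the rest pose
    delta_v  : (N, 3) per-vertex displacements (from a vertex deformer)
    R_rigid  : (3, 3) global head rotation matrix
    T_rigid  : (3,)   global head translation
    """
    # ASSUMPTION: displacements are added to the template vertices.
    v_d = v_t + delta_v
    # Vertices are stored as rows, so R @ v becomes v @ R.T.
    v_p = v_d @ R_rigid.T + T_rigid
    return v_d, v_p


def refine_positions(p_in, delta_p):
    """Fine step for positions only.

    ASSUMPTION: the position delta is a plain additive offset; the
    source caption does not specify how deltas are applied.
    """
    return p_in + delta_p
```

As a usage example, a single vertex `[1, 0, 0]` displaced by `[0, 1, 0]`, rotated 90 degrees about the z-axis and translated by `[0, 0, 1]`, yields the posed vertex `[-1, 1, 1]`. A Gaussian initialized at that vertex and refined with `delta_p = [0.5, 0, 0]` ends up at `[-0.5, 1, 1]`.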