Synthetic Prior for Few-Shot Drivable Head Avatar Inversion

Wojciech Zielonka, Stephan J. Garbin, Alexandros Lattas, George Kopanas, Paulo Gotardo, Thabo Beeler, Justus Thies, Timo Bolkart

TL;DR

SynShot tackles few-shot drivable head avatar inversion by learning a synthetic head prior trained on a large synthetic dataset and adapting it to real subjects with only three input images. The core idea is a Gaussian-prior model that uses a Drivable VQ-VAE to output per-part Gaussian primitives and UV-space feature maps, refined via part-specific regressors. A two-stage pivotal-tuning inversion bridges the synthetic-to-real domain, enabling dramatic improvements in novel-view and novel-expression synthesis while reducing data and privacy burdens. Empirically, SynShot outperforms state-of-the-art monocular and inversion-based baselines, validating synthetic priors as a practical route for private, scalable drivable head avatars.

Abstract

We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle three major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, the use of real data is strictly regulated (e.g., under the General Data Protection Regulation, which mandates frequent deletion of models and data to accommodate a situation when a participant's consent is withdrawn). Synthetic data, free from these constraints, is an appealing alternative. Third, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. With few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to SOTA monocular and GAN-based methods, SynShot significantly improves novel view and expression synthesis.
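The idea of reading Gaussian parameters from UV-space maps with per-part control over the number of primitives can be illustrated with a small NumPy sketch. This is not the authors' implementation: the rectangular part regions, the part names, the per-part resolutions, and the helper names `bilinear_sample` and `sample_parts` are all hypothetical, chosen only to show how one decoded map could be bilinearly sampled more densely for a complex part (hair) than for a simple one (skin).

```python
import numpy as np


def bilinear_sample(feature_map, uv):
    """Bilinearly sample an (H, W, C) map at (N, 2) UV coordinates in [0, 1]."""
    h, w, _ = feature_map.shape
    # Map UV in [0, 1] to continuous pixel coordinates.
    x = uv[:, 0] * (w - 1)
    y = uv[:, 1] * (h - 1)
    # Top-left texel of each 2x2 neighborhood, clipped so x0 + 1 and y0 + 1 stay in range.
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    fx = (x - x0)[:, None]
    fy = (y - y0)[:, None]
    top = (1 - fx) * feature_map[y0, x0] + fx * feature_map[y0, x0 + 1]
    bottom = (1 - fx) * feature_map[y0 + 1, x0] + fx * feature_map[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom


def sample_parts(feature_map, parts):
    """Sample each part on its own regular UV grid.

    `parts` maps a part name to (uv_min, uv_max, n_per_side). The rectangles
    and resolutions are assumptions made for this sketch only.
    """
    out = {}
    for name, (uv_min, uv_max, n) in parts.items():
        us = np.linspace(uv_min[0], uv_max[0], n)
        vs = np.linspace(uv_min[1], uv_max[1], n)
        grid = np.stack(np.meshgrid(us, vs, indexing="xy"), axis=-1).reshape(-1, 2)
        out[name] = bilinear_sample(feature_map, grid)
    return out


# Hypothetical layout: hair is sampled 4x more densely per side than skin.
features = np.random.default_rng(0).normal(size=(64, 64, 8))
parts = {
    "skin": ((0.0, 0.0), (1.0, 0.5), 32),
    "hair": ((0.0, 0.5), (1.0, 1.0), 128),
}
per_part = sample_parts(features, parts)
```

Because the sampling grid is decoupled from the map resolution, the primitive count of a part can be changed by editing its `n_per_side` value without retraining or resizing the underlying map, which is the kind of explicit control the abstract describes.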

Paper Structure (28 sections, 5 equations, 20 figures, 2 tables)

Figures (20)

  • Figure 1: Pipeline overview. Given an extracted texture $\pmb{x}_{\mathrm{tex}}$, a rasterized position map $\pmb{x}_{\mathrm{verts}}$, and PCA expression deltas $\pmb{x}_{\mathrm{exp}}$, our network uses a VQ-VAE to jointly optimize two latent spaces, $\pmb{z}_{\mathrm{expr}}$ and $\pmb{z}_{\mathrm{id}}$. The VQ-VAE decoders predict a feature map $\pmb{D}_{\mathrm{feat}}(\mathbf{q}(\pmb{z}_{\mathrm{id}}), \mathbf{q}(\pmb{z}_{\mathrm{expr}})) \rightarrow \hat{\pmb{x}}_{\mathrm{feat}}$, identity and color maps $\pmb{D}_{\mathrm{id}}(\mathbf{q}(\pmb{z}_{\mathrm{id}})) \rightarrow \{\hat{\pmb{x}}_{\mathrm{tex}}, \hat{\pmb{x}}_{\mathrm{verts}}\}$, and expression deltas $\pmb{D}_{\mathrm{expr}}(\mathbf{q}(\pmb{z}_{\mathrm{expr}})) \rightarrow \hat{\pmb{x}}_{\mathrm{expr}}$. Finally, the bilinearly sampled maps are passed to the per-part regressors $\pmb{R}_{\mathrm{color}}$ and $\pmb{R}_{\mathrm{gauss}}$ to obtain the primitives to rasterize.
  • Figure 2: Result of the pivotal tuning before (Prior) and after fine-tuning the model decoders and regressors (Final).
  • Figure 3: Random samples of our synthetic dataset show a diverse range of identities, expressions, and hairstyles that would be challenging to capture in an in-house studio with real subjects.
  • Figure 4: Cross-reenactment comparison of SynShot inversion, using only 3 views, with state-of-the-art (SOTA) methods: INSTA [Zielonka2022InstantVH], Flash Avatar (FA) [xiang2024flashavatar], and Splatting Avatar (SA) [shao2024splattingavatar], each of which was trained on an average of 3000 frames. Without a strong prior, these methods fail to generalize to novel expressions and views. Inversion input images are in the supplemental materials.
  • Figure 5: Novel view evaluation of long hair and beard inversion using only three input images demonstrates the strong generalization capability of SynShot.
  • ...and 15 more figures
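
The quantization operator $\mathbf{q}(\cdot)$ in the Figure 1 caption is the standard VQ-VAE step that snaps each latent vector to its nearest codebook entry. A minimal NumPy sketch of that lookup follows. The function name `quantize`, the codebook size, and the latent dimension are assumptions for illustration; the sketch covers only the forward lookup and omits the training-time details (straight-through gradient, codebook and commitment losses), whose exact form in SynShot is not given here.

```python
import numpy as np


def quantize(z, codebook):
    """Snap each row of z (N, D) to its nearest row of codebook (K, D).

    Returns the quantized vectors (N, D) and the chosen indices (N,).
    """
    # Squared Euclidean distance between every latent and every code: (N, K).
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx


# Hypothetical sizes: 512 codes of dimension 16, and 10 latent vectors.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 16))
z_id = rng.normal(size=(10, 16))
z_id_q, codes = quantize(z_id, codebook)
```

In the pipeline described by the caption, the identity and expression latents would each be quantized this way before being passed to their decoders.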