FreeAvatar: Robust 3D Facial Animation Transfer by Learning an Expression Foundation Model

Feng Qiu, Wei Zhang, Chen Liu, Rudong An, Lincheng Li, Yu Ding, Changjie Fan, Zhipeng Hu, Xin Yu

TL;DR

This work proposes a novel Expression-driven Multi-avatar Animator, which first maps expressive semantics to the facial control parameters of 3D avatars and then imposes perceptual constraints between the input and output images to maintain expression consistency.

Abstract

Video-driven 3D facial animation transfer aims to drive avatars to reproduce the expressions of actors. Existing methods have achieved remarkable results by constraining both geometric and perceptual consistency. However, geometric constraints (such as those defined on facial landmarks) are insufficient to capture subtle emotions, while expression features trained on classification tasks lack the granularity needed for complex emotions. To address this, we propose FreeAvatar, a robust facial animation transfer method that relies solely on our learned expression representation. FreeAvatar consists of two main components: the expression foundation model and the facial animation transfer model. In the first component, we first construct a facial feature space through a face reconstruction task and then optimize the expression feature space by exploring the similarities among different expressions. Because it is trained on large amounts of unlabeled facial images and a re-collected expression comparison dataset, our model adapts freely and effectively to any in-the-wild input facial image. In the facial animation transfer component, we propose a novel Expression-driven Multi-avatar Animator, which first maps expressive semantics to the facial control parameters of 3D avatars and then imposes perceptual constraints between the input and output images to maintain expression consistency. To make the entire process differentiable, we employ a trained neural renderer to translate rig parameters into corresponding images. Furthermore, unlike previous methods that require a separate decoder for each avatar, we propose a dynamic identity injection module that allows multiple avatars to be trained jointly within a single network.
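The idea of training several avatars in one network through identity injection can be illustrated with a small sketch. The abstract does not specify how the module is built, so everything below is an assumption made for illustration: the class name `SharedAnimator`, the per-avatar scale-and-shift identity code, the single shared output layer truncated to each avatar's rig size, and the choice to squash rig parameters into (0, 1). Weights are random rather than trained.

```python
import math
import random


def linear(x, weight, bias):
    """Compute W x + b with the weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def random_matrix(rows, cols, rng):
    """Small random weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class SharedAnimator:
    """Hypothetical shared decoder: expression feature -> rig parameters.

    All avatars share the same input and output layers. Only a small
    per-avatar identity code (scale, shift) differs between avatars.
    """

    def __init__(self, feat_dim, hidden_dim, rig_dims, seed=0):
        rng = random.Random(seed)
        self.rig_dims = dict(rig_dims)
        self.w_in = random_matrix(hidden_dim, feat_dim, rng)
        self.b_in = [0.0] * hidden_dim
        # One shared output layer sized for the largest rig (assumption).
        max_rig = max(self.rig_dims.values())
        self.w_out = random_matrix(max_rig, hidden_dim, rng)
        self.b_out = [0.0] * max_rig
        # Identity code per avatar: a scale and a shift over hidden units.
        self.identity = {}
        for name in self.rig_dims:
            scale = [1.0 + rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
            shift = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
            self.identity[name] = (scale, shift)

    def forward(self, expression_feature, avatar):
        hidden = [math.tanh(v) for v in linear(expression_feature, self.w_in, self.b_in)]
        # Inject identity by modulating the shared hidden features.
        scale, shift = self.identity[avatar]
        hidden = [s * h + t for s, h, t in zip(scale, hidden, shift)]
        logits = linear(hidden, self.w_out, self.b_out)
        rig = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # squash to (0, 1)
        # Keep only as many controls as this avatar's rig exposes.
        return rig[: self.rig_dims[avatar]]


animator = SharedAnimator(feat_dim=4, hidden_dim=8, rig_dims={"avatar_a": 5, "avatar_b": 3})
feature = [0.2, -0.1, 0.4, 0.0]
rig_a = animator.forward(feature, "avatar_a")
rig_b = animator.forward(feature, "avatar_b")
```

The same expression feature yields a different rig vector for each avatar, while all large weight matrices are shared. In the full method these rig parameters would be passed to the neural renderer so that a perceptual loss on the rendered image can be back-propagated; that part is omitted here.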

Paper Structure (40 sections, 10 equations, 9 figures)

Figures (9)

  • Figure 1: Pipeline overview. FreeAvatar first constructs an expression foundation model in two steps: facial feature space construction with Masked Autoencoder (MAE) and expression feature space optimization via contrastive learning. After that, an Expression-driven Multi-avatar Animator is constructed to encode the expression representations into rig parameters. Then, perceptual constraints are employed in a differentiable manner to ensure that the expressions between the input and the avatars remain consistent.
  • Figure 2: Comparisons with Faceware. Our method captures more detailed expressions, particularly the mouth movements during dialogue.
  • Figure 3: Comparisons with MetaHuman Animator. Our FreeAvatar demonstrates enhanced robustness in non-ideal conditions.
  • Figure 4: Distribution of user study results comparing the expression consistency between our method and commercial methods, Faceware and MetaHuman Animator. The results demonstrate that our FreeAvatar significantly outperforms the others.
  • Figure 5: Comparisons with state-of-the-art monocular face reconstruction methods in terms of maintaining consistent expressions. In some cases, HRN and FaceVerse fail to generate results because they cannot detect facial landmarks. Portrait images are selected from the FEC database ©Google AI (CC-0). Cartoon characters ©NetEase.
  • ...and 4 more figures
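
The two foundation-model steps named in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the 75% mask ratio is the default from the original MAE work and may differ from what FreeAvatar uses, and the triplet margin loss is one common way to learn from expression comparisons, chosen here as an assumption since the caption only says "contrastive learning". The function names and the margin value are hypothetical.

```python
import math
import random


def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """MAE-style masking: randomly split patch indices into visible and masked."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1.0 - mask_ratio))
    visible = sorted(order[:num_keep])   # fed to the encoder
    masked = sorted(order[num_keep:])    # reconstructed by the decoder
    return visible, masked


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Penalize triplets where the similar pair is not closer than the
    dissimilar pair by at least `margin` (assumed loss form)."""
    a, p, n = (l2_normalize(v) for v in (anchor, positive, negative))
    return max(0.0, squared_distance(a, p) - squared_distance(a, n) + margin)


# A 14 x 14 patch grid, as produced by 16-pixel patches on a 224-pixel image.
visible, masked = random_patch_mask(196, mask_ratio=0.75, seed=0)

# Embeddings where the positive matches the anchor incur no loss.
satisfied = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Swapping positive and negative violates the ordering and is penalized.
violated = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Masking forces the encoder to infer facial structure from a quarter of the patches, which needs no labels. The comparison loss then reshapes that feature space so that distance reflects expression similarity rather than identity or pose.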