Diverse Code Query Learning for Speech-Driven Facial Animation

Chunzhi Gu, Shigeru Kuriyama, Katsuya Hotta

TL;DR

Diverse Code Query Learning for Speech-Driven Facial Animation introduces a stochastic, multi-sample framework for audio-driven 3D facial animation. It builds region-specific VQ-VAE priors for the lips and the upper face, and learns a diverse code querying mechanism that generates multiple latent codes per timestep under a diversity-promoting loss, while employing DTW-based pseudo supervision to guide varied yet plausible motions. A partial controllability mechanism predicts lip codes first and upper-face codes second, using cross-attention and a style token to keep the facial parts coherent and controllable. Empirical results on BIWI and VOCASET demonstrate state-of-the-art diversity and competitive realism, supported by user studies; limitations include occasional eye-motion artifacts and a focus on external facial geometry rather than inner-mouth dynamics.
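
The summary names DTW (dynamic time warping) without describing it. As a refresher on the alignment cost only, here is a minimal sketch of textbook DTW between two 1-D sequences. The function name `dtw_distance` and the absolute-difference frame cost are assumptions made for illustration; how the paper turns DTW alignments into pseudo supervision is not stated in this summary and is not modeled here.

```python
def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) DTW cost between two 1-D sequences.

    Illustrative only: the per-frame cost |a_i - b_j| is an assumption,
    not the cost used in the paper.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a only, step in b only, step in both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because DTW allows one frame to match several frames of the other sequence, two motions that differ only in timing, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, receive zero cost.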

Abstract

Speech-driven facial animation aims to synthesize lip-synchronized 3D talking faces following the given speech signal. Prior methods for this task mostly focus on pursuing realism with deterministic systems, yet characterizing the potentially stochastic nature of facial motions has rarely been studied to date. While generative modeling approaches can easily handle the one-to-many mapping by repeatedly drawing samples, ensuring a diverse mode coverage of plausible facial motions on small-scale datasets remains challenging and less explored. In this paper, we propose predicting multiple samples conditioned on the same audio signal and then explicitly encouraging sample diversity to address diverse facial animation synthesis. Our core insight is to guide our model to explore the expressive facial latent space with a diversity-promoting loss such that the desired latent codes for diversification can be ideally identified. To this end, building upon the rich facial prior learned with a vector-quantized variational auto-encoding mechanism, our model temporally queries multiple stochastic codes which can be flexibly decoded into a diverse yet plausible set of speech-faithful facial motions. To further allow for control over different facial parts during generation, the proposed model is designed to predict different facial portions of interest in a sequential manner, and compose them to eventually form full-face motions. Our paradigm realizes both diverse and controllable facial animation synthesis in a unified formulation. We experimentally demonstrate that our method yields state-of-the-art performance both quantitatively and qualitatively, especially regarding sample diversity.
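
The two ingredients named in the abstract, a vector-quantized codebook and a diversity-promoting loss over several samples, can be illustrated with a small sketch. Everything below is hypothetical: `quantize`, `diversity_loss`, the `scale` parameter, and the exponential pairwise form of the loss are assumptions chosen as one common way to penalize samples that collapse together, not the formulation used in the paper.

```python
import math

def quantize(z, codebook):
    """Return (index, code) of the codebook entry nearest to z (Euclidean).

    This is the standard nearest-neighbour lookup used in VQ-VAE-style models.
    """
    idx = min(range(len(codebook)), key=lambda k: math.dist(z, codebook[k]))
    return idx, codebook[idx]

def diversity_loss(samples, scale=1.0):
    """Mean of exp(-||s_i - s_j|| / scale) over all distinct pairs.

    Assumed form, for illustration: the value is 1.0 when all samples
    coincide and approaches 0 as they spread apart, so minimizing it
    pushes samples away from each other. Requires at least two samples.
    """
    pairs = [(i, j) for i in range(len(samples)) for j in range(i + 1, len(samples))]
    total = sum(math.exp(-math.dist(samples[i], samples[j]) / scale) for i, j in pairs)
    return total / len(pairs)

# Toy usage: quantize a query vector, then score two sets of decoded samples.
codebook = [[0.0, 0.0], [1.0, 0.0]]
index, code = quantize([0.9, 0.1], codebook)
collapsed = diversity_loss([[0.0, 0.0], [0.0, 0.0]])
spread = diversity_loss([[0.0, 0.0], [3.0, 4.0]])
```

In this toy setup the collapsed pair scores higher (worse) than the spread pair, which is the behaviour a diversity-promoting term is meant to have. In practice such a term is balanced against a reconstruction or plausibility objective so that diversity does not come at the cost of speech-faithful motion.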

Paper Structure (13 sections, 13 equations, 8 figures, 6 tables)

Figures (8)

  • Figure 1: Diverse (a) and controllable (b) facial motion synthesis. In the controllable setting (b), all samples have strictly fixed lip motions (blue dotted area) but with diverse upper-face variations.
  • Figure 2: Codebook pair learning with VQ-VAEs for lip (bottom) and upper-face areas (top). $*\in\{u,l\}$ refers to upper-face or lip, respectively.
  • Figure 3: Illustration of Closure-aware masking. The mask prevents the diversification from being promoted over the sounds with closed lip movements.
  • Figure 4: Method overview of CDFace. Our method sequentially predicts diverse codes for the Lip- (L) and Upper-face (U)-areas, in (a) and (b) respectively, using the encoded audio embedding in (c).
  • Figure 5: Diverse synthesis on VOCASET-Test against FaceFormer. For each syllable, we display three samples from CDFace and FaceDiffuser, respectively.
  • ...and 3 more figures