FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization

Shuai Tan, Bin Ji, Ye Pan

TL;DR

FlowVQTalker addresses the challenge of producing emotionally expressive talking faces that are both lip-synchronous and texture-rich. It introduces a flow-based coefficient generator that maps 3D Morphable Model coefficients $(\beta, \rho)$ to a latent space modeled as a Student's $t$ mixture, enabling diverse nonverbal dynamics guided by audio and emotion context, while a Vector-Quantized Image Generator relies on a texture codebook to render high-definition expressions and teeth. The two modules are tightly integrated via the 3DMM representation, with emotion transfer enabled by emotion references and optional sampling from the latent distribution. Experimental results on MEAD and HDTF show superior quantitative and qualitative performance, including clearer textures and realistic blinks, demonstrating the method’s potential for realistic, controllable emotional talking-face synthesis and practical deployment in avatars and virtual humans.

Abstract

Generating emotional talking faces is a practical yet challenging endeavor. To create a lifelike avatar, we draw upon two critical insights from a human perspective: 1) The connection between audio and the non-deterministic facial dynamics, encompassing expressions, blinks, and poses, should exhibit a synchronous, one-to-many mapping. 2) Vibrant expressions are often accompanied by emotion-aware high-definition (HD) textures and finely detailed teeth. However, both aspects are frequently overlooked by existing methods. To this end, this paper proposes using normalizing Flow and Vector-Quantization modeling to produce emotional talking faces that satisfy both insights concurrently (FlowVQTalker). Specifically, we develop a flow-based coefficient generator that encodes the dynamics of facial emotion into a multi-emotion-class latent space represented as a mixture distribution. The generation process commences with random sampling from the modeled distribution, guided by the accompanying audio, enabling both lip synchronization and the generation of uncertain nonverbal facial cues. Furthermore, our vector-quantization image generator treats the creation of expressive facial images as a code query task, utilizing a learned codebook to provide rich, high-quality textures that enhance the emotional perception of the results. Extensive experiments are conducted to showcase the effectiveness of our approach.
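
The "code query task" mentioned in the abstract can be illustrated with a generic nearest-neighbor codebook lookup. The following is a minimal sketch and not the authors' implementation: the function name `query_codebook`, the toy codebook, and the Euclidean distance metric are all assumptions made for illustration.

```python
import numpy as np

def query_codebook(features, codebook):
    """Return (indices, quantized) for the nearest codebook entry of each feature.

    features: array of shape (N, D); codebook: array of shape (K, D).
    Nearest is measured by squared Euclidean distance (an assumption here).
    """
    # Expand ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2 to avoid an (N, K, D) tensor.
    d2 = (
        np.sum(features**2, axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + np.sum(codebook**2, axis=1)
    )
    indices = np.argmin(d2, axis=1)
    return indices, codebook[indices]

# Hypothetical toy codebook: 16 codes of dimension 4 (a learned codebook would replace this).
codebook = np.arange(64, dtype=float).reshape(16, 4)

# Features that sit close to codes 3, 7 and 7, with a small offset added.
features = codebook[[3, 7, 7]] + np.array([0.3, -0.2, 0.1, 0.0])

indices, quantized = query_codebook(features, codebook)
```

In this sketch each continuous feature vector is replaced by its closest stored code, so the decoder only ever sees entries that exist in the codebook.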

Paper Structure (15 sections, 18 equations, 7 figures, 2 tables)

This paper contains 15 sections, 18 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Example animations produced by FlowVQTalker. Given a source image and a driving audio clip, FlowVQTalker creates a talking-face video with diverse, synchronized facial dynamics, emotion-aware HD textures, and clear, fine-grained teeth.
  • Figure 2: Overview of the proposed FlowVQTalker. (a) Main pipeline. Given a source image $I_0$, audio $a$, and emotion label $e$, the Flow-based Coefficient Generator (Sec. 3.2), which consists of ExpFlow and PoseFlow, generates synchronized emotional coefficients $(\hat{\beta}_e, \hat{\rho}_e)$. ExpFlow also supports emotion transfer by taking an additional emotion reference $I^r_e$ as input. Subsequently, the Vector-Quantized Image Generator (Sec. 3.3) takes the generated coefficients $(\hat{\beta}_e, \hat{\rho}_e)$ and the source image $I_0$ as input and produces emotional talking face frames $\hat{I}_e$. (b) The framework of ExpFlow and its training/inference workflow. ExpFlow is composed of $K$ flow steps, each containing three components: multi-actnorm, invertible convolution, and an affine coupling layer. By the invertibility property of normalizing flows (rezende2015variational), ExpFlow is bijective. For clarity, the training (left) and inference (right) workflows are shown separately, but they run on the same network structure. During training, the ground-truth emotional coefficient $\beta_e$ is passed through the forward process (left) of ExpFlow and converted into a latent code $z$, which is then fitted to the Student's $t$ mixture model (SMM) latent space using Maximum Likelihood Estimation (MLE). During inference, we sample $\hat{z}$ from the distribution corresponding to emotion $e$ and feed it into the reverse process (right) of ExpFlow to generate $\hat{\beta}_e$. In addition, as indicated by the dotted arrows, $\hat{z}$ can also be obtained by passing the emotion reference $\beta^r_e$ through the forward process of ExpFlow, which enables optional emotion transfer. Both training and inference are conditioned on contextual information, including the source coefficient $\beta_0$, driving audio $a$, previous coefficients $\beta^{pre}_e = \beta^{t-\tau:t-1}_e$, and emotion label $e$, where $t$ is the current frame index and $\tau$ is the number of previous frames. Note that an audio encoder and an emotion encoder extract the corresponding features from $a$ and $e$ before the concatenation operator; their detailed structures are given in the supplementary material.
  • Figure 3: The structure of the Vector-Quantized Image Generator (VQIG). (a) Codebook. Initially, we train a high-quality image encoder $E_h$, codebook $\mathcal{C}$, and image decoder $D_h$ using a self-reconstruction strategy. Once they have converged, we freeze the parameters of $E_h$, $\mathcal{C}$, and $D_h$. (b) VQIG. $I_0$ is encoded as a discrete representation $z_c$ to capture identity and texture information. The coefficients $(\hat{\beta}_e, \hat{\rho}_e)$ generated in Sec. 3.2 are mapped to $\sigma$ by $\psi$, and $\sigma$ is then used to warp $I_0$ into $I_w$. An additional warped-image encoder $E_w$ encodes $I_w$ as $z_w$. We employ $\varphi$ to derive $z_f$ from $z_w$ and $z_c$. Then $z_f$ and $z_c$ are fed into the multi-head cross-attention module (MHCA) and a transformer $T$. The generated vectors retrieve the best-matching features from the codebook $\mathcal{C}$. The final result is rendered through $D_h$. Note that we introduce AdaIN at both the feature and image levels to incorporate additional spatial information.
  • Figure 4: Qualitative comparisons with state-of-the-art methods. See full comparison in supplementary material (Suppl).
  • Figure 5: Comparison with SOTAs in terms of generated poses.
  • ...and 2 more figures
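
The bijective forward/reverse behavior described in the Figure 2 caption can be sketched with a single conditional affine coupling layer. This is a toy illustration under stated assumptions, not the paper's ExpFlow: the class `AffineCoupling`, the one-layer linear conditioning maps, the dimensions, the class mean `mu_e`, and the use of independent per-dimension Student's $t$ noise as a stand-in for one mixture component are all hypothetical. Multi-actnorm and the invertible convolution are omitted.

```python
import numpy as np

class AffineCoupling:
    """Toy conditional affine coupling layer (illustrative only)."""

    def __init__(self, dim, ctx_dim, rng):
        self.half = dim // 2
        in_dim = self.half + ctx_dim
        out_dim = dim - self.half
        # Hypothetical stand-in for a learned conditioning network: one linear map each.
        self.w_scale = 0.1 * rng.normal(size=(in_dim, out_dim))
        self.w_shift = 0.1 * rng.normal(size=(in_dim, out_dim))

    def _params(self, x_a, ctx):
        # Scale and shift depend only on the untouched half and the context.
        h = np.concatenate([x_a, ctx], axis=-1)
        log_s = np.tanh(h @ self.w_scale)  # bounded log-scale for stability
        t = h @ self.w_shift
        return log_s, t

    def forward(self, x, ctx):
        """Data -> latent. Returns the latent code and log|det Jacobian|."""
        x_a, x_b = x[..., :self.half], x[..., self.half:]
        log_s, t = self._params(x_a, ctx)
        z_b = x_b * np.exp(log_s) + t
        return np.concatenate([x_a, z_b], axis=-1), log_s.sum(axis=-1)

    def inverse(self, z, ctx):
        """Latent -> data. Exactly undoes forward() for the same context."""
        z_a, z_b = z[..., :self.half], z[..., self.half:]
        log_s, t = self._params(z_a, ctx)
        x_b = (z_b - t) * np.exp(-log_s)
        return np.concatenate([z_a, x_b], axis=-1)

rng = np.random.default_rng(0)
layer = AffineCoupling(dim=6, ctx_dim=3, rng=rng)

# Training direction: coefficients -> latent code (log-det would enter the MLE loss).
beta = rng.normal(size=(2, 6))   # stand-in for expression coefficients
ctx = rng.normal(size=(2, 3))    # stand-in for audio/emotion/history context
z, logdet = layer.forward(beta, ctx)
max_err = np.max(np.abs(layer.inverse(z, ctx) - beta))

# Inference direction: sample a latent near a hypothetical class mean, then invert.
mu_e = np.full(6, 0.5)                              # assumed mean for emotion e
z_hat = mu_e + rng.standard_t(df=5, size=(2, 6))    # heavy-tailed noise (simplified)
beta_hat = layer.inverse(z_hat, ctx)
```

Because the scale and shift are computed from the half of the input that passes through unchanged, the same parameters can be recomputed in the reverse direction, which is what makes the layer invertible without inverting the conditioning network itself.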