FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization
Shuai Tan, Bin Ji, Ye Pan
TL;DR
FlowVQTalker addresses the challenge of producing emotionally expressive talking faces that are both lip-synchronous and texture-rich. It introduces a flow-based coefficient generator that maps 3D Morphable Model coefficients $(\beta, \rho)$ to a latent space modeled as a Student's $t$ mixture, enabling diverse nonverbal dynamics guided by audio and emotion context, while a Vector-Quantized Image Generator relies on a texture codebook to render high-definition expressions and teeth. The two modules are tightly integrated via the 3DMM representation, with emotion transfer enabled by emotion references and optional sampling from the latent distribution. Experimental results on MEAD and HDTF show superior quantitative and qualitative performance, including clearer textures and realistic blinks, demonstrating the method’s potential for realistic, controllable emotional talking-face synthesis and practical deployment in avatars and virtual humans.
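As a rough illustration of the optional sampling step, the sketch below draws a latent vector from one component of a Student's $t$ mixture, with one component per emotion class. It is a minimal sketch under stated assumptions, not the authors' implementation: the component means, the latent dimension, the degrees of freedom, and the names `sample_student_t` and `sample_latent` are all hypothetical. The audio-conditioned flow that would map the latent back to 3DMM coefficients is not shown.

```python
import math
import random


def sample_student_t(mean, scale, dof, rng):
    """Draw one sample from an isotropic multivariate Student's t."""
    # Gaussian scale mixture: x = mean + scale * z / sqrt(g / dof),
    # where z ~ N(0, I) and g ~ chi-square(dof) is shared across dimensions.
    g = rng.gammavariate(dof / 2.0, 2.0)  # chi-square(dof) as Gamma(dof/2, scale=2)
    shrink = math.sqrt(g / dof)
    return [m + scale * rng.gauss(0.0, 1.0) / shrink for m in mean]


def sample_latent(emotion, components, rng, dof=50.0, scale=1.0):
    """Pick the mixture component for `emotion` and sample a latent from it."""
    mean = components[emotion]
    return sample_student_t(mean, scale, dof, rng)


# Hypothetical 4-dimensional component means, one per emotion class.
components = {
    "neutral": [0.0, 0.0, 0.0, 0.0],
    "happy": [2.0, 0.0, -1.0, 0.5],
    "angry": [-2.0, 1.0, 0.0, -0.5],
}

rng = random.Random(0)
z = sample_latent("happy", components, rng)
print(z)  # a different latent on every call: one audio clip, many plausible motions
```

Repeated calls with the same emotion label return different latents clustered around that class's mean, which is the one-to-many behavior the latent distribution is meant to provide.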
Abstract
Generating emotional talking faces is a practical yet challenging endeavor. To create a lifelike avatar, we draw upon two critical insights from a human perspective: 1) The mapping from audio to non-deterministic facial dynamics, including expressions, blinks, and poses, should be both synchronous and one-to-many. 2) Vibrant expressions are often accompanied by emotion-aware high-definition (HD) textures and finely detailed teeth. However, both aspects are frequently overlooked by existing methods. To this end, this paper proposes FlowVQTalker, which uses normalizing Flow and Vector-Quantization modeling to produce emotional talking faces that satisfy both insights concurrently. Specifically, we develop a flow-based coefficient generator that encodes the dynamics of facial emotion into a multi-emotion-class latent space represented as a mixture distribution. Generation begins with random sampling from the modeled distribution, guided by the accompanying audio, which enables both lip synchronization and the generation of non-deterministic nonverbal facial cues. Furthermore, our vector-quantization image generator treats the creation of expressive facial images as a code query task, using a learned codebook to provide rich, high-quality textures that enhance the emotional perception of the results. Extensive experiments demonstrate the effectiveness of our approach.
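The "code query" idea can be illustrated with the standard nearest-neighbor lookup used in vector quantization: each feature vector is replaced by its closest codebook entry. This is a minimal sketch assuming a tiny hand-written codebook. The function names, the 2-dimensional features, and the codebook values are hypothetical, and the actual generator presumably operates on learned, much higher-dimensional feature maps.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Replace each feature vector with its nearest codebook entry.

    Returns the chosen code indices and the corresponding codebook vectors.
    """
    indices = []
    quantized = []
    for f in features:
        # Query the codebook: pick the entry closest to this feature.
        idx = min(range(len(codebook)),
                  key=lambda k: squared_distance(f, codebook[k]))
        indices.append(idx)
        quantized.append(codebook[idx])
    return indices, quantized


# Hypothetical codebook of three 2-D texture codes and three query features.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
features = [[0.9, 1.2], [0.1, -0.2], [-0.8, 1.7]]

indices, quantized = quantize(features, codebook)
print(indices)    # [1, 0, 2]
print(quantized)  # [[1.0, 1.0], [0.0, 0.0], [-1.0, 2.0]]
```

Because the output is always assembled from codebook entries, its texture detail is bounded by what the codebook has stored rather than by what a regressor can reproduce from scratch.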
