Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

Lee Chae-Yeon, Oh Hyun-Bin, Han EunGi, Kim Sung-Bin, Suekyeong Nam, Tae-Hyun Oh

TL;DR

The paper addresses perceptual lip synchronization in 3D talking head generation by defining three criteria (Temporal Synchronization, Lip Readability, and Expressiveness) and proposing a speech-mesh synchronized representation learned via a two-stage training pipeline. The first stage builds a robust 2D audio-visual speech space; the second aligns 3D meshes to that space using contrastive learning, which enables a plug-in perceptual loss for existing models. Three new evaluation metrics (MTM, PLRS, SLCC) quantify temporal accuracy, perceptual lip readability, and expressive alignment, respectively, and large-scale datasets (LRS3-3D, MEAD-3D) cover diverse speech and motion ranges. Empirical results show consistent improvements across the metrics, and human studies validate the perceptual gains, with MEAD-3D contributing expressiveness when balanced with the perceptual loss.
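The second-stage alignment can be illustrated with a small sketch. It assumes an InfoNCE-style contrastive objective, which is a common choice but is not confirmed by this summary as the paper's exact loss; the function names, the temperature value, and the toy embeddings are all hypothetical. The speech embeddings stand in for outputs of the frozen speech encoder, and the mesh embeddings for outputs of the trainable 3D mesh encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(mesh_embs, speech_embs, temperature=0.07):
    """Mesh-to-speech contrastive loss (an assumed InfoNCE form).

    mesh_embs[i] and speech_embs[i] are a matching pair; every other
    speech embedding in the batch serves as a negative for mesh_embs[i].
    """
    mesh = [l2_normalize(m) for m in mesh_embs]
    speech = [l2_normalize(s) for s in speech_embs]
    loss = 0.0
    for i, m in enumerate(mesh):
        logits = [dot(m, s) / temperature for s in speech]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_denom = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Cross-entropy with the paired speech embedding as the target.
        loss += log_denom - logits[i]
    return loss / len(mesh)


# Toy batch: two frozen speech embeddings and two candidate mesh batches.
speech = [[1.0, 0.0], [0.0, 1.0]]
aligned_mesh = [[1.0, 0.0], [0.0, 1.0]]
swapped_mesh = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce(aligned_mesh, speech)
swapped_loss = info_nce(swapped_mesh, speech)
```

In this toy batch the correctly paired mesh embeddings yield a much lower loss than the swapped ones, which is the signal a mesh encoder would follow during training while the speech space stays fixed.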

Abstract

Recent advancements in speech-driven 3D talking head generation have made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and corresponding lip movements. In this work, we claim that three criteria -- Temporal Synchronization, Lip Readability, and Expressiveness -- are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists to meet these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We found that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements to the given speech. In addition, we utilize this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well the generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization. Code and datasets are available at https://perceptual-3d-talking-head.github.io/.

Paper Structure

This paper contains 55 sections, 12 equations, 13 figures, 12 tables, 1 algorithm.

Figures (13)

  • Figure 1: Pipeline of speech-mesh synchronized representation learning. We train our speech-mesh representation space in two stages. In the first stage, we learn a rich audio-visual representation in the 2D domain to capture the synchronization between lip movement and speech. In the second stage, we train the 3D mesh encoder to align the 3D mesh space with the frozen speech space. As an application of our speech-mesh representation space, we propose a plug-in perceptual loss for 3D talking head models to enhance the quality of lip movements.
  • Figure 2: Qualitative results of the effectiveness of our perceptual loss for lip readability. Our perceptual loss guides the baselines (FaceFormer, CodeTalker, SelfTalk) to generate perceptually accurate lip movements.
  • Figure 3: t-SNE plot of the ablation study. We plot the t-SNE graph for each perceptual critic model. Features of the same phoneme are shown in the same color. Square and circular points denote mesh and speech features from each representation, respectively.
  • Figure 4: Behavior of our representation in terms of temporal and expressiveness sensitivity. We demonstrate the effectiveness of our representation for temporal synchronization and expressiveness using a cosine similarity graph and speech feature plots, respectively. Points are colored by low, medium, and high intensity.
  • Figure 5: Qualitative results for expressiveness. Given high and low intensity levels of speech, models trained on both MEAD-3D and VOCASET show more expressive lip movements than those trained on VOCASET alone, and the lip movements become more expressive still with our perceptual loss.
  • ...and 8 more figures
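
The plug-in perceptual loss mentioned in Figure 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the loss is one minus the cosine similarity between paired speech and mesh features, and that it is added to a baseline's reconstruction loss with a scalar weight. The function names, the weight value, and the toy feature vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def perceptual_loss(speech_feats, mesh_feats):
    """Mean (1 - cosine similarity) over paired feature windows.

    speech_feats[i] stands in for the frozen speech encoder's output and
    mesh_feats[i] for the mesh encoder's output on the generated mesh
    sequence covering the same time window (both assumed here).
    """
    terms = [
        1.0 - cosine_similarity(s, m)
        for s, m in zip(speech_feats, mesh_feats)
    ]
    return sum(terms) / len(terms)


def total_loss(recon_loss, speech_feats, mesh_feats, weight=0.1):
    """Baseline reconstruction loss plus the weighted perceptual term.

    The weight of 0.1 is a placeholder, not a value from the paper.
    """
    return recon_loss + weight * perceptual_loss(speech_feats, mesh_feats)


# Toy features: one window where the mesh matches the speech, one where
# it is orthogonal to it.
matched = perceptual_loss([[1.0, 0.0]], [[1.0, 0.0]])
mismatched = perceptual_loss([[1.0, 0.0]], [[0.0, 1.0]])
combined = total_loss(0.5, [[1.0, 0.0]], [[0.0, 1.0]], weight=0.1)
```

Because the term depends only on the two feature sets, it can be attached to an existing generator's training objective without changing the generator's architecture, which is the sense in which the summary calls the loss "plug-in".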