Perceptually Accurate 3D Talking Head Generation: New Definitions, Speech-Mesh Representation, and Evaluation Metrics

Lee Chae-Yeon; Oh Hyun-Bin; Han EunGi; Kim Sung-Bin; Suekyeong Nam; Tae-Hyun Oh

TL;DR

The paper addresses perceptual lip synchronization in 3D talking head generation by defining three criteria (Temporal Synchronization, Lip Readability, and Expressiveness) and proposing a speech-mesh synchronized representation learned via a two-stage training pipeline. The first stage builds a robust 2D audio-visual speech space; the second aligns 3D meshes to that space using contrastive learning, which yields a plug-in perceptual loss for existing models. Three new evaluation metrics (MTM, PLRS, SLCC) quantify temporal accuracy, perceptual lip readability, and expressive alignment, and large-scale datasets (LRS3-3D, MEAD-3D) cover diverse speech and motion ranges. Empirical results show consistent improvements across metrics, and human studies validate the perceptual gains, with MEAD-3D contributing expressiveness when balanced with the perceptual loss.
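The contrastive alignment step mentioned above can be illustrated with a small sketch. The summary does not state which contrastive objective is used, so the symmetric InfoNCE loss below, the function names, and the temperature value are all assumptions chosen for illustration, not the paper's implementation. Each row of `speech_emb` is assumed to be paired with the same row of `mesh_emb`.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(speech_emb, mesh_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss over paired speech/mesh embeddings.

    Assumption: the objective and temperature are illustrative only.
    """
    s = [l2_normalize(v) for v in speech_emb]
    m = [l2_normalize(v) for v in mesh_emb]
    # Cosine-similarity logits: speech item i against every mesh item j.
    logits = [[dot(si, mj) / temperature for mj in m] for si in s]
    logits_t = [list(col) for col in zip(*logits)]  # mesh-to-speech direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In this toy case the loss is close to zero when each speech embedding matches its paired mesh embedding and large when the pairs are swapped, which is the behavior a contrastive alignment objective is meant to reward.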

Abstract

Recent advancements in speech-driven 3D talking head generation have made significant progress in lip synchronization. However, existing models still struggle to capture the perceptual alignment between varying speech characteristics and corresponding lip movements. In this work, we claim that three criteria -- Temporal Synchronization, Lip Readability, and Expressiveness -- are crucial for achieving perceptually accurate lip movements. Motivated by our hypothesis that a desirable representation space exists to meet these three criteria, we introduce a speech-mesh synchronized representation that captures intricate correspondences between speech signals and 3D face meshes. We found that our learned representation exhibits desirable characteristics, and we plug it into existing models as a perceptual loss to better align lip movements to the given speech. In addition, we utilize this representation as a perceptual metric and introduce two other physically grounded lip synchronization metrics to assess how well the generated 3D talking heads align with these three criteria. Experiments show that training 3D talking head generation models with our perceptual loss significantly improves all three aspects of perceptually accurate lip synchronization. Code and datasets are available at https://perceptual-3d-talking-head.github.io/.
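As a rough illustration of how a learned representation might be plugged into an existing model as a perceptual loss, the sketch below adds a weighted embedding-distance term to a standard vertex reconstruction loss. The abstract does not give the loss formula, so the cosine-distance form, the `weight` value, and all function names here are assumptions; the embeddings are assumed to come from frozen speech and mesh encoders that are not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def perceptual_loss(speech_emb, mesh_emb):
    """Hypothetical perceptual term: cosine distance in the shared space."""
    return 1.0 - cosine_similarity(speech_emb, mesh_emb)


def mse(pred, target):
    """Mean squared error over flattened vertex coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(pred_vertices, gt_vertices, speech_emb, pred_mesh_emb, weight=0.1):
    """Base reconstruction loss plus the weighted perceptual term.

    Assumption: `weight` and the additive combination are illustrative only.
    """
    base = mse(pred_vertices, gt_vertices)
    return base + weight * perceptual_loss(speech_emb, pred_mesh_emb)


loss = total_loss(
    pred_vertices=[0.0, 0.0],
    gt_vertices=[0.0, 0.0],
    speech_emb=[1.0, 0.0],
    pred_mesh_emb=[0.0, 1.0],
    weight=0.5,
)
print(loss)
```

The additive form keeps the base model's own objective untouched, which is one plausible reading of "plug-in": the perceptual term only contributes an extra gradient signal through the generated mesh.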
