The impact of differences in facial features between real speakers and 3D face models on synthesized lip motions

Rabab Algadhy, Yoshihiko Gotoh, Steve Maddock

TL;DR

The paper addresses how mismatches between real facial features and 3D morphable representations influence lip-synchronization in 3D facial animation. It builds per-speaker 3DMMs from synthetic head poses, maps 2D lip-motion data to 3D landmarks, and evaluates effects using both quantitative RMSE metrics and qualitative human judgments. Key findings show that mismatches in mouth height (index 7) substantially degrade lip motion, while mouth width (index 10) yields more variable results depending on lip thickness and facial geometry; these insights guide when to pair real actors with non-corresponding 3D faces. The study provides practical guidelines for animation and training systems, highlighting the need to consider facial-feature alignment when driving 3D characters from 2D lip recordings and suggesting future work to expand analyses to additional facial proportions.

Abstract

Lip motion accuracy is important for speech intelligibility, especially for users who are hard of hearing or second-language learners. A high level of realism in lip movements is also required by the game and film production industries. 3D morphable models (3DMMs) have been widely used for facial analysis and animation. However, factors that could influence their use in facial animation, such as differences in facial features between recorded real faces and animated synthetic faces, have not been given adequate attention. This paper investigates the mapping between real speakers and both similar and non-similar 3DMMs, and its impact on the resulting 3D lip motion. Mouth height and mouth width are used to determine face similarity. The results show that mapping 2D videos of real speakers with low mouth heights to 3D heads that correspond to real speakers with high mouth heights, or vice versa, generates poorer 3D lip motion. It is thus important that such a mismatch is considered when using a 2D recording of a real actor's lip movements to control a 3D synthetic character.
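
The summary above mentions RMSE as the quantitative measure. As a minimal sketch, assuming the comparison is made frame by frame between a mouth-height trajectory from the real speaker and the trajectory of a 3D head driven by it, the metric could be computed as follows. The function name `rmse`, the variable names, and the sample values are hypothetical and are not taken from the paper, whose exact formulation is not shown in this section.

```python
import math


def rmse(reference, synthesized):
    """Root-mean-square error between two equal-length trajectories."""
    if not reference or len(reference) != len(synthesized):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [(r - s) ** 2 for r, s in zip(reference, synthesized)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up per-frame mouth heights (arbitrary units), for illustration only.
real_speaker = [1.0, 3.0, 5.0, 3.0]
head_a = [1.0, 2.0, 5.0, 4.0]  # hypothetical head that tracks the speaker closely
head_b = [3.0, 5.0, 7.0, 5.0]  # hypothetical head with a constant offset of 2.0

error_a = rmse(real_speaker, head_a)
error_b = rmse(real_speaker, head_b)
print(f"head_a RMSE: {error_a:.3f}")
print(f"head_b RMSE: {error_b:.3f}")
```

A lower value means the synthesized trajectory stays closer to the recorded one. In this toy data the head with the constant offset scores worse.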

Paper Structure (15 sections, 3 equations, 17 figures, 7 tables)

Figures (17)

  • Figure 1: Face landmarks (left) and measurements used for each video frame (right).
  • Figure 2: Classification of indices for each speaker of the audio-visual Lombard grid speech corpus (80% confidence level for each class of each index). The x-axis shows the speaker ID (where M refers to a male speaker and F refers to a female speaker) and the y-axis shows the index number. The number of speakers in each class of the mouth height and mouth width indices is shown in the relevant circles at the top of the figure.
  • Figure 3: An example of the mapping process between 2D video frames of a real speaker (ID: S17) who is classified under the high class of index 7, the corresponding 3D head, and the non-corresponding 3D heads.
  • Figure 4: Consecutive frames of the phoneme /ih/ during utterance of the word "in" from sentence "bin white in O seven now" for a real speaker (ID: S47) who is classified under the low class of index 7, the corresponding 3D head, the non-corresponding low, the non-corresponding middle and the non-corresponding high 3D heads.
  • Figure 5: Width (upper) and height (lower) of mouth trajectories of 2D frames of the real speaker (ID: S31) classified under the low class of index 7, the corresponding 3D head, the non-corresponding middle 3D head (ID: S32) and the non-corresponding high 3D head (ID: S19), whilst uttering the sentence "set white at D zero please".
  • ...and 12 more figures