The impact of differences in facial features between real speakers and 3D face models on synthesized lip motions
Rabab Algadhy, Yoshihiko Gotoh, Steve Maddock
TL;DR
The paper examines how mismatches between a real speaker's facial features and a 3D morphable face model affect lip synchronization in 3D facial animation. It builds per-speaker 3DMMs from synthetic head poses, maps 2D lip-motion data to 3D landmarks, and evaluates the results with both quantitative RMSE metrics and qualitative human judgments. Mismatches in mouth height (index 7) substantially degrade lip motion, while mismatches in mouth width (index 10) yield more variable results that depend on lip thickness and facial geometry. These findings indicate when a real actor can be paired with a non-corresponding 3D face. The study offers practical guidelines for animation and training systems, highlights the need to consider facial-feature alignment when driving 3D characters from 2D lip recordings, and suggests future work that extends the analysis to additional facial proportions.
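The summary above mentions RMSE as the quantitative measure but does not give its exact form. As a minimal sketch, assuming the error is the root mean square of per-landmark Euclidean distances pooled over all frames, it could be computed as follows. The function name `lip_rmse` and the input layout (a list of frames, each a list of `(x, y, z)` tuples) are hypothetical and not taken from the paper.

```python
import math


def lip_rmse(predicted, reference):
    """Root-mean-square error between two 3D lip-landmark sequences.

    Assumed layout (not from the paper): each sequence is a list of frames,
    and each frame is a list of (x, y, z) landmark tuples.
    """
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same number of frames")

    squared_sum = 0.0  # running sum of squared Euclidean distances
    count = 0          # number of landmark pairs compared

    for frame_p, frame_r in zip(predicted, reference):
        if len(frame_p) != len(frame_r):
            raise ValueError("frames must have the same number of landmarks")
        for p, r in zip(frame_p, frame_r):
            squared_sum += sum((a - b) ** 2 for a, b in zip(p, r))
            count += 1

    if count == 0:
        raise ValueError("no landmarks to compare")

    return math.sqrt(squared_sum / count)
```

Under this pooling assumption, a lower value means the synthesized lip landmarks stay closer to the reference trajectory. A per-frame or per-landmark breakdown would need a different aggregation.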
Abstract
Lip motion accuracy is important for speech intelligibility, especially for users who are hard of hearing or who are second-language learners. A high level of realism in lip movements is also required in the game and film production industries. 3D morphable models (3DMMs) have been widely used for facial analysis and animation. However, factors that could influence their use in facial animation, such as differences in facial features between recorded real faces and animated synthetic faces, have not received adequate attention. This paper investigates the mapping between real speakers and both similar and dissimilar 3DMMs, and its impact on the resulting 3D lip motion. Mouth height and mouth width are used to determine face similarity. The results show that mapping 2D videos of real speakers with low mouth heights to 3D heads that correspond to real speakers with high mouth heights, or vice versa, produces poorer 3D lip motion. It is therefore important to account for such a mismatch when using a 2D recording of a real actor's lip movements to control a 3D synthetic character.
