Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro
TL;DR
The paper tackles realistic lip-sync in talking-face generation by modeling phonetic context to account for co-articulation. It introduces Context-Aware Lip-Sync (CALS), a two-module framework: the Audio-to-Lip (A2L) module maps each phone, taken in its phonetic context, to a lip motion unit via masked learning, and the Lip-to-Face (L2F) module synthesizes the target identity guided by those units. A discriminative sync loss combining audio-visual alignment (d_av) and visual discriminative consistency (d_vv) encourages synchronized and distinctive lip motion. In experiments on LRW, LRS2, and HDTF, CALS improves both lip-sync and visual quality, and the effective context window is found to be about 1.2 seconds. The approach yields temporally stable, phoneme-aware lip movements across identities.
Abstract
Talking face generation is the challenging task of synthesizing a natural and realistic face that is accurately synchronized with a given audio signal. Because of co-articulation, in which a phone is influenced by the preceding or following phones, the articulation of a phone varies with its phonetic context. Modeling lip motion together with the phonetic context can therefore produce lip movement that is better aligned spatio-temporally. We accordingly investigate the role of phonetic context in generating lip motion for talking face generation. We propose the Context-Aware Lip-Sync (CALS) framework, which explicitly leverages phonetic context to generate the lip movement of the target face. CALS comprises an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained with masked learning to map each phone to a contextualized lip motion unit. The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion. Through extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also quantify the extent to which the phonetic context assists lip synchronization and find the effective window size for lip generation to be approximately 1.2 seconds.
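To give a sense of scale for the 1.2-second window, the sketch below converts it into a number of video frames and into a range of frame indices around a target frame. It is only an illustration under stated assumptions, not the authors' implementation: the 25 fps frame rate, the window being centered on the target frame, and the function names `window_frames` and `context_indices` are all assumptions that do not come from the abstract.

```python
def window_frames(window_sec: float, fps: float) -> int:
    """Number of video frames spanned by a context window of `window_sec` seconds."""
    # round() guards against floating-point error in the product.
    return round(window_sec * fps)


def context_indices(center: int, num_frames: int,
                    window_sec: float = 1.2, fps: float = 25.0) -> range:
    """Frame indices of a window centered on `center`, clipped to the clip bounds.

    Assumption: the window is symmetric around the target frame. The abstract
    only states the window size, not how it is positioned.
    """
    half = window_frames(window_sec, fps) // 2
    start = max(0, center - half)
    stop = min(num_frames, center + half + 1)  # +1 so the center frame is included
    return range(start, stop)


if __name__ == "__main__":
    print(window_frames(1.2, 25.0))       # frames covered by 1.2 s at the assumed 25 fps
    print(context_indices(50, 100))       # window around frame 50 of a 100-frame clip
    print(context_indices(3, 100))        # window clipped at the start of the clip
```

Under these assumptions the window covers about 30 frames, that is, roughly 15 frames of context on each side of the frame being generated; near the start or end of a clip the range is simply truncated.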
