Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation

Se Jin Park, Minsu Kim, Jeongsoo Choi, Yong Man Ro

TL;DR

The paper tackles realistic lip-sync in talking-face generation by modeling phonetic context to account for co-articulation. It introduces Context-Aware Lip-Sync (CALS), a two-module framework: Audio-to-Lip (A2L), pretrained with masked learning, maps each phone to a contextualized lip motion unit, and Lip-to-Face (L2F) synthesizes the target identity guided by those units. A discriminative sync loss combining audio-visual alignment (d_av) and visual discriminative consistency (d_vv) enforces synchronized and distinctive lip motion. Experiments on LRW, LRS2, and HDTF show CALS improving lip-sync and visual quality over prior methods, with an effective context window of about 1.2 seconds. The approach yields temporally stable, context-aware lip movements across identities.

Abstract

Talking face generation is the challenging task of synthesizing a natural and realistic face that is accurately synchronized with a given audio. Due to co-articulation, where a phone is influenced by the preceding or following phones, the articulation of a phone varies with the phonetic context. Therefore, modeling lip motion with the phonetic context can generate more spatio-temporally aligned lip movement. In this respect, we investigate the phonetic context in generating lip motion for talking face generation. We propose the Context-Aware Lip-Sync framework (CALS), which explicitly leverages phonetic context to generate lip movement of the target face. CALS comprises an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained with masked learning to map each phone to a contextualized lip motion unit. The contextualized lip motion unit then guides the latter in synthesizing a target identity with context-aware lip motion. From extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also demonstrate the extent to which the phonetic context assists in lip synchronization and find the effective window size for lip generation to be approximately 1.2 seconds.
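
The masked-learning pretraining mentioned above can be illustrated with a minimal sketch of phone-level masking: whole phone segments of the audio sequence are hidden, so a model has to infer the corresponding lip motion from the surrounding phones. The function name, the (T, D) feature layout, the zero-fill masking, and the 30% mask ratio are all assumptions made for illustration; the abstract does not specify these details, and the authors' implementation may differ (for example, by using a learned mask token).

```python
import numpy as np

def mask_phone_spans(audio_feats, phone_spans, mask_ratio=0.3, rng=None):
    """Hide whole phone segments of a frame-level audio feature sequence.

    audio_feats : (T, D) array of per-frame audio features (assumed layout).
    phone_spans : non-empty list of (start, end) frame indices, end exclusive.
    mask_ratio  : fraction of phones to mask (assumed value, not from the paper).
    Returns a masked copy and a boolean (T,) array marking the masked frames.
    """
    if not phone_spans:
        raise ValueError("phone_spans must contain at least one span")
    rng = np.random.default_rng() if rng is None else rng

    # Always mask at least one phone so there is something to predict.
    n_phones = len(phone_spans)
    n_mask = max(1, int(round(mask_ratio * n_phones)))
    chosen = rng.choice(n_phones, size=n_mask, replace=False)

    masked = audio_feats.copy()
    frame_mask = np.zeros(audio_feats.shape[0], dtype=bool)
    for idx in chosen:
        start, end = phone_spans[idx]
        masked[start:end] = 0.0        # zero-fill is an assumption
        frame_mask[start:end] = True   # frames whose lip motion must be predicted
    return masked, frame_mask
```

Masking at the phone level rather than at the level of individual frames means the hidden region cannot be filled in by interpolating neighboring frames of the same phone, which is one plausible reason to prefer it when the goal is to learn phonetic context.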

Paper Structure

This paper contains 15 sections, 7 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Audio-to-Lip module takes phone-level masked audio units as input and aims to predict the corresponding lip motion units of the masked regions.
  • Figure 2: Generation of frames with corresponding audio time-steps masked out. Please zoom in to see in detail.
  • Figure 3: LMD of the middle frame with varying audio window size on (a) LRW and (b) LRS2.
  • Figure 4: Qualitative comparison with state-of-the-art methods on HDTF: (a) 'com' in 'combating', (b) 'asked', (c) 'a' in 'all'. The phonemes corresponding to each frame are written underneath.
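
Figure 3 reports LMD (landmark distance) of the middle frame as the audio window grows. As a rough sketch of how such a measurement could be set up, the code below computes LMD as the mean Euclidean distance between matching mouth landmarks and selects a window of frames centered on the middle frame. The function names, the (K, 2) landmark layout, and the clipping behavior are assumptions for illustration; the exact landmark set and protocol used in the paper are not given here.

```python
import numpy as np

def landmark_distance(pred, target):
    """Mean Euclidean distance between matching landmarks.

    pred, target : arrays of shape (..., K, 2) holding (x, y) coordinates
                   of K mouth landmarks (assumed layout).
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    # Per-landmark distance over the coordinate axis, then averaged.
    return float(np.linalg.norm(pred - target, axis=-1).mean())

def centered_window(num_frames, window):
    """Return (start, end) frame indices, end exclusive, of a window
    centered on the middle frame and clipped to the clip boundaries."""
    mid = num_frames // 2
    half = window // 2
    start = max(0, mid - half)
    end = min(num_frames, mid + half + 1)
    return start, end

# Usage: landmarks offset by (3, 4) are 5 units away from the target.
pred = np.zeros((2, 3, 2))
target = np.zeros((2, 3, 2)) + np.array([3.0, 4.0])
print(landmark_distance(pred, target))   # 5.0
print(centered_window(29, 5))            # (12, 17)
```

To relate window sizes in frames to the reported 1.2-second window, note that at an assumed video rate of 25 fps, 1.2 seconds corresponds to 30 frames.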