OpFlowTalker: Realistic and Natural Talking Face Generation via Optical Flow Guidance

Shuheng Ge, Haoyu Xing, Li Zhang, Xiangqian Wu

TL;DR

OpFlowTalker advances talking-face generation by shifting from direct image prediction to predicting inter-frame optical flow driven by audio, thereby improving temporal coherence and lip-readability. The framework introduces Facial Sequential Generation via Optical Flow (FSG) with audio augmentation, an Optical Flow Predictor, Sequential Fusion, and Facial Reconstruction, along with an Optical Flow Synchronization Module (OFSM) that enforces both frame-to-frame continuity and audio-lip alignment through dedicated losses and a lip-specific flow constraint. A Visual Text Consistency Score (VTCS) is proposed to quantify lip-reading intelligibility of synthesized videos. Empirical results on LRS2 and HDTF show state-of-the-art performance across multiple metrics, and ablations validate the contribution of each component, highlighting improved generalization and realism. The work offers practical implications for realistic avatar synthesis in VR, film, and education, while noting limitations related to resolution and expressive range that warrant future work.

Abstract

Creating realistic, natural, and lip-readable talking face videos remains a formidable challenge. Previous research primarily concentrated on generating and aligning single-frame images while overlooking the smoothness of frame-to-frame transitions and temporal dependencies. This often compromised visual quality and effects in practical settings, particularly when handling complex facial data and audio content, which frequently led to semantically incongruent visual illusions. Specifically, synthesized videos commonly featured disorganized lip movements, making them difficult to understand and recognize. To overcome these limitations, this paper introduces the application of optical flow to guide facial image generation, enhancing inter-frame continuity and semantic consistency. We propose "OpFlowTalker", a novel approach that utilizes predicted optical flow changes from audio inputs rather than direct image predictions. This method smooths image transitions and aligns changes with semantic content. Moreover, it employs a sequence fusion technique to replace the independent generation of single frames, thus preserving contextual information and maintaining temporal coherence. We also developed an optical flow synchronization module that regulates both full-face and lip movements, optimizing visual synthesis by balancing regional dynamics. Furthermore, we introduce a Visual Text Consistency Score (VTCS) that accurately measures lip-readability in synthesized videos. Extensive empirical evidence validates the effectiveness of our approach.
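
The core idea of predicting inter-frame optical flow, rather than pixels directly, can be illustrated with a generic warping step: given a frame and a dense flow field, the next frame is obtained by resampling the current one. The following is a minimal NumPy sketch of backward warping with bilinear sampling. It is not the authors' implementation; the function name `warp_with_flow`, the (dx, dy) channel order, and the border clamping are assumptions made for illustration only.

```python
import numpy as np

def warp_with_flow(frame, flow):
    """Backward-warp `frame` with a dense flow field (hypothetical helper).

    frame: array of shape (H, W) or (H, W, C).
    flow:  array of shape (H, W, 2); flow[y, x] = (dx, dy) means that
           output pixel (x, y) is sampled from (x + dx, y + dy) in `frame`.
    Sampling is bilinear; source coordinates are clamped to the image border.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)

    # Source coordinates for every output pixel, clamped to valid range.
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)

    # Integer corners surrounding each source coordinate.
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)

    # Fractional offsets used as bilinear weights.
    wx = src_x - x0
    wy = src_y - y0
    if frame.ndim == 3:
        wx = wx[..., None]
        wy = wy[..., None]

    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bottom = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


# Toy usage: sample every pixel from its right-hand neighbour.
frame = np.arange(16, dtype=np.float64).reshape(4, 4)
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0
warped = warp_with_flow(frame, flow)
```

Because every output pixel is a smooth function of the flow, small changes in the predicted flow produce small changes in the image, which is one intuition for why flow-guided generation tends to give smoother frame-to-frame transitions than predicting each frame independently.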

Paper Structure (26 sections, 11 equations, 6 figures, 7 tables)

This paper contains 26 sections, 11 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The example shows faces, lip shapes, and lip optical flow for different scenarios: different individuals producing the same vocalization and the same individual making different vocalizations. On the left, there is a visualization of the optical flow space, where different colors represent different directions of optical flow, and the depth of color indicates the intensity of the flow.
  • Figure 2: Illustration of our proposed OpFlowTalker. (a) The OpFlowTalker framework, which generates talking faces from a consistent audio input $a$. OpFlowTalker integrates four components to enhance the quality of the generated videos: Audio Augmentation, Optical Flow Predictor, Sequential Fusion, and Optical Flow Synchronization Module. (b) Sequential Fusion, which predicts each frame from all preceding reference information. Fusion1 consists of two linear layers and Fusion2 is a 6-layer Transformer. (c) Optical Flow Synchronization Module, which aggregates global motion information and calculates the synchronization loss of facial and lip optical flow separately. $\alpha$ is generally set to 0.1.
  • Figure 3: We compare our method with several state-of-the-art methods for audio-driven talking face generation. Different colors represent different syllables, corresponding to each image.
  • Figure 4: A comparison of our method with SyncTalk, DreamReTalking, and Wav2Lip in generating high-definition videos (256×256) on the HDTF dataset. SR stands for Super-Resolution.
  • Figure 5: More comparisons of our method with several state-of-the-art methods for audio-driven talking face generation. Different colors represent different syllables, corresponding to each image.
  • ...and 1 more figure
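
The Figure 1 caption describes a colour-coded flow visualization in which colour encodes flow direction and colour depth encodes flow intensity. A common way to realize this kind of encoding is to map the flow angle to hue and the flow magnitude to saturation in HSV space. The sketch below follows that convention; it is an assumption about the general technique, not a reproduction of the colour wheel used in the paper, and the name `flow_to_rgb` and its `max_magnitude` parameter are hypothetical.

```python
import numpy as np

def flow_to_rgb(flow, max_magnitude=None):
    """Map a dense flow field (H, W, 2) to an RGB image in [0, 1].

    Hue encodes direction, saturation encodes magnitude, value is fixed at 1,
    so zero motion appears white and stronger motion appears more vivid.
    """
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)
    angle = np.arctan2(dy, dx)                      # in [-pi, pi]
    hue = ((angle + np.pi) / (2 * np.pi)) % 1.0     # in [0, 1)

    if max_magnitude is None:
        max_magnitude = magnitude.max()
    sat = np.clip(magnitude / max(max_magnitude, 1e-8), 0.0, 1.0)
    val = np.ones_like(sat)

    # Standard HSV -> RGB conversion, vectorized over all pixels.
    h6 = hue * 6.0
    sector = np.floor(h6).astype(int) % 6
    f = h6 - np.floor(h6)
    p = val * (1 - sat)
    q = val * (1 - f * sat)
    t = val * (1 - (1 - f) * sat)

    r = np.choose(sector, [val, q, p, p, t, val])
    g = np.choose(sector, [t, val, val, q, p, p])
    b = np.choose(sector, [p, p, t, val, val, q])
    return np.stack([r, g, b], axis=-1)


# Toy usage: uniform rightward motion of one pixel per frame.
flow = np.zeros((2, 2, 2))
flow[..., 0] = 1.0
image = flow_to_rgb(flow)
```

Passing a fixed `max_magnitude` keeps colours comparable across frames, which matters when inspecting a whole sequence rather than a single flow field.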