Audio-driven Talking Face Generation with Stabilized Synchronization Loss

Dogucan Yaman, Fevziye Irem Eyiokur, Leonard Bärmann, Hazim Kemal Ekenel, Alexander Waibel

TL;DR

This work tackles the challenges of lip synchronization and visual fidelity in audio-driven talking-face generation by identifying SyncNet instabilities and lip leaking from identity references. It introduces a silent-lip generator to curb leakage, a robust AVSyncNet for stable synchronization, and a stabilized synchronization loss $L_{ss}$ to provide reliable learning signals. An end-to-end pipeline with separate identity and pose encoders, a frozen audio encoder, and adaptive triplet loss yields state-of-the-art visual quality and lip synchronization on multiple benchmarks, validated by extensive ablations. The approach offers practical benefits for applications like dubbing and video conferencing while acknowledging limitations and ethical implications, including misuse risks that warrant safeguards such as watermarking.

Abstract

Talking face generation aims to create realistic videos with accurate lip synchronization and high visual quality, using given audio and reference video while preserving identity and visual characteristics. In this paper, we start by identifying several issues with existing synchronization learning methods. These include unstable training, degraded lip synchronization, and reduced visual quality, caused by the lip-sync loss, SyncNet, and lip leaking from the identity reference. To address these issues, we first tackle the lip leaking problem by introducing a silent-lip generator, which changes the lips of the identity reference to alleviate leakage. We then introduce a stabilized synchronization loss and AVSyncNet to overcome the problems caused by the lip-sync loss and SyncNet. Experiments show that our model outperforms state-of-the-art methods in both visual quality and lip synchronization. Comprehensive ablation studies further validate our individual contributions and their cohesive effects.
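For orientation, here is a minimal sketch of the conventional SyncNet-style lip-sync loss that the abstract identifies as a source of instability. It is not the paper's stabilized synchronization loss $L_{ss}$, whose definition does not appear in this section. The function names and the toy two-dimensional embeddings are hypothetical. The formulation is the negative log of a clamped cosine similarity, which is a common choice in prior work and is assumed here rather than taken from this paper.

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)


def lip_sync_loss(audio_emb, lip_emb, eps=1e-7):
    """Conventional lip-sync loss (assumed form): -log of the sync probability.

    The cosine similarity is treated as a probability and clamped to
    [eps, 1 - eps] so the logarithm stays finite.
    """
    p_sync = cosine_similarity(audio_emb, lip_emb)
    p_sync = min(max(p_sync, eps), 1.0 - eps)
    return -math.log(p_sync)


# Hypothetical embeddings: a well-aligned pair and a poorly scored pair.
aligned = lip_sync_loss([1.0, 0.0], [1.0, 0.0])
misaligned = lip_sync_loss([1.0, 0.0], [0.0, 1.0])
```

Under this form, any pair that the synchronization network scores with low similarity incurs a large loss, even if it is a ground-truth pair. If the network's scores on ground-truth pairs fluctuate, the generator receives a correspondingly noisy training signal.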

Paper Structure (25 sections, 5 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: (a, b) Cosine similarity between GT audio-lip pairs on random LRS2 samples, showcasing the instability of SyncNet and the more robust performance of AVSyncNet. (c) illustrates full mouth region / lip leaking from the reference, pose effect from the reference, and similar identity reference-target image scenarios.
  • Figure 2: Talking face generation model $G_L$ (a) and face-decoding (FD) block (b). Our model receives a pose reference sequence, the mel-spectrogram of an audio snippet, and a silent identity reference, which is generated by our silent-lip generator $G_S$ to alleviate the lip leaking problem. The model then synthesizes the talking face sequence to ensure lip synchronization. Subsequently, the employed loss functions are computed.
  • Figure 3: $G_S$ in inference (a) and AVSyncNet training pipeline (b).
  • Figure 4: Qualitative comparison with SOTA methods. Reference videos (from HDTF [zhang2021flow]) are randomly selected and not seen during training by our model. For more images and videos, see App. E and F and https://yamand16.github.io/TalkingFaceGeneration/.
  • Figure 5: Ablation studies of components (a), face restoration methods (b), and silent face generation (d). (c) demonstrates generated images in challenging cases.
  • ...and 1 more figure