Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

Dogucan Yaman, Fevziye Irem Eyiokur, Hazım Kemal Ekenel, Alexander Waibel

TL;DR

The paper tackles lip leaking in inpainting-based talking face generation, where the lips in the identity reference image can undesirably bias the generated lip motion. It introduces a model-agnostic evaluation framework comprising three generation setups (Silent-Input, Audio-Matched, Audio-Mismatched), two identity-reference strategies (Current, Alternative), and four metrics (Silent LSE-C, Silent LSE-D, LSD-CR, LSD-AR) to quantify leakage and assess robustness. Experiments on the LRS2 dataset show that leakage patterns vary across methods and reference choices: some models are resilient under Alternative references while others remain vulnerable, and Audio-Mismatched (XM) analyses reveal leakage that is not evident under standard Audio-Matched (AM) metrics. The framework offers a practical benchmark and guidance for reference design to improve controllability and reliability in talking-face systems, and points to future evaluations of newer methods such as LatentSync, MuseTalk, and OmniSync.

Abstract

Inpainting-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leaking, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setups. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
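
As a rough illustration of how the three test setups and two identity-reference strategies named in the summary could be organized, the sketch below enumerates their combinations. The setup and strategy names come from the text; the function name `evaluation_grid` and the dictionary layout are hypothetical, and whether the paper evaluates every combination is not stated here.

```python
from itertools import product

# Names taken from the summary; the data layout is an assumption for illustration.
SETUPS = ("Silent-Input", "Audio-Matched", "Audio-Mismatched")
REFERENCES = ("Current", "Alternative")


def evaluation_grid(setups=SETUPS, references=REFERENCES):
    """Return every (setup, reference strategy) pair as a list of dicts."""
    return [{"setup": s, "reference": r} for s, r in product(setups, references)]


if __name__ == "__main__":
    # Print one line per candidate evaluation run.
    for run in evaluation_grid():
        print(f"{run['setup']:<17} | {run['reference']}")
```

Each entry would correspond to one generation run per model, to which the lip-sync metrics could then be applied.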

Paper Structure

This paper contains 8 sections, 2 equations, 1 figure, 6 tables.

Figures (1)

  • Figure 1: In the standard talking face generation pipeline, the model receives a face sequence with the lower half masked, along with an identity reference image to guide accurate reconstruction of the masked region. The input audio drives lip movements to ensure synchronized speech.