What comprises a good talking-head video generation?: A Survey and Benchmark

Lele Chen, Guofeng Cui, Ziyi Kou, Haitian Zheng, Chenliang Xu

TL;DR

This survey targets the evaluation of identity-independent talking-head video generation and identifies four core desiderata: identity preservation, lip synchronization, visual quality, and natural, spontaneous motion. It proposes a unified benchmark with standardized preprocessing and a mix of existing and new metrics to quantify these criteria, including three perceptual measures: LRSD, ESD, and BSD. By benchmarking state-of-the-art methods, the authors show how head pose and motion influence identity preservation and visual quality, and they highlight the persistent challenges in semantic-level lip synchronization. The evaluation code is publicly available to facilitate fair comparisons and guide future work toward more realistic and semantically synchronized talking-head generation.

Abstract

Over the years, performance evaluation has become essential in computer vision, enabling tangible progress in many sub-fields. While talking-head video generation has become an emerging research topic, existing evaluations on this topic present many limitations. For example, most approaches use human subjects (e.g., via Amazon MTurk) to evaluate their research claims directly. This subjective evaluation is cumbersome, unreproducible, and may impede the evolution of new research. In this work, we present a carefully designed benchmark for evaluating talking-head video generation with standardized dataset pre-processing strategies. For evaluation, we either propose new metrics or select the most appropriate existing ones to assess what we consider the desired properties of a good talking-head video, namely, identity preservation, lip synchronization, high video quality, and natural, spontaneous motion. By conducting a thoughtful analysis across several state-of-the-art talking-head generation approaches, we aim to uncover the merits and drawbacks of current methods and point out promising directions for future work. All the evaluation code is available at: https://github.com/lelechen63/talking-head-generation-survey.

Paper Structure

This paper contains 24 sections, 8 equations, 25 figures, 7 tables.

Figures (25)

  • Figure 1: The general framework of talking-head generation methods.
  • Figure 2: Network illustration of the skip connections and the image matting function. (a) shows the detailed network structure of Jamaludin et al. [jamaludin2019you] and its skip-connection design. (b) illustrates the image matting function, where $\mathbf{A}$ is the attention map obtained by applying a Sigmoid activation function and $\mathbf{C}$ is the color mask obtained by applying a Tanh activation function.
  • Figure 3: Example images from different datasets. For each dataset in Table \ref{Table:dataset}, several video frames are sampled and shown.
  • Figure 4: The left column shows the Euler angle system. On the right side, the first three rows show the distribution of head poses across different datasets along the pitch, yaw, and roll axes, respectively. The last row shows the distribution of head motion across different datasets. In all plots, the x-axis is the angle in degrees and the y-axis is the ratio.
  • Figure 5: Video frames with changing facial appearance. The second row shows the results synthesized by our baseline on the VoxCeleb2 test set.
  • ...and 20 more figures