THQA: A Perceptual Quality Assessment Database for Talking Heads

Yingjie Zhou, Zicheng Zhang, Wei Sun, Xiaohong Liu, Xiongkuo Min, Zhihua Wang, Xiao-Ping Zhang, Guangtao Zhai

TL;DR

The paper addresses the lack of quality assessment tools for AI-generated talking-head (TH) videos by introducing THQA, a database of 800 videos produced by 8 speech-driven methods from 20 StyleGAN faces. It describes the selection of source material, the speech-driven generation methods, and the collection of subjective mean opinion scores (MOS) following established guidelines, and it benchmarks no-reference (NR) image and video quality assessment (IQA/VQA) methods on the database. The key finding is that existing NR metrics, including deep-learning-based ones, still correlate poorly with human perception of TH content; VSFA performs best among the tested methods. THQA therefore serves as a resource for developing better TH quality-of-experience (QoE) assessment methods and for guiding improvements in speech-driven TH generation.

Abstract

In the realm of media technology, digital humans have gained prominence due to rapid advancements in computer technology. However, the manual modeling and control required for the majority of digital humans pose significant obstacles to efficient development. Speech-driven methods offer a novel avenue for manipulating the mouth shape and expressions of digital humans. Despite the proliferation of driving methods, the quality of many generated talking head (TH) videos remains a concern and degrades the user's visual experience. To tackle this issue, this paper introduces the Talking Head Quality Assessment (THQA) database, featuring 800 TH videos generated through 8 diverse speech-driven methods. Extensive experiments affirm the THQA database's richness in character and speech features. Subsequent subjective quality assessment experiments analyze correlations between scoring results and speech-driven methods, ages, and genders. In addition, experimental results show that mainstream image and video quality assessment methods have limitations on the THQA database, underscoring the need for further research on TH video quality assessment. The THQA database is publicly accessible at https://github.com/zyj-2000/THQA.
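
As an illustration of how the agreement between an objective quality metric and subjective scores is commonly quantified, the sketch below computes the Pearson linear correlation and the Spearman rank correlation between metric predictions and MOS values. This is a minimal sketch under stated assumptions: the abstract does not say which correlation criteria the authors use, the function names are hypothetical, and the five score pairs are made-up toy values, not THQA data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for illustration only (NOT taken from the THQA database).
toy_mos = [1.8, 2.5, 3.1, 3.9, 4.4]            # hypothetical subjective scores
toy_predicted = [0.30, 0.25, 0.52, 0.61, 0.58]  # hypothetical metric outputs

if __name__ == "__main__":
    print("Spearman:", round(spearman(toy_mos, toy_predicted), 3))
    print("Pearson: ", round(pearson(toy_mos, toy_predicted), 3))
```

A rank correlation only asks whether the metric orders videos the same way viewers do, whereas the linear correlation also depends on how the metric's output scale relates to the MOS scale, which is why the two are often reported together.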

Paper Structure

This paper contains 15 sections, 2 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Motivation of our work. Quality assessment of talking head videos is essential for both the further development of speech-driven methods and the enhancement of the user experience.
  • Figure 2: Overview of the selected human faces.
  • Figure 3: Phonological attributes of the selected speech, excluding samples displaying resonance peak merging. The first formant peak correlates with the amplitude of mouth opening and closing. Concurrently, the second formant peak is associated with tongue position. Bubble color denotes the subject ID corresponding to the speech, while bubble size serves as an indicator of speech duration.
  • Figure 4: Diverse distortions observed in TH videos. The left side displays frames from the video containing distortions, while the right side presents a magnified view for localized examination. Green, red, and blue borders mark good samples, distorted samples, and frames used for frame-to-frame comparison, respectively.
  • Figure 5: Screenshot of the subjective quality assessment interface.
  • ...and 1 more figure