Ges-QA: A Multidimensional Quality Assessment Dataset for Audio-to-3D Gesture Generation

Zhilin Gao, Yunhao Li, Sijing Wu, Yuqin Cao, Huiyu Duan, Guangtao Zhai

Abstract

The Audio-to-3D-Gesture (A2G) task has enormous potential for applications in virtual reality, computer graphics, and related fields. However, current evaluation metrics, such as Fréchet Gesture Distance or Beat Constancy, fail to reflect human preferences for generated 3D gestures. To address this problem, it is increasingly important to study human preference and to develop an objective quality assessment metric for AI-generated 3D human gestures. In this paper, we introduce the Ges-QA dataset, which includes 1,400 samples with multidimensional scores for gesture quality and audio-gesture consistency. Moreover, we collect binary classification labels indicating whether the generated gestures match the emotions of the audio. Building on the Ges-QA dataset, we propose Ges-QAer, a multi-modal transformer-based neural network with three branches for the video, audio, and 3D skeleton modalities, which can score A2G content along multiple dimensions. Comparative experiments and ablation studies demonstrate that Ges-QAer achieves state-of-the-art performance on our dataset.

Paper Structure

This paper contains 19 sections, 1 equation, 5 figures, 1 table.

Figures (5)

  • Figure 1: Pipeline overview of the A2G quality assessment task. Following data acquisition and the establishment of the Ges-QA dataset via a subjective experiment, we train a quality evaluation framework, Ges-QAer.
  • Figure 2: The construction process of the Ges-QA dataset. Step 1 displays the eight emotion categories of the audio data. In Step 3, subjects provide two Mean Opinion Scores (MOSs) for different dimensions and one Subjective Binary Annotation for Emotion (ESBA) for each sample.
  • Figure 3: Mean Opinion Score comparison of gesture quality and audio-gesture consistency across multiple A2G approaches.
  • Figure 4: Visualization of emotion congruence accuracy across multiple A2G approaches under different emotions.
  • Figure 5: The architecture of Ges-QAer. Ges-QAer uses three separate encoders to obtain single-modality representations, and a Multilayer Perceptron (MLP) for feature fusion.
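
The Figure 2 caption says each sample receives two MOSs and one binary emotion annotation. The minimal sketch below shows one plausible way such per-subject ratings could be aggregated for a single sample. It is not the authors' procedure: the function name `aggregate_sample`, the tuple layout, the 1-to-5 rating scale in the example, and the majority-vote rule for the binary label are all assumptions made for illustration. Only the fact that a MOS is the mean of the subjects' scores follows from the definition of the term.

```python
from statistics import mean


def aggregate_sample(ratings):
    """Aggregate per-subject ratings for one generated gesture sample.

    `ratings` is a list of (quality, consistency, emotion_match) tuples,
    one tuple per subject. This layout is hypothetical; the source does
    not specify how raw annotations are stored.
    """
    # A Mean Opinion Score is the arithmetic mean of the subjects' scores.
    quality_mos = mean(r[0] for r in ratings)
    consistency_mos = mean(r[1] for r in ratings)

    # Assumption: the binary emotion label is decided by strict majority
    # vote, with ties resolved as "no match". The source does not state
    # the aggregation rule.
    votes = [bool(r[2]) for r in ratings]
    emotion_match = sum(votes) * 2 > len(votes)

    return {
        "quality_mos": quality_mos,
        "consistency_mos": consistency_mos,
        "emotion_match": emotion_match,
    }


# Three hypothetical subjects rating one sample on an assumed 1-to-5 scale.
ratings = [(4, 3, True), (5, 4, True), (3, 3, False)]
result = aggregate_sample(ratings)
print(result)
```

Keeping the two MOS dimensions separate, rather than averaging them into a single score, mirrors the multidimensional scoring described in the abstract.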