Emo3D: Metric and Benchmarking Dataset for 3D Facial Expression Generation from Emotion Description

Mahshid Dehghani, Amirahmad Shafiee, Ali Shafiei, Neda Fallah, Farahmand Alizadeh, Mohammad Mehdi Gholinejad, Hamid Behroozi, Jafar Habibi, Ehsaneddin Asgari

TL;DR

Emo3D is an extensive Text-Image-Expression dataset spanning a wide spectrum of human emotions, each paired with images and 3D blendshapes, together with a new evaluation metric that demonstrates its superiority over Mean Squared Error (MSE) in assessing visual-text alignment and semantic richness in 3D facial expressions associated with human emotions.

Abstract

Existing 3D facial emotion modeling approaches have been constrained by limited emotion classes and insufficient datasets. This paper introduces "Emo3D", an extensive "Text-Image-Expression dataset" spanning a wide spectrum of human emotions, each paired with images and 3D blendshapes. Leveraging Large Language Models (LLMs), we generate a diverse array of textual descriptions, facilitating the capture of a broad spectrum of emotional expressions. Using this dataset, we conduct a comprehensive evaluation of fine-tuned language-based models and vision-language models such as Contrastive Language-Image Pretraining (CLIP) for 3D facial expression synthesis. We also introduce a new evaluation metric for this task that measures the conveyed emotion more directly. This metric, Emo3D, demonstrates its superiority over Mean Squared Error (MSE) in assessing visual-text alignment and semantic richness in 3D facial expressions associated with human emotions. "Emo3D" has applications in animation design, virtual reality, and emotional human-computer interaction.

Paper Structure (13 sections, 4 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: Emo3D Dataset Creation: Textual data describing human emotions is initially generated using GPT [OpenAI2023]. We then utilize DALL-E models [ramesh2022hierarchical] to synthesize human faces. Each image undergoes face blendshape extraction using MediaPipe [Lugaresi2019]. Furthermore, we employ GPT [OpenAI2023] to extract the emotion distribution for each prompt.
  • Figure 2: "Surprise" Emotion Word Cloud: closest words to "surprise" using Emolex [LREC18-AIL], based on cosine similarity of emotion distributions.
  • Figure 3: Emotion-XML uses emotion ground truth to predict facial blendshapes. An Emotion Extractor guides the Regression model with the Teacher-Forcing technique at a 50% ratio. Both units are trained via Mean Squared Error (MSE) loss.
  • Figure 4: VAE CLIP concurrently reconstructs facial expressions while aligning their latent representation with corresponding text and image representations in the CLIP space.
  • Figure 5: The Emo3D metric starts by selecting $n$ prompts with a balanced emotion distribution. For a given input prompt, we generate a facial expression using a text-conditioned facial expression generation (FEG) model. We project the resulting 3D face model onto a 2D image and employ zero-shot CLIP to identify the $k$ nearest text prompts among the $n$. Finally, we compute the Kullback-Leibler (KL) divergence between the emotion distribution of the input text and the emotion distributions of these $k$ prompts.
  • ...and 5 more figures
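
The metric computation described in the Figure 5 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the CLIP embeddings of the rendered face image and of the $n$ reference prompts are already computed and passed in as plain lists of floats, and it assumes the $k$ KL divergences are aggregated by a simple mean (the caption does not state the aggregation). All function and parameter names here are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same emotion classes."""
    # Smooth and renormalise so zero entries do not cause log(0) or division by zero.
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def k_nearest_prompts(image_emb, prompt_embs, k):
    """Indices of the k reference prompts most similar to the rendered image."""
    ranked = sorted(
        range(len(prompt_embs)),
        key=lambda i: cosine(image_emb, prompt_embs[i]),
        reverse=True,
    )
    return ranked[:k]

def emo3d_style_score(input_dist, image_emb, prompt_embs, prompt_dists, k=2):
    """Mean KL divergence between the input prompt's emotion distribution and
    those of the k nearest reference prompts (mean aggregation is an assumption)."""
    neighbours = k_nearest_prompts(image_emb, prompt_embs, k)
    return sum(kl_divergence(input_dist, prompt_dists[i]) for i in neighbours) / k

# Toy data: three reference prompts with 2-D embeddings and 2-class emotion distributions.
prompt_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
prompt_dists = [[0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]
image_emb = [1.0, 0.05]  # stand-in for the CLIP embedding of the projected face

matching = emo3d_style_score([0.8, 0.2], image_emb, prompt_embs, prompt_dists, k=2)
mismatched = emo3d_style_score([0.1, 0.9], image_emb, prompt_embs, prompt_dists, k=2)
```

Under these assumptions a lower score means the generated face lands near prompts whose emotion distributions resemble that of the input text, so `matching` comes out smaller than `mismatched` in the toy data.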