Evaluation of Generative Models for Emotional 3D Animation Generation in VR

Kiran Chhatre, Renan Guarese, Andrii Matviienko, Christopher Peters

TL;DR

The paper investigates how speech-driven emotional 3D animations are perceived in immersive VR. It compares three state-of-the-art speech-driven methods against a real-human reconstruction baseline across two arousal states, using a VR-based, user-centered evaluation with N=48 participants. Findings show that explicit emotion modeling improves emotion recognition and that high-arousal happy animations are rated as more realistic and natural than neutral ones, while reconstruction-based facial expressions outperform generative methods in facial expression quality. All methods received relatively low ratings for animation enjoyment and interaction quality, whereas participants positively recognized animation diversity across the generative models. The paper argues for integrating perceptual, user-centric evaluations into generative model development.

Abstract

Social interactions incorporate nonverbal signals to convey emotions alongside speech, including facial expressions and body gestures. Generative models have demonstrated promising results in creating full-body nonverbal animations synchronized with speech; however, evaluations using statistical metrics in 2D settings fail to fully capture user-perceived emotions, limiting our understanding of model effectiveness. To address this, we evaluate emotional 3D animation generative models within a Virtual Reality (VR) environment, emphasizing user-centric metrics (emotional arousal realism, naturalness, enjoyment, diversity, and interaction quality) in a real-time human-agent interaction scenario. Through a user study (N=48), we examine perceived emotional quality for three state-of-the-art speech-driven 3D animation methods across two emotions: happiness (high arousal) and neutral (mid arousal). Additionally, we compare these generative models against real human expressions obtained via a reconstruction-based method to assess both their strengths and limitations and how closely they replicate real human facial and body expressions. Our results demonstrate that methods explicitly modeling emotions lead to higher recognition accuracy compared to those focusing solely on speech-driven synchrony. Users rated the realism and naturalness of happy animations significantly higher than those of neutral animations, highlighting the limitations of current generative models in handling subtle emotional states. Generative models underperformed compared to reconstruction-based methods in facial expression quality, and all methods received relatively low ratings for animation enjoyment and interaction quality, emphasizing the importance of incorporating user-centric evaluations into generative model development. Finally, participants positively recognized animation diversity across all generative models.

Paper Structure

This paper contains 61 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Evaluation of Generative Models for Emotional 3D Animation in VR. In this evaluation, participants interact with a virtual character using a VR headset. The setup is modular and supports integration of various text-to-speech (TTS) models and speech-driven 3D animation generation methods. On the right, the figure illustrates an interaction between the participant and the virtual character. Participants' positions are tracked by two base stations installed in the study room, and they use a tablet to record input during the session. The animation generation method utilizes speech segments generated by a TTS system to produce corresponding 3D facial expressions and body animations. These predicted animation data are mapped onto a 3D character, textures are applied via UV mapping, and the final content is rendered and streamed in real-time for VR interaction using Blender (OpenXR).
  • Figure 2: Emotion classification. Left: Ekman’s discrete-emotion theory identifies six basic categories (anger, disgust, fear, happiness, sadness, and surprise), treating each as a distinct class rather than points on a continuum [ekman1993]. Right: The circumplex model [RussellCircumplex] places emotions in a two-dimensional space spanned by arousal and valence; the center represents neutral arousal and neutral valence.
  • Figure 3: Qualitative evaluation. Top: Specific frames from the generated animation sequences using EMAGE [Liu_2024_CVPR], TalkSHOW [yi2023generating], and a combination of AMUSE (body animation) [chhatre2023emotional] and FaceFormer (facial expressions) [fan2021faceformer]. Bottom: The workflow for generating reconstruction-based animations from real human facial expressions and body gestures using driving video input, which serves as our baseline. The reconstruction method PIXIE [PIXIE:2021] + DECA [DECA:Siggraph2021] predicts pose parameters, normal maps, and textures, which are combined and rendered. Specific frames from the resulting video-based reconstruction animations are shown in the bottom right.
  • Figure 4: Summary of Likert scale results. Ratings for Animation Realism (avatar felt like a real person), Animation Naturalness (facial expressions; body movements), Animation Enjoyment, and Interaction Quality (interaction warmth). For brevity, we denote EMAGE, TalkSHOW, PIXIE+DECA, and AMUSE+FaceFormer as M1, M2, M3, and M4, respectively, and use "High" and "Low" to represent happy and neutral emotions.
  • Figure A.1: Left: SMPL-X joints. Right: Blender render of an outdoor scene.
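
The shorthand in the Figure 4 caption (M1 to M4, "High"/"Low") can be easy to misread when going through the results. The following is a minimal sketch of a lookup that expands it, assuming conditions are written as a combined label such as `M3-High`. That label format, the function name `decode_condition`, and the returned field names are hypothetical and do not come from the paper; only the method and emotion mappings are taken from the captions above.

```python
# Method and emotion shorthand as given in the Figure 4 caption.
METHODS = {
    "M1": "EMAGE",
    "M2": "TalkSHOW",
    "M3": "PIXIE+DECA",
    "M4": "AMUSE+FaceFormer",
}
EMOTIONS = {"High": "happy", "Low": "neutral"}

# Per the Figure 3 caption, PIXIE+DECA is the reconstruction-based baseline.
RECONSTRUCTION_BASELINE = "M3"


def decode_condition(label: str) -> dict:
    """Expand a hypothetical label such as 'M3-High' into readable fields."""
    method_key, sep, emotion_key = label.partition("-")
    if not sep or method_key not in METHODS or emotion_key not in EMOTIONS:
        raise ValueError(f"unrecognized condition label: {label!r}")
    return {
        "method": METHODS[method_key],
        "emotion": EMOTIONS[emotion_key],
        "is_reconstruction": method_key == RECONSTRUCTION_BASELINE,
    }


# Every method/emotion combination implied by the caption (4 methods x 2 emotions).
ALL_CONDITIONS = [f"{m}-{e}" for m in METHODS for e in EMOTIONS]

if __name__ == "__main__":
    for condition in ALL_CONDITIONS:
        print(condition, decode_condition(condition))
```

Keeping the baseline flag separate from the method name makes it straightforward to group the three generative methods against the reconstruction baseline when reading the ratings.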