Unimodal and Multimodal Static Facial Expression Recognition for Virtual Reality Users with EmoHeVRDB

Thorben Ortmann, Qi Wang, Larissa Putzar

TL;DR

This work addresses facial expression recognition (FER) in virtual reality, where head-mounted displays (HMDs) occlude the upper face, by using Facial Expression Activations (FEAs) from the Meta Quest Pro and the EmoHeVRDB dataset. Unimodal FER on FEAs reaches a peak accuracy of 73.02% with logistic regression, up to 3.18 percentage points above image-based FER on EmoHeVRDB. The authors then fuse FEAs with image data through late and intermediate fusion strategies; intermediate fusion reaches 80.42% accuracy and outperforms the unimodal baselines. The results establish new VR FER benchmarks on EmoHeVRDB and show that multimodal fusion can mitigate HMD-induced occlusion, with practical implications for emotion-aware VR applications and for future work on dynamic, sequential FER.

Abstract

In this study, we explored the potential of utilizing Facial Expression Activations (FEAs) captured via the Meta Quest Pro Virtual Reality (VR) headset for Facial Expression Recognition (FER) in VR settings. Leveraging the EmojiHeroVR Database (EmoHeVRDB), we compared several unimodal approaches and achieved up to 73.02% accuracy for the static FER task with seven emotion categories. Furthermore, we integrated FEA and image data in multimodal approaches, observing significant improvements in recognition accuracy. An intermediate fusion approach achieved the highest accuracy of 80.42%, significantly surpassing the baseline evaluation result of 69.84% reported for EmoHeVRDB's image data. Our study is the first to utilize EmoHeVRDB's unique FEA data for unimodal and multimodal static FER, establishing new benchmarks for FER in VR settings. Our findings highlight the potential of fusing complementary modalities to enhance FER accuracy in VR settings, where conventional image-based methods are severely limited by the occlusion caused by Head-Mounted Displays (HMDs).
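The difference between the two fusion families named above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the equal weighting in the late-fusion step, and the single linear layer in the intermediate-fusion step are all assumptions chosen for illustration. Late fusion combines the class probabilities of two independently trained models, whereas intermediate fusion concatenates per-modality feature vectors and classifies them jointly.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def late_fusion(fea_probs, image_probs, fea_weight=0.5):
    """Hypothetical late fusion: weighted average of the class
    probabilities produced by two separately trained models."""
    fused = [
        fea_weight * p + (1.0 - fea_weight) * q
        for p, q in zip(fea_probs, image_probs)
    ]
    return argmax(fused)


def intermediate_fusion(fea_features, image_features, weights, biases):
    """Hypothetical intermediate fusion: concatenate the feature vectors
    of both modalities, then apply one joint linear classifier.
    `weights` holds one row per class; each row spans the joint vector."""
    joint = list(fea_features) + list(image_features)
    logits = [
        sum(w * x for w, x in zip(row, joint)) + b
        for row, b in zip(weights, biases)
    ]
    return argmax(softmax(logits))


# Toy three-class example (the paper's task has seven emotion categories).
late_label = late_fusion([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
joint_label = intermediate_fusion(
    [1.0, 0.0],              # stand-in FEA features
    [0.0, 2.0],              # stand-in image features
    [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]],
    [0.0, 0.0, 0.0],
)
```

In the late-fusion call, the two models disagree and the averaged probabilities decide the label. In the intermediate-fusion call, the joint classifier sees both feature vectors at once, so in a trained model it can learn interactions between modalities that averaging outputs cannot capture.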

Paper Structure

This paper contains 13 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Exemplary central-view and side-view images of the happiness class from EmoHeVRDB.
  • Figure 2: Face blend shape for the left inner brow raiser, from the Face Tracking API's documentation [meta_face_tracking].
  • Figure 3: Sample distribution for EmoHeVRDB's FEA data.
  • Figure 4: Agreement between our FEA-based MLP and our image-based EfficientNet-B0 [ortmannEmojiherovr2024] on EmoHeVRDB's test set (c=central-view, s=side-view; colors from bottom to top: green=both correct, orange=only FEA model correct, blue=only image model correct, red=both incorrect).