EVOKE: Emotion Enabled Virtual Avatar Mapping Using Optimized Knowledge Distillation

Maryam Nadeem, Raza Imam, Rouqaiah Al-Refai, Meriem Chkir, Mohamad Hoda, Abdulmotaleb El Saddik

TL;DR

EVOKE tackles EEG-based emotion recognition for emotionally expressive virtual avatars in resource-limited settings. It introduces a knowledge-distillation pipeline in which a heavy CCNN teacher guides a compact two-convolution student to perform multi-label classification of valence, arousal, and dominance, achieving an $18$-fold parameter reduction while attaining about $87\%$ accuracy on the DEAP dataset. The approach arranges differential-entropy features into $4\times9\times9$ grids and maps the classifier outputs to eight discrete emotions for avatar control; the distillation temperature $T$ and loss weight $\alpha$ are tuned for best performance (e.g., $T=1.25$, $\alpha=0.25$). It reaches inference times as low as $0.33$ ms and a throughput of $8.0176\times10^{4}$ samples/s on a single RTX A6000, enabling real-time deployment in virtual environments and potential applications in healthcare and therapy.
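
To make the feature layout concrete, here is a minimal sketch of how differential-entropy (DE) values could be computed and scattered into a $4\times9\times9$ grid. It assumes each band signal is already band-pass filtered and roughly Gaussian, so that DE reduces to $\tfrac{1}{2}\ln(2\pi e\sigma^{2})$. The function names and the electrode-to-cell `positions` map are hypothetical and are not taken from the paper.

```python
import numpy as np
from typing import Dict, Tuple


def differential_entropy(band_signal: np.ndarray) -> float:
    """DE of a 1-D signal under a Gaussian assumption: 0.5 * ln(2*pi*e*variance)."""
    variance = np.var(band_signal)
    return float(0.5 * np.log(2 * np.pi * np.e * variance))


def to_grid(de_values: np.ndarray,
            positions: Dict[int, Tuple[int, int]]) -> np.ndarray:
    """Scatter DE values of shape (bands, channels) into a (bands, 9, 9) grid.

    `positions` maps a channel index to a (row, col) cell. The real electrode
    layout is not given in the text, so any map passed here is an assumption.
    Cells without an electrode stay at zero.
    """
    n_bands = de_values.shape[0]
    grid = np.zeros((n_bands, 9, 9))
    for channel, (row, col) in positions.items():
        grid[:, row, col] = de_values[:, channel]
    return grid
```

With four bands, `to_grid` returns an array of shape `(4, 9, 9)`, matching the input size described above.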

Abstract

As virtual environments continue to advance, the demand for immersive and emotionally engaging experiences has grown. Addressing this demand, we introduce Emotion enabled Virtual avatar mapping using Optimized KnowledgE distillation (EVOKE), a lightweight emotion recognition framework designed for the seamless integration of emotion recognition into 3D avatars within virtual environments. Our approach leverages knowledge distillation involving multi-label classification on the publicly available DEAP dataset, which covers valence, arousal, and dominance as primary emotional classes. Remarkably, our distilled model, a CNN with only two convolutional layers and 18 times fewer parameters than the teacher model, achieves competitive results, reaching an accuracy of 87% while demanding far fewer computational resources. This balance between performance and deployability positions our framework as an ideal choice for virtual environment systems. Furthermore, the multi-label classification outcomes are used to map emotions onto custom-designed 3D avatars.
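
The distillation objective can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the soft-target loss $\mathcal{L}_{1}$ and the hard-label loss $\mathcal{L}_{2}$ are both taken to be binary cross-entropy on logits, and $\alpha$ is assumed to weight the soft term with $1-\alpha$ on the hard term. The text only says the final loss is a weighted combination of the two, so which term $\alpha$ multiplies is a guess. The defaults reuse the values $T=1.25$ and $\alpha=0.25$ quoted in the TL;DR.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))


def bce_with_logits(logits: np.ndarray, targets: np.ndarray) -> float:
    """Mean binary cross-entropy from raw logits, in a numerically stable form."""
    # max(x, 0) - x * t + log(1 + exp(-|x|))
    per_element = (np.maximum(logits, 0.0) - logits * targets
                   + np.log1p(np.exp(-np.abs(logits))))
    return float(np.mean(per_element))


def distillation_loss(student_logits: np.ndarray,
                      teacher_logits: np.ndarray,
                      hard_labels: np.ndarray,
                      T: float = 1.25,
                      alpha: float = 0.25) -> float:
    """Weighted sum of a soft-target loss (L1) and a hard-label loss (L2).

    Assumption: alpha weights the soft term and (1 - alpha) the hard term.
    """
    soft_targets = sigmoid(teacher_logits / T)                     # softened teacher outputs
    soft_loss = bce_with_logits(student_logits / T, soft_targets)  # L1
    hard_loss = bce_with_logits(student_logits, hard_labels)       # L2
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

Each row of the logit arrays holds three values, one per label (valence, arousal, dominance), so the same function covers the multi-label setting without modification.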

Paper Structure (17 sections, 7 equations, 4 figures, 1 table)

This paper contains 17 sections, 7 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Inference speed of the teacher and student models. The distilled student model can be deployed for real-time applications, since smaller models have lower inference times.
  • Figure 2: Our proposed framework, EVOKE. (1) Raw EEG signals are input to the framework and first undergo preprocessing and label thresholding. (2) Preprocessing comprises feature extraction using differential entropy to obtain 4 frequency-band channels, followed by noise reduction, resulting in a 3D grid input. (3) The features are fed to the teacher CCNN, which generates soft labels with temperature $T>1$; these are used to compute the Binary Cross-Entropy with logits loss ($\mathcal{L}_{1}$). (4) The final distillation loss ($\mathcal{L}_{distill}$) is a weighted combination of the soft-target loss ($\mathcal{L}_{1}$) and the final loss ($\mathcal{L}_{2}$) (see Eq. \ref{kd_loss}). (5) The distilled model, trained on the soft predictions of the teacher model, is then (6) deployed for fast inference and real-time applications.
  • Figure 3: Multi-label classification results are mapped to eight emotions based on different combinations of valence, arousal, and dominance and further associated with 3D avatars through a hashing process.
  • Figure 4: Performance analysis across various values of the temperature parameter $T$ (accuracy in (a), F1 score in (c)) and of the weight factor $\alpha$ (accuracy in (b), F1 score in (d); see Eq. \ref{kd_loss}). The accuracy and F1 scores shown are mean values from a 5-fold cross-validation evaluation.
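
The mapping described in Figure 3 can be illustrated with a small lookup sketch. Three binary labels (valence, arousal, dominance) give $2^{3}=8$ combinations, and a Python dictionary, which is a hash table, resolves each combination to an avatar expression. The emotion names and avatar assets are not listed here, so the `avatar_expression_*` identifiers and the $0.5$ threshold below are hypothetical placeholders.

```python
from typing import Dict, Sequence, Tuple

VADKey = Tuple[int, int, int]


def build_emotion_table() -> Dict[VADKey, str]:
    """Enumerate all 8 (valence, arousal, dominance) combinations.

    The values are placeholder asset names, not the paper's emotion labels.
    """
    table: Dict[VADKey, str] = {}
    for v in (0, 1):
        for a in (0, 1):
            for d in (0, 1):
                table[(v, a, d)] = f"avatar_expression_{v}{a}{d}"
    return table


def map_prediction(probs: Sequence[float],
                   table: Dict[VADKey, str],
                   threshold: float = 0.5) -> str:
    """Binarize three per-label probabilities and look up the avatar expression."""
    key = tuple(int(p >= threshold) for p in probs)
    return table[key]
```

For example, `map_prediction((0.9, 0.2, 0.7), build_emotion_table())` selects the entry for high valence, low arousal, and high dominance.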