BEE-NET: A deep neural network to identify in-the-wild Bodily Expression of Emotions

Mohammad Mahdi Dehshibi, David Masip

TL;DR

This work tackles the automatic identification of in-the-wild bodily expressions of emotions (AIBEE) and examines how environmental context influences it. It introduces BEE-NET, a three-stream CNN that fuses scene/place and object cues with the emotion stream through a differentiable, Bayesian-inspired late fusion (probabilistic pooling) that models joint and conditional relationships. On BoLD, BEE-NET achieves an Emotion Recognition Score (ERS) of 66.33%, surpassing the prior state of the art by 2.07%, and ablations confirm the critical roles of place context and the proposed fusion scheme. The results indicate that context-aware, end-to-end learning improves AIBEE performance on in-the-wild data.

Abstract

In this study, we investigate how environmental factors, specifically the scenes and objects involved, can affect the expression of emotions through body language. To this end, we introduce a novel multi-stream deep convolutional neural network named BEE-NET. We also propose a new late fusion strategy that incorporates meta-information on places and objects as prior knowledge in the learning process. Our proposed probabilistic pooling model leverages this information to generate a joint probability distribution of both available and anticipated non-available contextual information in latent space. Importantly, our fusion strategy is differentiable, which allows end-to-end training and captures hidden associations among data points without requiring further post-processing or regularisation. To evaluate our deep model, we use the Body Language Database (BoLD), which is currently the largest available database for the Automatic Identification of the in-the-wild Bodily Expression of Emotions (AIBEE). Our experimental results demonstrate that our proposed approach surpasses the current state of the art in AIBEE by a margin of 2.07%, achieving an Emotion Recognition Score (ERS) of 66.33%.
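
To make the idea of using place and object tags as prior knowledge more concrete, the following is a minimal sketch, not the authors' probabilistic pooling layer. It assumes a simple product-of-experts combination: a context-derived prior over emotions is obtained by marginalising a conditional table over the tag probabilities, and is then multiplied with the emotion-stream scores. The function name `fuse_with_context`, the two-emotion toy setup, and the tag names in the comments are hypothetical. Every step is a sum or a product, so the same computation would remain differentiable if written in a deep learning framework.

```python
def fuse_with_context(emotion_probs, context_probs, cond_table):
    """Reweight emotion scores by a context-derived prior (illustrative only).

    emotion_probs: list of E scores from the emotion stream.
    context_probs: list of K tag probabilities from the place/object streams.
    cond_table:    K x E table, cond_table[k][e] approximating P(emotion e | tag k).
    """
    n_emotions = len(emotion_probs)
    n_tags = len(context_probs)
    # Marginalise over context tags: prior[e] = sum_k P(e | k) * P(k).
    prior = [
        sum(context_probs[k] * cond_table[k][e] for k in range(n_tags))
        for e in range(n_emotions)
    ]
    # Assumed product-of-experts combination, followed by renormalisation.
    unnorm = [p * q for p, q in zip(emotion_probs, prior)]
    total = sum(unnorm)
    if total == 0.0:
        # Fall back to the emotion stream alone if the product vanishes.
        return list(emotion_probs)
    return [u / total for u in unnorm]


# Toy example: the emotion stream is undecided between two emotions,
# while the context strongly suggests tag 0 (say, a hypothetical "playground").
emotion_probs = [0.5, 0.5]
context_probs = [0.9, 0.1]
cond_table = [[0.8, 0.2],   # P(emotion | tag 0)
              [0.3, 0.7]]   # P(emotion | tag 1)
fused = fuse_with_context(emotion_probs, context_probs, cond_table)
```

In this toy case the context shifts the fused scores towards the first emotion. BoLD emotion labels are multi-label, so the renormalisation step here is a simplification chosen for readability rather than a property of BEE-NET.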

Paper Structure (12 sections, 16 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: (a) A sample from the BoLD database that mainly represents happiness. (b) Top-10 place tags were obtained by applying Places-CNN trained on the Places2 database to the input. (c) Object tags were obtained by applying YOLO trained on the Microsoft COCO database to the input.
  • Figure 2: (a) BEE-NET architecture for the identification of in-the-wild bodily expression of emotions. The Place and Object streams are shaded grey in this schematic to indicate that their layers are frozen (zero learning rate) during training. (b) Pseudo-colour plot of the conditional probability of the emotion tags given the place and object tags. The y-axis represents the probability of the pseudo-ground-truth for places and objects ($\kappa=365+80$), while the x-axis represents the 26 discrete emotions. Darker blue indicates lower probability values.
  • Figure 3: Cumulative probability of (a) labels in the BoLD database, (b) pseudo-tags obtained by applying Places-CNN to the BoLD database, and (c) pseudo-tags obtained by applying the YOLO object detector to the BoLD database.
  • Figure 4: The emotion recognition score for the proposed architecture as a function of $(\kappa, \lambda)$. The best trade-off between these two hyper-parameters is $(\kappa, \lambda) = (56, 0.2)$, resulting in an ERS value of 83.64%.
  • Figure 5: Classification performance for discrete emotions is reported as average precision (AP) in the [first row] and area under the receiver operating characteristic curve (RA) in the [second row]. Regression performance for continuous emotions is reported as the $R^{2}$ score in the [third row].
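
The conditional probability of emotion tags given place and object tags, as plotted in Figure 2(b), can be illustrated with a small co-occurrence estimate. The sketch below is an assumption about how such a table could be built from pseudo-tags and emotion labels, not the procedure used in the paper. The helper `conditional_table`, the add-one smoothing, and the toy samples are all hypothetical.

```python
def conditional_table(samples, n_tags, n_emotions, smoothing=1.0):
    """Estimate P(emotion | tag) from co-occurrence counts (illustrative only).

    samples: iterable of (tag_ids, emotion_ids) pairs, one per clip, where
             tag_ids are pseudo-tags and emotion_ids are annotated emotions.
    Returns an n_tags x n_emotions table whose rows each sum to 1.
    """
    # Start every cell at `smoothing` so unseen pairs keep a small probability.
    counts = [[smoothing] * n_emotions for _ in range(n_tags)]
    for tag_ids, emotion_ids in samples:
        for k in tag_ids:
            for e in emotion_ids:
                counts[k][e] += 1.0
    table = []
    for row in counts:
        total = sum(row)
        if total > 0.0:
            table.append([c / total for c in row])
        else:
            # With smoothing=0 and no observations, fall back to uniform.
            table.append([1.0 / n_emotions] * n_emotions)
    return table


# Toy data: two tags, two emotions, four annotated clips.
samples = [([0], [0]), ([0], [0]), ([0], [1]), ([1], [1])]
table = conditional_table(samples, n_tags=2, n_emotions=2)
```

With the dimensions given in the caption, the table would have $\kappa = 365 + 80$ rows and 26 columns; the toy sizes here only keep the example readable.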