How Do You Perceive My Face? Recognizing Facial Expressions in Multi-Modal Context by Modeling Mental Representations

Florian Blume, Runfeng Qu, Pia Bideau, Martin Maier, Rasha Abdel Rahman, Olaf Hellwich

TL;DR

The paper tackles context-sensitive facial expression perception by learning separate representations of facial content and multi-modal context with a VAE-GAN framework. A novel Context-Attention Network adapts the facial representation based on the context, which allows the model to classify the expression and to visualize the context-augmented mental representation at the same time. The approach achieves state-of-the-art accuracy on RAVDESS (81.01%) and MEAD (79.34%) and is validated by a human rating study showing that the generated expressions align with human perception under contextual influence. These results suggest practical implications for socially aware agents that adapt to human mental and emotional states while providing interpretable visualizations of their decision-making.

Abstract

Facial expression perception in humans inherently relies on prior knowledge and contextual cues, contributing to efficient and flexible processing. For instance, multi-modal emotional context (such as voice color, affective text, body pose, etc.) can prompt people to perceive emotional expressions in objectively neutral faces. Drawing inspiration from this, we introduce a novel approach for facial expression classification that goes beyond simple classification tasks. Our model accurately classifies a perceived face and synthesizes the corresponding mental representation perceived by a human when observing a face in context. With this, our model offers visual insights into its internal decision-making process. We achieve this by learning two independent representations of content and context using a VAE-GAN architecture. Subsequently, we propose a novel attention mechanism for context-dependent feature adaptation. The adapted representation is used for classification and to generate a context-augmented expression. We evaluate synthesized expressions in a human study, showing that our model effectively produces approximations of human mental representations. We achieve State-of-the-Art classification accuracies of 81.01% on the RAVDESS dataset and 79.34% on the MEAD dataset. We make our code publicly available.

Paper Structure (23 sections, 10 equations, 17 figures, 4 tables)


Figures (17)

  • Figure 1: Visualization of the influence of context on human perception of facial expressions. The mental representation shifts congruently with context.
  • Figure 2: Overview of our full network architecture. The face and context reconstruction networks face_reconst_net and context_reconst_net are each a VAE-GAN combination. The mean and variance of a facial input image and audio context (Mel spectrogram) are adapted by the context_att_net, which shifts the face representation using the context representation. The classification head (E) classifies the shifted features, and we use the fixed decoder of (A1) to visualize the expression.
  • Figure 3: Detailed view of the context_att_net from the pipeline overview for adapting the means. $\odot$ is element-wise multiplication, $\oplus$ is addition. We left out the index $\mu$ on the weights for simplicity. The shifted variance is computed analogously.
  • Figure 4: Confusion matrices for our model. (a) Uni-modal setting with simulated missing context. (b) Multi-modal setting showing the clear diagonal.
  • Figure 5: Comparison of visualizations of all samples (i.e. frames plus audio) from the identities 01, 02 and 03.
  • ...and 12 more figures
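
The Figure 3 caption states only that the face mean is shifted using the context representation through element-wise multiplication ($\odot$) and addition ($\oplus$); the exact wiring is not given in this summary. The following is a minimal sketch of one possible reading, not the authors' implementation. The gating form, the sigmoid, the dimensions, and all names (`shift_face_mean`, `w_gate_face`, `w_gate_ctx`, `w_shift`) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here as a hypothetical gate activation."""
    return 1.0 / (1.0 + np.exp(-x))


def shift_face_mean(mu_face, mu_ctx, w_gate_face, w_gate_ctx, w_shift):
    """Shift a face latent mean using a context latent mean.

    Assumed form (not taken from the paper):
        gate    = sigmoid(W_gate_face @ mu_face + W_gate_ctx @ mu_ctx)
        shift   = W_shift @ mu_ctx
        shifted = mu_face (+) gate (.) shift
    """
    gate = sigmoid(w_gate_face @ mu_face + w_gate_ctx @ mu_ctx)
    shift = w_shift @ mu_ctx          # context projected into the face latent space
    return mu_face + gate * shift     # element-wise multiplication, then addition


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_face, d_ctx = 8, 4
mu_face = rng.normal(size=d_face)
mu_ctx = rng.normal(size=d_ctx)
w_gate_face = rng.normal(size=(d_face, d_face))
w_gate_ctx = rng.normal(size=(d_face, d_ctx))
w_shift = rng.normal(size=(d_face, d_ctx))

shifted = shift_face_mean(mu_face, mu_ctx, w_gate_face, w_gate_ctx, w_shift)

# With an all-zero context vector the shift term vanishes in this sketch,
# so the face mean is returned unchanged.
neutral = shift_face_mean(mu_face, np.zeros(d_ctx), w_gate_face, w_gate_ctx, w_shift)
```

In this sketch the shifted variance would be computed the same way with its own set of weights, mirroring the caption's remark that it is "computed analogously".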