Fusion in Context: A Multimodal Approach to Affective State Recognition

Youssef Mohamed, Severin Lemaignan, Arzu Guneysu, Patric Jensfelt, Christian Smith

TL;DR

This paper proposes a transformer-based multimodal fusion approach that leverages facial thermal data, facial action units, and textual context information for context-aware emotion recognition, and demonstrates improvements from incorporating contextual information and multimodal fusion.

Abstract

Accurate recognition of human emotions is a crucial challenge in affective computing and human-robot interaction (HRI). Emotional states play a vital role in shaping behaviors, decisions, and social interactions. However, emotional expressions can be influenced by contextual factors, leading to misinterpretations if context is not considered. Multimodal fusion, combining modalities like facial expressions, speech, and physiological signals, has shown promise in improving affect recognition. This paper proposes a transformer-based multimodal fusion approach that leverages facial thermal data, facial action units, and textual context information for context-aware emotion recognition. We explore modality-specific encoders to learn tailored representations, which are then fused using additive fusion and processed by a shared transformer encoder to capture temporal dependencies and interactions. The proposed method is evaluated on a dataset collected from participants engaged in a tangible tabletop Pacman game designed to induce various affective states. Our results demonstrate the effectiveness of incorporating contextual information and multimodal fusion for affective state recognition.

Paper Structure

This paper contains 27 sections, 3 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Facial landmarks and Action Units extracted using OpenFace (left) and thermal regions of interest (right) [mohamed2023multi].
  • Figure 2: Multimodal Transformer Architecture: integrates action units (16), facial thermal data (144), and context text embeddings (3072) through modality-specific encoders. The encoded modalities are combined by additive fusion and processed by a transformer encoder with positional encoding, followed by a classification head that predicts one of 4 affective states.
  • Figure 3: Confusion Matrix for Thermal + AU + FC configuration.
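
The pipeline named in the Figure 2 caption (modality-specific encoders, additive fusion, positional encoding, a shared encoder, and a classification head) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that the numbers 16, 144, and 3072 are per-time-step feature sizes and that the context embedding is available at every time step. The shared width `D_MODEL = 64`, the random weights, and the use of one single-head self-attention layer in place of a full transformer encoder are all illustrative choices, and every function name is hypothetical.

```python
import numpy as np

# Assumed per-time-step feature sizes, read off the Figure 2 caption.
AU_DIM, THERMAL_DIM, CONTEXT_DIM = 16, 144, 3072
NUM_CLASSES = 4   # four affective states (from the caption)
D_MODEL = 64      # hypothetical shared embedding width, not from the paper


def linear(rng, in_dim, out_dim):
    """Return randomly initialised (weight, bias) for a linear layer."""
    w = rng.standard_normal((in_dim, out_dim)) / np.sqrt(in_dim)
    return w, np.zeros(out_dim)


def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding of shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]
    even_idx = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, even_idx / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over time steps.

    A simplified stand-in for a full transformer encoder layer
    (no multi-head split, feed-forward block, or layer norm).
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(x.shape[-1])
    return softmax(scores) @ v


def forward(au, thermal, context, params):
    """Map three (T, dim) modality sequences to class probabilities."""
    # 1) Modality-specific encoders project each input to D_MODEL.
    inputs = (au, thermal, context)
    encoded = [x @ w + b for x, (w, b) in zip(inputs, params["enc"])]
    # 2) Additive fusion: element-wise sum of the encoded modalities.
    fused = encoded[0] + encoded[1] + encoded[2]
    # 3) Positional encoding, then shared attention with a residual link.
    h = fused + positional_encoding(fused.shape[0], D_MODEL)
    h = h + self_attention(h, *params["attn"])
    # 4) Mean-pool over time and classify.
    w, b = params["head"]
    return softmax(h.mean(axis=0) @ w + b)


rng = np.random.default_rng(0)
params = {
    "enc": [linear(rng, d, D_MODEL)
            for d in (AU_DIM, THERMAL_DIM, CONTEXT_DIM)],
    "attn": [rng.standard_normal((D_MODEL, D_MODEL)) / np.sqrt(D_MODEL)
             for _ in range(3)],
    "head": linear(rng, D_MODEL, NUM_CLASSES),
}

T = 10  # hypothetical number of time steps in one window
probs = forward(
    rng.standard_normal((T, AU_DIM)),
    rng.standard_normal((T, THERMAL_DIM)),
    rng.standard_normal((T, CONTEXT_DIM)),
    params,
)
print(probs.shape)  # one probability per affective state
```

Because the three encoders map to the same width, additive fusion keeps the sequence length and model width fixed regardless of how many modalities are present. That makes it straightforward to drop a modality when comparing configurations, although how the paper handles this is not stated here.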