GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations

Yupei Li, Qiyang Sun, Sunil Munthumoduku Krishna Murthy, Emran Alturki, Björn W. Schuller

TL;DR

This work tackles the dynamic, multimodal nature of emotion in conversations by introducing GatedxLSTM, an xLSTM-based ERC model that takes audio and transcripts from both the speaker and the interlocutor, aligns the two modalities with CLAP, and uses a gating mechanism to highlight emotionally impactful utterances. A Dialogical Emotion Decoder (DED) provides context-aware post-processing to refine predictions across dialogue turns. On the IEMOCAP dataset, the approach achieves state-of-the-art performance among open-source methods for four-class emotion classification, with ablation analyses confirming that CLAP alignment, gating, and DED each add to performance. The model also offers interpretable insights into the relative influence of speaker vs. interlocutor modalities, contributing to the understanding of affective dynamics and supporting practical ERC deployments in real-time or AGI-related settings.

Abstract

Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual's expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Additionally, the Dialogical Emotion Decoder (DED) refines emotion predictions by modelling contextual dependencies. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.

Paper Structure

This paper contains 17 sections, 7 equations, 2 figures, and 5 tables.

Figures (2)

  • Figure 1: The pipeline of GatedxLSTM. First, each audio sample and its corresponding transcription are processed using the CLAP model, which maps them into a shared embedding space and aligns their embeddings. Next, for a given utterance at time $T$, we identify its preceding utterance spoken by the interlocutor. Both the audio and text representations are then passed through four distinct xLSTM blocks to extract features. To incorporate contextual information, we retrieve relevant features from several preceding utterances (e.g., at $T-1$). These extracted features are then processed through a gating mechanism that determines their contribution to the final emotion recognition task; the audio of the current utterance is assigned a weight of 1. After the gating mechanism is applied, the features are passed through a fully connected layer. Finally, the prediction is refined using DED in the last stage of the pipeline.
  • Figure 2: Average absolute weights for each output neuron. '0' represents the current speaker, '1' represents the interlocutor. 'A' denotes audio, 'T' denotes text, and 'k' denotes the time frame.
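
The gating step described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the scalar sigmoid gate, the function names `gate_features` and `fuse`, and the stream labels (`A0`, `T0`, `A1`, `T1`, loosely borrowing the symbols from Figure 2) are all hypothetical. The only detail taken from the caption is that the audio of the current utterance keeps a weight of 1 while the other feature streams are re-weighted before the fully connected layer.

```python
import math
from typing import Dict, List


def sigmoid(x: float) -> float:
    """Squash a real-valued gate logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gate_features(
    features: Dict[str, List[float]],
    gate_logits: Dict[str, float],
    anchor: str = "A0",
) -> Dict[str, List[float]]:
    """Scale each feature stream by a scalar gate.

    ASSUMPTION: one scalar sigmoid gate per stream. The anchor stream
    (audio of the current utterance) always keeps a weight of 1, as
    stated in the Figure 1 caption; everything else is illustrative.
    """
    gated: Dict[str, List[float]] = {}
    for name, vector in features.items():
        weight = 1.0 if name == anchor else sigmoid(gate_logits[name])
        gated[name] = [weight * value for value in vector]
    return gated


def fuse(gated: Dict[str, List[float]]) -> List[float]:
    """Concatenate the gated streams (in insertion order) into one vector
    that a fully connected classification layer could consume."""
    fused: List[float] = []
    for vector in gated.values():
        fused.extend(vector)
    return fused


if __name__ == "__main__":
    # Toy two-dimensional features for four hypothetical streams:
    # A = audio, T = text, 0 = current speaker, 1 = interlocutor.
    toy_features = {
        "A0": [2.0, 4.0],
        "T0": [2.0, 4.0],
        "A1": [1.0, 1.0],
        "T1": [0.0, 6.0],
    }
    # Hypothetical gate logits; in a trained model these would be learned.
    toy_logits = {"T0": 0.0, "A1": 2.0, "T1": -2.0}
    print(fuse(gate_features(toy_features, toy_logits)))
```

In this sketch a gate logit of 0 halves a stream's contribution, while large positive or negative logits push its weight towards 1 or 0. Inspecting such weights per stream is one simple way to compare the influence of speaker and interlocutor modalities, in the spirit of Figure 2.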