SpikEmo: Enhancing Emotion Recognition With Spiking Temporal Dynamics in Conversations

Xiaomin Yu, Feiyang Wang, Ziyue Qiao

TL;DR

The SpikEmo framework, which is based on spiking neurons and employs a Semantic & Dynamic Two-stage Modeling approach to more precisely capture the complex temporal features of multimodal emotional data, significantly outperforms existing state-of-the-art methods in ERC tasks.

Abstract

In affective computing, the task of Emotion Recognition in Conversations (ERC) has emerged as a focal area of research. The primary objective of this task is to predict emotional states within conversations by analyzing multimodal data including text, audio, and video. While existing studies have made progress in extracting and fusing representations from multimodal data, they often overlook the temporal dynamics of the data during conversations. To address this challenge, we have developed the SpikEmo framework, which is based on spiking neurons and employs a Semantic & Dynamic Two-stage Modeling approach to more precisely capture the complex temporal features of multimodal emotional data. Additionally, to tackle the class imbalance and emotional semantic similarity problems in ERC tasks, we have devised an innovative combination of loss functions that significantly enhances the model's performance when dealing with ERC data characterized by long-tail distributions. Extensive experiments conducted on multiple ERC benchmark datasets demonstrate that SpikEmo significantly outperforms existing state-of-the-art methods in ERC tasks. Our code is available at https://github.com/Yu-xm/SpikEmo.git.
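
The abstract names spiking neurons as the basis of the framework but does not describe them at this point. As a rough illustration of how a spiking neuron turns a sequence of inputs into a binary spike train over discrete time steps, the sketch below simulates a generic leaky integrate-and-fire (LIF) neuron. The choice of the LIF model, the function name `lif_spike_train`, and the parameter values (`tau`, `v_threshold`, `v_reset`) are assumptions made for illustration; they are not taken from the SpikEmo code or paper.

```python
from typing import List


def lif_spike_train(
    inputs: List[float],
    tau: float = 2.0,          # membrane time constant (assumed value)
    v_threshold: float = 1.0,  # firing threshold (assumed value)
    v_reset: float = 0.0,      # potential after a spike (assumed value)
) -> List[int]:
    """Simulate one generic LIF neuron over len(inputs) time steps.

    Illustrative sketch only, not the SpikEmo implementation.
    """
    v = v_reset
    spikes = []
    for x in inputs:
        # Leaky integration: the potential moves toward the input
        # while decaying back toward the reset value.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)  # the neuron fires
            v = v_reset       # hard reset after firing
        else:
            spikes.append(0)
    return spikes


if __name__ == "__main__":
    # A constant input of 1.5 makes this neuron fire on every second step.
    print(lif_spike_train([1.5, 1.5, 1.5, 1.5]))
```

Because the membrane potential carries over from one step to the next, the output at each step depends on the input history, which is the kind of temporal behavior the abstract refers to.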

Paper Structure

This paper contains 20 sections, 13 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: An example of the ERC task from the MELD dataset.
  • Figure 2: The data distribution of MELD and IEMOCAP. Both datasets exhibit a noticeable long-tail distribution.
  • Figure 3: Overall framework of SpikEmo. The modality-level semantic modeling extracts contextualized modality representations, the feature-level dynamic contextualized modeling extracts cross-modality temporal information, and the $L_{corr}$ and $L_{DSC}$ losses are proposed to capture correlations and mitigate the long-tail problem in model training.
  • Figure 4: The impact of T settings on model performance.