CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition

Jiang Li, Xiaoping Wang, Yingjian Liu, Zhigang Zeng

TL;DR

This work tackles emotion recognition in conversation (ERC) by addressing two limitations of prior multimodal ERC approaches: treating all modalities equally and neglecting emotion-shift information. It introduces CFN-ESA, a three-component network with a recurrence-based unimodal encoder (RUME), an attention-based cross-modal encoder (ACME) anchored on textual information, and a label-based emotion-shift module (LESM) that serves as an auxiliary task to guide learning under shifting emotions. Empirical results on MELD and IEMOCAP show state-of-the-art performance; ablations confirm that each component contributes to improved accuracy and F1 scores, and further analyses indicate that tri-modal fusion and shift-aware training both help. The method advances practical ERC by enhancing cross-modal complementarity and explicitly modeling emotion shifts, with implications for more robust human-machine dialogue systems.

Abstract

Multimodal emotion recognition in conversation (ERC) has garnered growing attention from research communities in various fields. In this paper, we propose a Cross-modal Fusion Network with Emotion-Shift Awareness (CFN-ESA) for ERC. Existing approaches treat all modalities equally, without distinguishing how much emotional information each one carries, which makes it hard to adequately extract complementary information from multimodal data. To cope with this problem, CFN-ESA treats the textual modality as the primary source of emotional information and the visual and acoustic modalities as secondary sources. In addition, most multimodal ERC models ignore emotion-shift information and overfocus on contextual information, so emotion recognition fails in emotion-shift scenarios. We design an emotion-shift module to address this challenge. CFN-ESA mainly consists of a unimodal encoder (RUME), a cross-modal encoder (ACME), and an emotion-shift module (LESM). RUME extracts conversation-level contextual emotional cues while pulling together the data distributions of the modalities; ACME performs multimodal interaction centered on the textual modality; LESM models emotion shift and captures emotion-shift information, thereby guiding the learning of the main task. Experimental results demonstrate that CFN-ESA effectively improves ERC performance and remarkably outperforms state-of-the-art models.
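
To make the idea of "multimodal interaction centered on the textual modality" concrete, the following is a minimal sketch, not the authors' ACME implementation. It assumes plain scaled dot-product attention without learned projections, residual addition as the update rule, and that the secondary modalities exchange information only with text; the function names (`attend`, `text_centered_fusion`) are hypothetical and chosen for illustration. It is written in plain Python to stay dependency-free.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, context):
    """Scaled dot-product attention: one context-weighted vector per query row."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(d) for c in context]
        w = softmax(scores)
        out.append([sum(wj * c[k] for wj, c in zip(w, context)) for k in range(d)])
    return out

def add(*mats):
    # Element-wise sum of equally shaped matrices (lists of rows).
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*mats)]

def text_centered_fusion(text, audio, visual):
    # Assumption: text (primary) queries both secondary modalities,
    # while audio and visual (secondary) are each updated from text only.
    text_new = add(text, attend(text, audio), attend(text, visual))
    audio_new = add(audio, attend(audio, text))
    visual_new = add(visual, attend(visual, text))
    return text_new, audio_new, visual_new
```

Each argument is a list of utterance-level feature vectors of the same dimension, one row per utterance in the conversation. A real model would add learned query/key/value projections, multiple heads, and normalization layers, all of which are omitted here.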

Paper Structure (31 sections, 14 equations, 14 figures, 8 tables)

Figures (14)

  • Figure 1: A conversational scene from the MELD dataset. If only the textual modality is considered, the emotion of $u_5$ may be recognized as neutral. The facial expression of the speaker who utters $u_5$ shows that the emotion should be anger, which is the true emotion of the utterance.
  • Figure 2: The overall architecture of our CFN-ESA. First, the utterance-level features of the visual, textual, and acoustic modalities are extracted by DenseNet, RoBERTa, and openSMILE, respectively; second, the intra-modal contextual information and inter-modal complementary information are captured by the unimodal encoder and the cross-modal encoder in turn; then, the utterance representations are refined by the emotion-shift module; finally, the emotion classifier is used for prediction.
  • Figure 3: The network structure of RUME. Note that RUME shares parameters for each modality, and $\oplus$ denotes the residual operation.
  • Figure 4: The network structure of ACME. (a), (b), and (c) show the structure for visual, textual, and acoustic information updating in ACME, respectively. Note that the information updating network for visual modality is similar to that for acoustic modality.
  • Figure 5: An example of constructing the emotion-shift probability tensor $\mathcal{T}_{123}$. Here, $\mathcal{T}_{123}$ can be viewed as a $3 \times 3$ matrix whose entries are feature vectors (emotion-shift probability vectors), each formed by concatenating the feature vectors of a pair of utterances.
  • ...and 9 more figures
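
The pairwise construction described in the Figure 5 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's LESM code: it assumes that an emotion shift between two utterances is labeled 1 when their emotion labels differ and 0 otherwise, and that the entry for pair $(i, j)$ is the concatenation of the feature vectors of utterances $u_i$ and $u_j$. The function names `shift_labels` and `pair_features` are hypothetical.

```python
def shift_labels(emotions):
    """Pairwise shift labels: 1 if utterances i and j carry different emotions.

    Assumption: a shift is defined purely by label inequality.
    """
    return [[int(a != b) for b in emotions] for a in emotions]

def pair_features(features):
    """Build an N x N grid whose (i, j) entry concatenates features i and j."""
    return [[fi + fj for fj in features] for fi in features]

# Three utterances, as in the 3 x 3 example of Figure 5.
emotions = ["neutral", "neutral", "anger"]
features = [[1, 2], [3, 4], [5, 6]]  # toy 2-dimensional utterance features

labels = shift_labels(emotions)
grid = pair_features(features)
```

In this toy case `grid` has 3 rows and 3 columns, and each entry is a vector of length 4 (two concatenated 2-dimensional features). A binary classifier trained on the entries of `grid` against `labels` would then play the role of the auxiliary shift task.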