ERNetCL: A novel emotion recognition network in textual conversation based on curriculum learning strategy

Jiang Li, Xiaoping Wang, Yingjian Liu, Zhigang Zeng

TL;DR

ERNetCL addresses emotion recognition in conversation (ERC) by jointly modeling temporal and spatial context through a GRU-based temporal encoder (TE) and a multi-head attention spatial encoder (SE), while mitigating the effect of emotion shift via a curriculum learning (CL) loss. Sample difficulty is quantified by the emotion-shift frequency within a conversation, and this difficulty drives an epoch-dependent sample weighting that progressively exposes the network to harder cases. Empirical results on MELD, IEMOCAP, EmoryNLP, and DailyDialog show that ERNetCL achieves superior or competitive performance, with ablations confirming the benefits of TE, SE, and CL. The approach offers a lightweight yet effective alternative to complex ERC architectures and suggests promising directions for multimodal and contrastive learning in conversation understanding.

Abstract

Emotion recognition in conversation (ERC) has emerged as a research hotspot in domains such as conversational robots and question-answering systems. Retrieving contextual emotional cues efficiently and adequately remains one of the key challenges in the ERC task. Existing approaches do not fully model the context and rely on complex network structures, which results in limited performance gains. In this paper, we propose a novel emotion recognition network based on a curriculum learning strategy (ERNetCL). ERNetCL primarily consists of a temporal encoder (TE), a spatial encoder (SE), and a curriculum learning (CL) loss. We use TE and SE to combine the strengths of previous methods in a simple manner, efficiently capturing temporal and spatial contextual information in the conversation. To mitigate the harmful influence of emotion shift and to mimic the way humans learn a curriculum from easy to hard, we apply the idea of CL to the ERC task to progressively optimize the network parameters. At the beginning of training, we assign lower learning weights to difficult samples; as the epochs advance, the learning weights for these samples are gradually raised. Extensive experiments on four datasets show that the proposed method is effective and substantially outperforms other baseline models.
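
The curriculum weighting described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the difficulty measure (fraction of adjacent utterance pairs whose emotion label changes), the linear schedule, and the names `emotion_shift_frequency`, `curriculum_weight`, `weighted_loss`, and `min_weight` are all assumptions chosen for illustration.

```python
from typing import Sequence


def emotion_shift_frequency(labels: Sequence[str]) -> float:
    """Assumed difficulty measure: share of adjacent utterance pairs
    whose emotion label differs (0.0 = no shifts, 1.0 = always shifts)."""
    if len(labels) < 2:
        return 0.0
    shifts = sum(1 for prev, cur in zip(labels, labels[1:]) if prev != cur)
    return shifts / (len(labels) - 1)


def curriculum_weight(difficulty: float, epoch: int, total_epochs: int,
                      min_weight: float = 0.1) -> float:
    """Assumed linear schedule: a conversation with difficulty 0 always has
    weight 1; a conversation with difficulty 1 starts at min_weight and
    reaches weight 1 in the final epoch."""
    progress = min(max(epoch / max(total_epochs - 1, 1), 0.0), 1.0)
    start = 1.0 - (1.0 - min_weight) * difficulty
    return start + (1.0 - start) * progress


def weighted_loss(losses: Sequence[float], difficulties: Sequence[float],
                  epoch: int, total_epochs: int) -> float:
    """Mean of per-conversation losses, each scaled by its curriculum weight."""
    weights = [curriculum_weight(d, epoch, total_epochs) for d in difficulties]
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


if __name__ == "__main__":
    easy = ["joy", "joy", "joy", "joy"]       # no emotion shift
    hard = ["joy", "anger", "joy", "sadness"]  # shift at every turn
    difficulties = [emotion_shift_frequency(c) for c in (easy, hard)]
    for epoch in (0, 5, 9):
        weights = [round(curriculum_weight(d, epoch, 10), 3) for d in difficulties]
        print(epoch, weights)
```

In this sketch the easy conversation keeps full weight throughout, while the weight of the hard conversation rises with each epoch, which mirrors the easy-to-hard ordering the abstract describes.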

Paper Structure

This paper contains 22 sections, 8 equations, 9 figures, and 5 tables.

Figures (9)

  • Figure 1: A conversational scenario. Combining current and contextual information is needed to comprehensively determine the emotion of the utterance to be predicted.
  • Figure 2: The overall architecture of our proposed ERNetCL. The proposed method sequentially abstracts temporal and spatial contextual cues through temporal and spatial encoders. In the training phase, the curriculum learning loss is adopted to optimize the network parameters instead of the original loss.
  • Figure 3: F1 score for each emotion on the MELD and IEMOCAP datasets. Our model's F1 scores for all emotions are higher than AGHMN's results.
  • Figure 4: t-SNE visualization of IEMOCAP before and after feature extraction.
  • Figure 5: Results after removing each component on the EmoryNLP and DailyDialog datasets.
  • ...and 4 more figures