LaERC-S: Improving LLM-based Emotion Recognition in Conversation with Speaker Characteristics

Yumeng Fu, Junjie Wu, Zhongjie Wang, Meishan Zhang, Lili Shan, Yulin Wu, Bingquan Li

TL;DR

LaERC-S addresses emotion recognition in conversation (ERC) by using large language models to extract dynamic speaker characteristics (mental state, behavior, persona) and a two-stage learning process to inject this knowledge into emotion prediction. The framework uses designed prompts and templates to elicit targeted speaker cues, then trains with an instruction-tuning objective that culminates in emotion classification guided by an explicit $oReact$ cue, formalized as $L_k = \sum_{i'}^{j} -\log P(\mu_{(k,i')} \mid x_k, \theta_k)$. Evaluations on IEMOCAP, MELD, and EmoryNLP show state-of-the-art weighted-F1 scores, and ablations demonstrate the value of both the speaker characteristics and the two-stage approach; the results hold across datasets and model variations. The work shows that integrating dynamic speaker information drawn from an LLM's world knowledge can make emotion understanding in conversations more accurate and more generalizable, while remaining trainable on a single GPU. Future directions include exploring richer expressions of speaker characteristics and extending the approach to other NLP tasks that benefit from nuanced speaker modeling.
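The objective above is a negative log-likelihood summed over target tokens. The following is a minimal sketch of that sum, not the authors' implementation. It assumes (this summary does not define the symbols) that $\mu_{(k,i')}$ is the $i'$-th target token at stage $k$, $x_k$ is the stage input, and $\theta_k$ are the model parameters. The function name `sequence_nll` and the probabilities are hypothetical.

```python
import math


def sequence_nll(token_probs):
    """Sum of -log P over the probabilities a model assigns to each gold token.

    token_probs stands in for P(mu_(k,i') | x_k, theta_k) for each target
    position i'; in practice these would come from a language model.
    """
    return sum(-math.log(p) for p in token_probs)


# Hypothetical probabilities for a three-token target (illustrative only).
probs = [0.9, 0.8, 0.5]
loss = sequence_nll(probs)
print(f"L_k = {loss:.4f}")
```

Because the terms are summed logs, the loss equals the negative log of the product of the token probabilities, so a target the model finds more likely overall yields a lower loss.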

Abstract

Emotion recognition in conversation (ERC), the task of discerning human emotions for each utterance within a conversation, has garnered significant attention in human-computer interaction systems. Previous ERC studies focus on speaker-specific information that predominantly stems from relationships among utterances, and therefore lack sufficient information about the surrounding conversation. Recent research in ERC has sought to exploit pre-trained large language models (LLMs) with speaker modelling to comprehend emotional states. Although these methods have achieved encouraging results, the extracted speaker-specific information struggles to indicate emotional dynamics. In this paper, motivated by the fact that speaker characteristics play a crucial role and that LLMs have rich world knowledge, we present LaERC-S, a novel framework that stimulates LLMs to explore speaker characteristics, involving the mental state and behavior of interlocutors, for accurate emotion prediction. To endow LLMs with this knowledge, we adopt two-stage learning so that the models reason about speaker characteristics and track the emotion of the speaker in complex conversation scenarios. Extensive experiments on three benchmark datasets demonstrate the superiority of LaERC-S, which achieves a new state of the art.

Paper Structure (31 sections, 2 equations, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Comparison between existing ERC models and the proposed LaERC-S. (a) Existing ERC methods exploit static clues, such as speaker biography and speaker role, to infer emotional states. (b) The proposed LaERC-S captures rich and deep clues about emotional dynamics, including the mental state and behavior of interlocutors, to trigger the target emotion.
  • Figure 2: Overview of LaERC-S. LaERC-S comprises speaker characteristics extraction, speaker characteristics injection, and emotion recognition. In speaker characteristics extraction, speaker characteristics are extracted from LLMs. In speaker characteristics injection, the generated speaker characteristics are used to make the models perceive emotional dynamics. In emotion recognition, the conversational content and predefined emotion labels are converted into a formatted input for the final response. As depicted in the example, LaERC-S bridges the gap between speaker characteristics and the response "sad".
  • Figure 3: Cross-dataset analysis. 'Single' and 'Mixed Ratio' refer to training on a single dataset and on a mixed dataset, respectively. We sequentially select data from each dataset in ratios of 1/8, 1/4, 1/2, and 1. 'Avg' represents the average of the differences between the 'Single' W-F1 and the 'Mixed Ratio' W-F1.
  • Figure 4: Case study of three samples from the IEMOCAP dataset.
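
The W-F1 referred to in Figure 3 is the weighted F1 score: per-class F1 averaged with weights proportional to each class's share of the gold labels. The following is a minimal sketch of that standard definition, not the paper's evaluation script; the function name `weighted_f1` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp  # gold instances of this class that were missed
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Toy labels for illustration only; not data from the paper.
gold = ["sad", "sad", "happy", "neutral"]
pred = ["sad", "happy", "happy", "neutral"]
print(round(weighted_f1(gold, pred), 4))
```

Weighting by support matters for ERC benchmarks because emotion classes are typically imbalanced, so frequent classes contribute more to the score than rare ones.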