DialogueLLM: Context and Emotion Knowledge-Tuned Large Language Models for Emotion Recognition in Conversations
Yazhou Zhang, Mengyao Wang, Youxi Wu, Prayag Tiwari, Qiuchi Li, Benyou Wang, Jing Qin
TL;DR
DialogueLLM is a context- and emotion-knowledge-tuned LLM obtained by fine-tuning LLaMA 2-7B on a dataset of 2411 multi-modal dialogues (texts and video-derived descriptions) for emotion recognition in conversations (ERC). It treats ERC as a conditional generation task, integrating contextual history and multimodal cues via a prompt that includes the conversational context $C_z$, the target utterance $T_k$, and a textual description of the video, $\text{Text Description}(V_k)$, to predict the emotion label $Y_k$. Empirical results on MELD, IEMOCAP, and EmoryNLP show state-of-the-art performance, with notable improvements over 15 baselines and existing SOTA LLMs, while ablation and prompting studies validate the importance of context, multimodal knowledge, and LoRA fine-tuning. The work demonstrates the viability of open-source, task-specific LLMs for nuanced affective understanding, with practical implications for ERC systems and multimodal NLP research, and outlines future work on richer video descriptions and broader affect modeling.
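To make the conditional-generation framing concrete, the following is a minimal sketch of how a prompt could be assembled from the context $C_z$, the target utterance $T_k$, and the video description. The template wording, the `Utterance` class, the `build_erc_prompt` function, and the use of MELD's seven emotion classes as the label set are all illustrative assumptions; the authors' actual instruction template is not given in this summary and may differ.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed label set for illustration: the seven emotion classes used in MELD.
EMOTIONS = ["neutral", "joy", "surprise", "sadness", "anger", "disgust", "fear"]


@dataclass
class Utterance:
    """One conversational turn (hypothetical helper type)."""
    speaker: str
    text: str


def build_erc_prompt(
    context: List[Utterance],
    target: Utterance,
    video_description: Optional[str] = None,
    labels: List[str] = EMOTIONS,
) -> str:
    """Assemble an instruction-style prompt for one target utterance.

    context           -> plays the role of C_z (preceding turns)
    target            -> plays the role of T_k
    video_description -> plays the role of Text Description(V_k)
    The model is expected to generate the label Y_k after "Emotion:".
    """
    lines = ["Context:"]
    if context:
        lines.extend(f"{u.speaker}: {u.text}" for u in context)
    else:
        lines.append("(none)")
    lines.append("Target utterance:")
    lines.append(f"{target.speaker}: {target.text}")
    # The visual cue is optional, so text-only dialogues still yield a valid prompt.
    if video_description:
        lines.append(f"Visual description: {video_description}")
    lines.append("Choose one emotion from: " + ", ".join(labels) + ".")
    lines.append("Emotion:")
    return "\n".join(lines)


if __name__ == "__main__":
    history = [
        Utterance("Speaker A", "I finally got the job."),
        Utterance("Speaker B", "No way, really?"),
    ]
    current = Utterance("Speaker A", "Yes, I start on Monday!")
    print(build_erc_prompt(history, current, "Speaker A is smiling broadly."))
```

Under this framing, training pairs would consist of such a prompt as input and the gold emotion label as the generation target, so the same template can be reused at inference time.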
Abstract
Large language models (LLMs) and their variants have shown extraordinary efficacy across numerous downstream natural language processing (NLP) tasks, which has presented a new vision for the development of NLP. Despite their remarkable performance in natural language generation (NLG), LLMs lack a distinct focus on the emotion understanding domain. As a result, using LLMs for emotion recognition may lead to suboptimal and inadequate precision. Another limitation of LLMs is that they are typically trained without leveraging multi-modal information. To overcome these limitations, we propose DialogueLLM, a context and emotion knowledge tuned LLM that is obtained by fine-tuning LLaMA models with 13,638 multi-modal (i.e., texts and videos) emotional dialogues. The visual information is treated as supplementary knowledge for constructing high-quality instructions. We offer a comprehensive evaluation of our proposed model on three benchmark emotion recognition in conversations (ERC) datasets and compare the results against the SOTA baselines and other SOTA LLMs. Additionally, DialogueLLM-7B can be easily trained using LoRA on a 40GB A100 GPU in 5 hours, facilitating reproducibility for other researchers.
