RE-LLM: Refining Empathetic Speech-LLM Responses by Integrating Emotion Nuance

Jing-Han Chen, Bo-Hao Su, Ya-Tse Wu, Chi-Chun Lee

TL;DR

This work tackles the limited emotion nuance in speech-driven empathetic LLMs by introducing RE-LLM, a speech-LLM augmented with an emotion nuance module that fuses rich speech emotion embeddings and dimensional emotion attributes. The approach combines a frozen emotion encoder, dimensional auxiliary tasks, and a two-step training regime that first generates emotion-conditioned expected responses and then aligns the model through KL divergence and auxiliary losses. Empirical results across IEMOCAP, ESD, and MSP-PODCAST show significant gains in the Emotional Reaction and Exploration metrics, as well as improvements in unweighted accuracy (UA) for speech emotion recognition, demonstrating enhanced emotional understanding and empathetic response generation. Limitations include occasional responses with limited exploratory prompting and an evaluation restricted to single-turn exchanges; future work targets multi-turn dialogues and a deeper analysis of how emotion recognition accuracy affects response quality.
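The alignment objective described above can be illustrated with a minimal sketch. This is not the paper's formulation: the direction of the KL term, the use of mean squared error for the dimensional auxiliary task, the weight `aux_weight`, and all function names are assumptions made for illustration only.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary distribution.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(target_logits, model_logits):
    # KL(target || model) for a single token position.
    # Assumption: the expected response acts as the target distribution.
    log_p = log_softmax(target_logits)
    log_q = log_softmax(model_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def combined_loss(target_steps, model_steps, pred_attrs, true_attrs, aux_weight=0.5):
    # Mean per-token KL term plus a weighted auxiliary regression term.
    # Assumption: the auxiliary task regresses dimensional emotion
    # attributes (hypothetically arousal, valence, dominance) with MSE.
    kl = sum(kl_divergence(t, m) for t, m in zip(target_steps, model_steps))
    kl /= len(target_steps)
    aux = sum((p - t) ** 2 for p, t in zip(pred_attrs, true_attrs))
    aux /= len(true_attrs)
    return kl + aux_weight * aux

# Toy usage with made-up logits for two token positions.
expected = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
predicted = [[3.0, 2.0, 1.0], [0.1, 0.2, 0.3]]
loss = combined_loss(expected, predicted, [0.4, 0.6, 0.5], [0.2, 0.5, 0.7])
```

The KL term is zero when the model reproduces the expected-response distribution exactly, so in this sketch the auxiliary term is the only remaining signal once the two distributions agree.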

Abstract

As generative AI advances, empathy in human-AI interaction becomes essential. While prior work focuses on emotional reflection, emotional exploration, which is key to deeper engagement, remains overlooked. Existing LLMs rely on text, which captures only limited emotional nuance. To address this, we propose RE-LLM, a speech-LLM integrating dimensional emotion embeddings and auxiliary learning. Experiments show statistically significant gains in empathy metrics across three datasets. On ESD, RE-LLM improves the Emotional Reaction score by a relative 14.79% and 6.76% over the text-only and speech-LLM baselines, respectively. Notably, it raises the Exploration score by 35.42% and 3.91% on IEMOCAP, 139.28% and 9.83% on ESD, and 60.95% and 22.64% on MSP-PODCAST. It also boosts unweighted accuracy in speech emotion recognition by 5.4% on IEMOCAP, 2.3% on ESD, and 6.9% on MSP-PODCAST. These results highlight the enriched emotional understanding and improved empathetic response generation of RE-LLM.
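The two kinds of figures quoted in the abstract can be made concrete with a short sketch. Relative improvement is the percentage change over a baseline score, and unweighted accuracy in speech emotion recognition is conventionally the mean of per-class recalls, so every emotion class counts equally regardless of how many samples it has. The function names and toy labels below are illustrative and not taken from the paper.

```python
from collections import defaultdict

def relative_improvement(baseline, improved):
    # Percentage change of `improved` relative to `baseline`.
    return (improved - baseline) / baseline * 100.0

def unweighted_accuracy(y_true, y_pred):
    # Mean of per-class recalls, so rare classes weigh as much as common ones.
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Toy labels (made up): three "happy" utterances and one "sad" utterance.
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "sad", "sad"]
ua = unweighted_accuracy(y_true, y_pred)  # (2/3 + 1/1) / 2
gain = relative_improvement(2.0, 3.0)     # 50.0 percent
```

On the toy labels, plain accuracy is 3/4, while the unweighted accuracy is higher because the single "sad" sample is classified correctly and carries the same weight as the whole "happy" class.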

Paper Structure

This paper contains 19 sections, 10 equations, 1 figure, 2 tables.

Figures (1)

  • Figure 1: The architecture of the proposed RE-LLM, which comprises a speech-LLM and an emotion nuance module. The figure also depicts the training strategy: a preprocessing generation step, followed by expected behavioral alignment constrained on emotion nuance.