E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models
Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie
TL;DR
The paper tackles the challenge of generating emotion-consistent responses in spoken dialogue systems. It introduces E-chat, an architecture that feeds emotion embeddings extracted by a HuBERT-based speech encoder into an LLM through a transformer-based connection module. Training follows a two-stage regime on the new E-chat200 dataset (178k emotion-labeled spoken-dialogue tuples), which enables end-to-end emotion-conditioned generation. The joint objective is $\text{Loss} = (1 - \alpha) \cdot L_{\text{decoder}} + \alpha \cdot L_{\text{emotion}}$, with $\alpha = 0$ in stage 1 and $\alpha = 0.1$ in stage 2, so the emotion loss contributes only in the second stage. In experiments, E-chat achieves better objective metrics (SIM and BLEU) and MOS compared with baselines such as ParalinGPT and a GPT-3.5 topline, and reaches 73.6% emotion-recognition accuracy on E-chat200, indicating that it both recognizes the speaker's emotion and responds appropriately. The work also highlights the value of a dedicated emotion-sensitive dataset for training and points to future work on end-to-end speech-to-speech systems that support fully audio-based interaction.
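To make the two-stage weighting concrete, here is a minimal sketch of how the joint objective combines the two terms. It assumes both losses are already available as scalar values; the function names `alpha_for_stage` and `joint_loss` and the example loss values are hypothetical and are not taken from the paper's code.

```python
def alpha_for_stage(stage: int) -> float:
    """Return the emotion-loss weight alpha for a training stage (1 or 2)."""
    # Weights as quoted in the summary: alpha = 0 in stage 1, 0.1 in stage 2.
    weights = {1: 0.0, 2: 0.1}
    if stage not in weights:
        raise ValueError(f"unknown training stage: {stage}")
    return weights[stage]


def joint_loss(decoder_loss: float, emotion_loss: float, alpha: float) -> float:
    """Compute Loss = (1 - alpha) * L_decoder + alpha * L_emotion."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * decoder_loss + alpha * emotion_loss


if __name__ == "__main__":
    # Made-up loss values, for illustration only.
    decoder_loss, emotion_loss = 2.0, 1.0
    for stage in (1, 2):
        alpha = alpha_for_stage(stage)
        total = joint_loss(decoder_loss, emotion_loss, alpha)
        print(f"stage {stage}: alpha={alpha}, loss={total:.3f}")
```

With $\alpha = 0$ the objective reduces to the decoder loss alone, so the emotion term has no influence in stage 1. In stage 2 the emotion term carries a weight of 0.1 and the decoder term a weight of 0.9.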
Abstract
This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emotional speech. To address this, we introduce the Emotional chat Model (E-chat), a novel spoken dialogue system capable of comprehending and responding to emotions conveyed in speech. This model leverages an emotion embedding extracted by a speech encoder, combined with LLMs, enabling it to respond according to different emotional contexts. Additionally, we introduce the E-chat200 dataset, designed explicitly for emotion-sensitive spoken dialogue. Across various evaluation metrics, E-chat consistently outperforms the baseline models, demonstrating its potential in emotional comprehension and human-machine interaction.
