E-chat: Emotion-sensitive Spoken Dialogue System with Large Language Models

Hongfei Xue, Yuhao Liang, Bingshen Mu, Shiliang Zhang, Mengzhe Chen, Qian Chen, Lei Xie

TL;DR

The paper tackles the challenge of generating emotion-consistent responses in spoken dialogue systems by introducing E-chat, an architecture that fuses emotion embeddings extracted from a HuBERT-based speech encoder with an LLM via a transformer-based connection module. It employs a two-stage training regime and introduces the E-chat200 dataset (178k emotion-labeled spoken-dialogue tuples) to enable end-to-end emotion-conditioned generation, optimizing the joint objective $\text{Loss} = (1 - \alpha) \cdot L_{\text{decoder}} + \alpha \cdot L_{\text{emotion}}$ with $\alpha = 0$ in stage 1 and $\alpha = 0.1$ in stage 2. Experiments show that E-chat achieves superior objective metrics (SIM and BLEU) and MOS compared with baselines like ParalinGPT and a GPT-3.5 topline, along with a 73.6% emotion-recognition accuracy on E-chat200, demonstrating robust emotional understanding and appropriate responses. The work also highlights the value of a dedicated emotion-sensitive dataset for training and points to future directions toward end-to-end speech-to-speech systems that can deliver fully audio-based interactions.
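
The weighting in the joint objective can be illustrated with a minimal sketch. It assumes the decoder loss and the emotion loss have already been computed as plain scalars; the names `joint_loss` and `STAGE_ALPHA` are hypothetical and chosen only for illustration, not taken from the paper's code.

```python
# Hypothetical helper illustrating the two-stage weighting from the TL;DR.
# Only the formula and the two alpha values come from the summary above.

# Stage 1 uses alpha = 0 (decoder loss only); stage 2 uses alpha = 0.1.
STAGE_ALPHA = {1: 0.0, 2: 0.1}


def joint_loss(decoder_loss: float, emotion_loss: float, alpha: float) -> float:
    """Return (1 - alpha) * decoder_loss + alpha * emotion_loss."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * decoder_loss + alpha * emotion_loss


if __name__ == "__main__":
    # Example scalar losses (made-up numbers, for illustration only).
    l_decoder, l_emotion = 2.0, 4.0
    for stage, alpha in STAGE_ALPHA.items():
        print(stage, joint_loss(l_decoder, l_emotion, alpha))
```

With $\alpha = 0$ the emotion term drops out entirely, so under this reading stage 1 trains on the decoder loss alone, and stage 2 adds the emotion loss with a small weight.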

Abstract

This study focuses on emotion-sensitive spoken dialogue in human-machine speech interaction. With the advancement of Large Language Models (LLMs), dialogue systems can handle multimodal data, including audio. Recent models have enhanced the understanding of complex audio signals through the integration of various audio events. However, they are unable to generate appropriate responses based on emotional speech. To address this, we introduce the Emotional chat Model (E-chat), a novel spoken dialogue system capable of comprehending and responding to emotions conveyed in speech. This model leverages an emotion embedding extracted by a speech encoder, combined with LLMs, enabling it to respond according to different emotional contexts. Additionally, we introduce the E-chat200 dataset, designed explicitly for emotion-sensitive spoken dialogue. Across various evaluation metrics, E-chat consistently outperforms the baseline models, demonstrating its potential in emotional comprehension and human-machine interaction.

Paper Structure

This paper contains 12 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Emotion-sensitive spoken dialogue scenario. $<>$ denotes the emotion of the speech.
  • Figure 2: Architecture schematic diagram of E-chat. $<>$ denotes the emotion of the speech. $()$ denotes the translation of Chinese sentences. On the left is the architecture of E-chat. On the right are the corresponding data samples and prompts for the two-stage training.
  • Figure 3: Examples of E-chat. Each sentence is manually recorded using an Android phone. $<>$ denotes the emotion of the speech. $()$ denotes the translation of Chinese sentences. On the left are three samples; on the right are the corresponding questions and responses for other emotions.