Leveraging Chain of Thought towards Empathetic Spoken Dialogue without Corresponding Question-Answering Data

Jingran Xie; Shun Lei; Yue Yu; Yang Xiang; Hui Wang; Xixin Wu; Zhiyong Wu

TL;DR

The paper tackles a bottleneck in building empathetic spoken dialogue systems: the lack of natural spoken empathetic question-answering (QA) data. It introduces Listen-Perceive-Express (LPE), a two-stage training framework that first aligns speech content and emotion to a frozen LLM, then applies Chain-of-Thought (CoT) prompting to generate empathetic responses without QA supervision. The training objective is a joint loss $Loss = L_{decoder} + \lambda \cdot L_{emotion}$ with $\lambda = 0.1$. Empirically, Stage 2 improves emotion classification by about 30%, and zero-shot CoT with predefined steps yields the best balance between empathy and content. LPE outperforms cascaded baselines and end-to-end speech LLMs on both objective and subjective metrics. The approach reduces reliance on costly QA data and offers a practical, low-cost path to empathetic spoken dialogue, although it currently produces text output rather than synthesized speech.
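As a rough illustration of the joint loss stated above, the following minimal sketch combines a decoder term and an emotion term with $\lambda = 0.1$. The summary does not say how $L_{decoder}$ and $L_{emotion}$ are computed; treating both as cross-entropy losses is an assumption made here, and the function names `cross_entropy` and `joint_loss` are hypothetical.

```python
import math

LAMBDA = 0.1  # weight on the emotion term, as stated in the TL;DR


def cross_entropy(logits, target):
    """Cross-entropy of one example from raw logits (log-sum-exp for stability)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(token_logits, token_targets, emotion_logits, emotion_target, lam=LAMBDA):
    """Loss = L_decoder + lam * L_emotion (both terms assumed to be cross-entropy)."""
    # L_decoder: mean cross-entropy over the target text tokens (assumed form).
    l_decoder = sum(
        cross_entropy(logits, target)
        for logits, target in zip(token_logits, token_targets)
    ) / len(token_targets)
    # L_emotion: cross-entropy over the emotion classes (assumed form).
    l_emotion = cross_entropy(emotion_logits, emotion_target)
    return l_decoder + lam * l_emotion
```

With a small weight such as 0.1, the emotion term acts as an auxiliary signal while the decoder term dominates the gradient.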

Abstract

Empathetic dialogue is crucial for natural human-computer interaction: it allows a dialogue system to respond in a more personalized and emotionally aware manner, improving user satisfaction and engagement. The emergence of large language models (LLMs) has revolutionized dialogue generation, and their powerful capabilities have shown potential in multimodal domains. Many studies have integrated speech with text-based LLMs so that the system takes a spoken question as input and outputs a text response. However, the lack of spoken question-answering datasets that include speech style information for supervised fine-tuning (SFT) limits the performance of these systems. As a result, while these systems excel at understanding speech content, they often struggle to generate empathetic responses. In response, we propose a novel approach that circumvents the need for question-answering data, called Listen, Perceive, and Express (LPE). Our method employs a two-stage training process that first guides the LLM to listen to the content and perceive the emotional aspects of speech. We then use Chain-of-Thought (CoT) prompting to unlock the model's potential for expressing empathetic responses based on the spoken content it has listened to and the emotional cues it has perceived. We conduct experiments to demonstrate the effectiveness of the proposed method. To our knowledge, this is the first attempt to leverage CoT for speech-based dialogue.
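The listen, perceive, express sequence described in the abstract can be pictured as a step-by-step prompt. The sketch below only illustrates the idea of zero-shot CoT with predefined steps; the step wording, the `<speech>` placeholder, and the function `build_cot_prompt` are hypothetical and are not the prompt used in the paper.

```python
# Hypothetical step descriptions; the paper's actual prompt text is not given here.
STEPS = (
    "Listen: transcribe what the speaker said.",
    "Perceive: identify the speaker's emotion.",
    "Express: reply empathetically, grounded in the content and the emotion.",
)


def build_cot_prompt(speech_placeholder="<speech>"):
    """Assemble a zero-shot prompt that asks for the three steps in order."""
    lines = [f"Input: {speech_placeholder}", "Follow these steps in order:"]
    lines += [f"{i}. {step}" for i, step in enumerate(STEPS, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_cot_prompt())
```

Fixing the order of the steps in the prompt means the response is generated only after the content and the emotion have been made explicit, which is the intuition behind using CoT in place of QA supervision.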

Paper Structure (16 sections, 1 equation, 2 figures, 5 tables)

Introduction
Proposed Method
Model Architecture
Two-Stage Training
Chain of Thought
Experiments
Training Setups
Test Setup
Baselines
Generation Evaluation Metrics
Results
Two-stage Training Analysis
Generation Analysis
CoT Analysis
Ablation Study
...and 1 more sections

Figures (2)

Figure 1: The architecture of the proposed model is shown on the left. On the right is a real sample showing how the model listens, perceives, and expresses. () denotes the transcription of the input speech. $<>$ denotes the emotion of the speech. We use $<>$ to highlight each step of LPE; these markers are not part of our prompt.
Figure 2: Win rate compared with the cascaded model.
