Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

Haoqiu Yan; Yongxin Zhu; Kai Zheng; Bing Liu; Haoyu Cao; Deqiang Jiang; Linli Xu

TL;DR

PerceptiveAgent addresses a gap in empathetic dialogue by integrating perceptual acoustic information into a multi-modal pipeline. The system combines a speech captioner to extract prosodic cues, an LLM as cognitive core to infer intent, and an MSMA-Synthesizer to produce expressive speech conditioned on content and captions. Empirical results show improvements in both linguistic alignment (cognitive empathy) and vocal expressiveness (affective empathy) over text-only baselines, with ablations confirming the value of captions and style factors. This approach advances human-like dialogue by making AI responses more contextually aware and emotionally engaging, with potential for broader language coverage and real-world applications.

Abstract

Large Language Model (LLM)-enhanced agents are becoming increasingly prevalent in Human-AI communication, offering vast potential from entertainment to professional domains. However, current multi-modal dialogue systems overlook the acoustic information present in speech, which is crucial for understanding the nuances of human communication. This oversight can lead to misinterpretations of speakers' intentions, resulting in inconsistent or even contradictory responses within dialogues. To bridge this gap, we propose PerceptiveAgent, an empathetic multi-modal dialogue system that integrates speech modality perception to discern deeper or more subtle meanings beyond the literal interpretation of words. Employing LLMs as a cognitive core, PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language. Experimental results indicate that PerceptiveAgent excels in contextual understanding: it accurately discerns speakers' true intentions in scenarios where the linguistic meaning is contrary to or inconsistent with the speaker's true feelings, and it produces more nuanced and expressive spoken dialogues. Code is publicly available at: https://github.com/Haoqiu-Yan/PerceptiveAgent.
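
The three-stage flow described above (perceive acoustic style, reason with an LLM, synthesize expressive speech) can be illustrated with a minimal sketch. This is not taken from the authors' repository: the names `Turn`, `build_prompt`, and `respond`, the prompt wording, and the separate `transcribe` step are all hypothetical, and each component is passed in as a plain callable so that any captioner, LLM, or synthesizer could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    speaker: str
    text: str      # transcript of the utterance (assumed to come from an ASR step)
    caption: str   # natural-language description of how it was spoken


def build_prompt(history: List[Turn]) -> str:
    """Interleave transcripts with style captions so the LLM sees both."""
    lines = ["Each turn is followed by a description of how it was spoken."]
    for turn in history:
        lines.append(f"{turn.speaker}: {turn.text} [style: {turn.caption}]")
    lines.append("Reply with a response and a speaking style for it.")
    return "\n".join(lines)


def respond(
    audio_history: List[Tuple[str, bytes]],
    transcribe: Callable[[bytes], str],       # hypothetical ASR component
    caption: Callable[[bytes], str],          # stands in for the speech captioner
    llm: Callable[[str], Tuple[str, str]],    # returns (reply text, reply style)
    synthesize: Callable[[str, str], bytes],  # stands in for the synthesizer
) -> bytes:
    """Run one dialogue step: perceive, reason, then speak."""
    history = [
        Turn(speaker, transcribe(audio), caption(audio))
        for speaker, audio in audio_history
    ]
    reply_text, reply_style = llm(build_prompt(history))
    return synthesize(reply_text, reply_style)
```

Keeping the caption as plain text next to each transcript mirrors the idea of describing speaking styles in natural language: an utterance such as "I'm fine." paired with a caption like "low, trembling voice" gives the LLM the evidence it needs to respond to the feeling rather than the literal words.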

Paper Structure (26 sections, 5 figures, 5 tables)

Introduction
Related Work
Multi-modal Dialogue Systems
Cross-Modal Text Generation
Expressive Text-to-Speech Synthesis
Methods
Speech Captioner
Multi-modal Embedding Alignment
Instruction Tuning
PerceptiveAgent
Caption for Intention Discerning
Comprehension through Sensory Integration
Expressive Speech Synthesis
Experiments
Experimental Setup
...and 11 more sections

Figures (5)

Figure 1: Examples illustrating the definition of empathy within dialogues.
Figure 2: The overall architecture of PerceptiveAgent. Three components are interconnected: the speech captioner, the LLM and the MSMA-Synthesizer. The speech captioner serves as a multi-modal sensory system, perceiving acoustic information from the dialogue history, which is crucial for discerning the speakers' intentions. The LLM acts as the cognitive core, responsible for comprehending the speakers' thoughts and emotions. Conditioned on the response contents and multiple attributes provided by the LLM, the MSMA-Synthesizer generates expressive speech outputs.
Figure 3: Cases comparing the response quality between Speech-GPT3.5 and PerceptiveAgent.
