Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction

Haoqiu Yan, Yongxin Zhu, Kai Zheng, Bing Liu, Haoyu Cao, Deqiang Jiang, Linli Xu

TL;DR

PerceptiveAgent addresses a gap in empathetic dialogue by integrating perceptual acoustic information into a multi-modal pipeline. The system combines a speech captioner to extract prosodic cues, an LLM as cognitive core to infer intent, and an MSMA-Synthesizer to produce expressive speech conditioned on content and captions. Empirical results show improvements in both linguistic alignment (cognitive empathy) and vocal expressiveness (affective empathy) over text-only baselines, with ablations confirming the value of captions and style factors. This approach advances human-like dialogue by making AI responses more contextually aware and emotionally engaging, with potential for broader language coverage and real-world applications.

Abstract

Large Language Model (LLM)-enhanced agents are becoming increasingly prevalent in human-AI communication, offering vast potential from entertainment to professional domains. However, current multi-modal dialogue systems overlook the acoustic information present in speech, which is crucial for understanding the nuances of human communication. This oversight can lead to misinterpretations of speakers' intentions, resulting in inconsistent or even contradictory responses within dialogues. To bridge this gap, we propose PerceptiveAgent, an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings beyond the literal interpretation of words by integrating perception of the speech modality. Employing LLMs as a cognitive core, PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language. Experimental results indicate that PerceptiveAgent excels in contextual understanding: it accurately discerns speakers' true intentions in scenarios where the linguistic meaning is contrary to or inconsistent with the speaker's true feelings, and it produces more nuanced and expressive spoken dialogues. Code is publicly available at: https://github.com/Haoqiu-Yan/PerceptiveAgent
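
The flow described above (caption the input speech, let the LLM reason over content plus speaking style, then synthesize a styled reply) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: every name here (`Turn`, `perceive`, `build_prompt`, `parse_reply`, `respond`), the prompt wording, and the `text | style` reply format are assumptions made for the example, and the transcriber, captioner, LLM, and synthesizer are passed in as plain callables rather than real models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Turn:
    speaker: str
    text: str     # what was said (transcribed content)
    caption: str  # how it was said, as a natural-language style description


def perceive(audio: bytes, speaker: str,
             transcribe: Callable[[bytes], str],
             caption: Callable[[bytes], str]) -> Turn:
    """Turn raw speech into content plus a speaking-style caption."""
    return Turn(speaker, transcribe(audio), caption(audio))


def build_prompt(history: List[Turn]) -> str:
    """Hypothetical prompt: expose both the words and the style to the LLM."""
    lines = ["Reply empathetically. Each turn lists the words and the speaking style."]
    for turn in history:
        lines.append(f"{turn.speaker}: {turn.text} [style: {turn.caption}]")
    lines.append("Answer as: <response text> | <speaking style>")
    return "\n".join(lines)


def parse_reply(reply: str) -> Tuple[str, str]:
    """Split the assumed 'text | style' format; fall back to a neutral style."""
    text, _, style = reply.partition("|")
    return text.strip(), style.strip() or "neutral"


def respond(history: List[Turn],
            llm: Callable[[str], str],
            synthesize: Callable[[str, str], bytes]) -> bytes:
    """Cognitive core (LLM) decides content and style; the synthesizer renders it."""
    text, style = parse_reply(llm(build_prompt(history)))
    return synthesize(text, style)
```

Passing the components as callables keeps the sketch independent of any particular speech or language model; in a real system each would wrap the corresponding trained component.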

Paper Structure

This paper contains 26 sections, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Examples illustrating the definition of empathy within dialogues.
  • Figure 2: The overall architecture of PerceptiveAgent. Three components are interconnected: the speech captioner, the LLM and the MSMA-Synthesizer. The speech captioner serves as a multi-modal sensory system, perceiving acoustic information from the dialogue history, which is crucial for discerning the speakers' intentions. The LLM acts as the cognitive core, responsible for comprehending the speakers' thoughts and emotions. Conditioned on the response contents and multiple attributes provided by the LLM, the MSMA-Synthesizer generates expressive speech outputs.
  • Figure 3: Cases comparing the response quality between Speech-GPT3.5 and PerceptiveAgent.