Empathy Through Multimodality in Conversational Interfaces

Mahyar Abbasian, Iman Azimi, Mohammad Feli, Amir M. Rahmani, Ramesh Jain

TL;DR

This work introduces an LLM-powered multimodal Conversational Health Agent (CHA) built on the openCHA framework to deliver emotionally resonant mental health support. By integrating speech-to-text, speech emotion detection (wav2vec2 fine-tuned on IEMOCAP), web-based information retrieval, and text-to-speech, the CHA interprets vocal cues to produce contextually appropriate, empathetic verbal responses. The system relies on a modular orchestrator with planning (Tree of Thought), execution, and memory to coordinate emotion signals with reliable health sources. It is evaluated through planning consistency and human assessments of empathy, showing strong emotion identification accuracy (89%) and generally positive empathy alignment, especially for sadness. The results underscore the importance of vocal emotion recognition in strengthening empathetic connections in CHAs and point to future work integrating additional modalities, such as facial expressions and physiological signals, for more holistic empathy.

Abstract

Agents represent one of the fastest-emerging applications of Large Language Models (LLMs) and Generative AI, with their effectiveness hinging on multimodal capabilities to navigate complex user environments. Conversational Health Agents (CHAs), a prime example of this, are redefining healthcare by offering nuanced support that transcends textual analysis to incorporate emotional intelligence. This paper introduces an LLM-based CHA engineered for rich, multimodal dialogue, especially in the realm of mental health support. It interprets and responds to users' emotional states by analyzing multimodal cues, thus delivering contextually aware and empathetically resonant verbal responses. Our implementation leverages the versatile openCHA framework, and our comprehensive evaluation involves neutral prompts expressed in diverse emotional tones: sadness, anger, and joy. We evaluate the consistency and repeatability of the planning capability of the proposed CHA. Furthermore, human evaluators critique the CHA's empathic delivery, with findings revealing a striking concordance between the CHA's outputs and evaluators' assessments. These results affirm the indispensable role of vocal (soon multimodal) emotion recognition in strengthening the empathetic connection built by CHAs, cementing their place at the forefront of interactive, compassionate digital health solutions.
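
To make the data flow concrete, here is a minimal Python sketch of one conversational turn through the stages the summary names: speech-to-text, speech emotion detection, retrieval, response generation, and text-to-speech. It is an illustration under stated assumptions, not the openCHA API or the authors' implementation. Every name in it (`Turn`, `build_prompt`, `run_turn`, and the five callables) is hypothetical, and the models are passed in as plain functions so the sketch runs without any external library.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Turn:
    """One processed user turn (hypothetical container, not part of openCHA)."""
    transcript: str
    emotion: str
    reply: str


def build_prompt(transcript: str, emotion: str, evidence: str) -> str:
    """Combine what the user said, how they sounded, and retrieved sources."""
    return (
        f"The user sounds {emotion}.\n"
        f"User said: {transcript}\n"
        f"Relevant information: {evidence}\n"
        "Respond empathetically and accurately."
    )


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],      # speech-to-text (assumed interface)
    detect_emotion: Callable[[bytes], str],  # speech emotion detection (assumed)
    retrieve: Callable[[str], str],          # web-based retrieval (assumed)
    generate: Callable[[str], str],          # LLM call (assumed)
    synthesize: Callable[[str], bytes],      # text-to-speech (assumed)
) -> Tuple[Turn, bytes]:
    """Run the stages in order and return the turn record plus spoken audio."""
    transcript = transcribe(audio)
    emotion = detect_emotion(audio)   # emotion is read from the voice, not the text
    evidence = retrieve(transcript)
    reply = generate(build_prompt(transcript, emotion, evidence))
    return Turn(transcript, emotion, reply), synthesize(reply)
```

The emotion label is derived from the audio rather than the transcript, which matches the evaluation setup described above, where neutral prompts are spoken in different emotional tones. In the described system an orchestrator plans which tools to call; the fixed call order here is a simplification.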

Paper Structure

This paper contains 12 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: LLM-based CHA for multimodal speech-based emotional support
  • Figure 2: Examples of the developed CHA answering a user voice query with sad (a) and happy (b) emotions