From Measurement to Expertise: Empathetic Expert Adapters for Context-Based Empathy in Conversational AI Agents

Erfan Shayegani, Jina Suh, Andy Wilson, Nagu Rangan, Javier Hernandez

TL;DR

This work targets the gap between generic empathy in conversational AI and the need for task- and context-specific empathy. It analyzes real-world SENSE-7 data to uncover how user expectations of empathy vary by task and how perceived empathy correlates with satisfaction. Building on these insights, the authors construct a synthetic multi-turn data generation pipeline, define task-specific empathy patterns, and train four context-specific empathetic expert adapters (LoRA-based) on frozen LLM backbones, guided by both generative and learning-based reward models. Across LLMs of different scales, the adapters outperform both the Baseline and System Prompt configurations at maintaining empathy and aligning it with user expectations, particularly in long multi-turn conversations, with practical gains in user satisfaction, robustness, and privacy-preserving evaluation. The approach offers a concrete path to deploying context-aware empathetic agents in real-world settings, with implications for RLHF integration, mixture-of-experts architectures, and ethical deployment.
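
To make the "LoRA-based adapters on frozen backbones" idea concrete, here is a minimal sketch of a single low-rank adapted linear layer in plain Python. It is an illustration of the general LoRA formulation, not the authors' training code: the class name `LoRALinear`, the toy 2x2 weight, the rank, and the short adapter keys are all assumptions made for the example. The mapping of one adapter to each of the four task clusters named in Figure 2 is likewise an assumption.

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, W, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen backbone weight: shared, never updated
        # A gets a small random init; B starts at zero so the adapter
        # is a no-op before any training happens.
        self.A = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.W, x)                    # frozen path
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path
        return [b + self.scale * d for b, d in zip(base, delta)]


# Hypothetical setup: one adapter per task cluster, all sharing the same W.
W = [[1.0, 0.0], [0.0, 1.0]]
clusters = ["distressing", "learning", "career", "writing"]  # illustrative keys
adapters = {name: LoRALinear(W, rank=1, seed=i) for i, name in enumerate(clusters)}
```

Only `A` and `B` would be trained per task cluster, so each expert stays small and the backbone weights are reused unchanged across all four.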

Abstract

Empathy is a critical factor in fostering positive user experiences in conversational AI. While models can display empathy, it is often generic rather than tailored to specific tasks and contexts. In this work, we introduce a novel framework for developing and evaluating context-specific empathetic large language models (LLMs). We first analyze a real-world conversational dataset consisting of 672 multi-turn conversations across 8 tasks, revealing significant differences in terms of expected and experienced empathy before and after the conversations, respectively. To help minimize this gap, we develop a synthetic multi-turn conversational generation pipeline and steer responses toward our defined empathy patterns based on the context that more closely matches users' expectations. We then train empathetic expert adapters for context-specific empathy that specialize in varying empathy levels based on the recognized task. Our empirical results demonstrate a significant gap reduction of 72.66% between perceived and desired empathy with scores increasing by an average factor of 2.43 as measured by our metrics and reward models. Additionally, our trained empathetic expert adapters demonstrate superior effectiveness in preserving empathy patterns throughout conversation turns, outperforming system prompts, which tend to dramatically diminish in impact as conversations lengthen.
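
The abstract reports a percentage reduction in the gap between perceived and desired empathy. The exact formula is not stated in this section, so the following is only one plausible reading: the gap is the absolute difference between the desired score and the observed score, and the reduction is the relative shrinkage of that gap once an adapter is applied. The function name and the example scores are hypothetical and are not taken from the paper.

```python
def gap_reduction_pct(desired, baseline, adapted):
    """Percent by which the |desired - observed| gap shrinks after adaptation.

    Assumed definition (not confirmed by the source):
        100 * (gap_before - gap_after) / gap_before
    """
    gap_before = abs(desired - baseline)
    gap_after = abs(desired - adapted)
    if gap_before == 0:
        raise ValueError("baseline already matches the desired level")
    return 100.0 * (gap_before - gap_after) / gap_before


# Illustrative numbers only: desired empathy 4.0, baseline model 2.0,
# adapter 3.5 on some arbitrary rating scale.
example = gap_reduction_pct(desired=4.0, baseline=2.0, adapted=3.5)  # 75.0
```

Under this reading, a value of 100 would mean the adapter matches the desired level exactly, and a negative value would mean the adapter moved further from it than the baseline.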

Paper Structure

This paper contains 38 sections, 2 equations, 16 figures, 10 tables.

Figures (16)

  • Figure 1: Our approach consists of multiple stages: extracting insights from real human-AI interactions, defining task-specific empathy patterns, generating synthetic conversations, and steering them for preference datasets. We then measure empathy using task-specific and generic reward models followed by an alignment stage where context-specific empathetic expert adapters are trained to enhance empathetic responses.
  • Figure 2: System prompt for multi-turn coherency. The TASK_CLUSTER variable can be one of: Distressing/Social/Personal Situations, Learning Skills, Work Issues/Career/Self-Improvement, and Work Assignment/Help with Writing.
  • Figure 3: The trained preference model's predictions. $\beta$ sets the sensitivity and sharpness of the preference model: the smaller $\beta$, the sharper the predictions. 'Chosen' corresponds to the empathetic steered conversations, while 'Rejected' corresponds to the non-empathetic steered conversations. The preference model has successfully learned to assign higher scores to our defined empathy patterns and lower scores to the non-empathetic conversations.
  • Figure 4: The generic reward model's predictions. MSE = 0.0301, MAE = 0.1335, Correlation (Ground Truth, Predictions) = 0.43.
  • Figure 5: Comparison of empathy levels across different tasks, illustrating the effectiveness of context-specific empathetic expert adapters in aligning with pre-desired empathy levels. Each task shows the pre-desired empathy (black bars), post-task inherent empathy of LLMs (red bars), and post-adapter empathy (maroon bars). This work aims to precisely calibrate empathy in AI responses to match the desired level specified by task and context requirements. As seen, the maroon bars (context-specific empathetic expert adapters) consistently align more closely with the black bars, outperforming the inherent empathy responses of the LLM (red bars). Results are averaged across both Llama-3 and Phi-3 models, demonstrating the effectiveness of our empathetic expert adapters in achieving precise empathy alignment tailored to the task and user context.
  • ...and 11 more figures
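
The Figure 4 caption summarizes the generic reward model with three standard regression metrics: MSE, MAE, and the correlation between ground truth and predictions. As a reference for how such numbers are typically computed, here is a small standard-library sketch. It assumes the reported correlation is Pearson's r, which the caption does not specify, and the function names are illustrative rather than taken from the paper.

```python
import math


def _check(y_true, y_pred):
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")


def mse(y_true, y_pred):
    """Mean squared error."""
    _check(y_true, y_pred)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    _check(y_true, y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def pearson(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    _check(y_true, y_pred)
    mt = sum(y_true) / len(y_true)
    mp = sum(y_pred) / len(y_pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mt) ** 2 for t in y_true)
    var_p = sum((p - mp) ** 2 for p in y_pred)
    if var_t == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_t * var_p)
```

Reading the three together is useful: low MSE and MAE with only a moderate correlation, as in the caption, can occur when the ground-truth scores span a narrow range, so small absolute errors do not by themselves imply that the predictions rank conversations well.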