Empathy by Design: Aligning Large Language Models for Healthcare Dialogue

Emre Umucu, Guillermina Solis, Leon Garza, Emilia Rivas, Beatrice Lee, Anantaa Kotal, Aritran Piplai

TL;DR

This work tackles the risk of inaccurate and impersonal guidance from general-purpose LLMs in caregiver–patient conversations by introducing a Direct Preference Optimization (DPO)-based alignment pipeline. By constructing a caregiver-focused QA dataset and training with paired preferred/rejected responses, the approach directly optimizes for empathy, simplicity, and factual accuracy, with LoRA providing parameter-efficient fine-tuning. Evaluation across semantic, factual, and human-centric metrics shows that DPO-tuned LLaMA-based systems achieve superior semantic alignment, stronger factual grounding, and more empathetic, readable, and appropriately formal dialogue compared with baselines and some commercial systems. The results support a scalable, transparent path to trustworthy, patient- and caregiver-oriented AI assistants in geriatrics and dementia care, with open-source releases to enable replication and extension.

Abstract

General-purpose large language models (LLMs) have demonstrated remarkable generative and reasoning capabilities but remain limited in healthcare and caregiving applications due to two key deficiencies: factual unreliability and a lack of empathetic communication. These shortcomings pose significant risks in sensitive contexts where users, particularly non-professionals and caregivers, seek medically relevant guidance or emotional reassurance. To address these challenges, we introduce a Direct Preference Optimization (DPO)-based alignment framework designed to improve factual correctness, semantic coherence, and human-centric qualities such as empathy, politeness, and simplicity in caregiver-patient dialogues. Our approach fine-tunes domain-adapted LLMs using pairwise preference data, where preferred responses reflect supportive and accessible communication styles while rejected ones represent prescriptive or overly technical tones. This direct optimization method aligns model outputs with human preferences more efficiently than traditional reinforcement-learning-based alignment. Empirical evaluations across multiple open and proprietary LLMs show that our DPO-tuned models achieve higher semantic alignment, improved factual accuracy, and stronger human-centric evaluation scores compared to baseline and commercial alternatives such as Google medical dialogue systems. These improvements demonstrate that preference-based alignment offers a scalable and transparent pathway toward developing trustworthy, empathetic, and clinically informed AI assistants for caregiver and healthcare communication. Our open-source code is available at: https://github.com/LeonG19/Empathy-by-Design
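The pairwise objective described above can be illustrated with the standard DPO loss for a single preferred/rejected pair. This is a minimal sketch, not the authors' training code: the function names, the example log-probabilities, and the value `beta=0.1` are assumptions made for illustration, and a real pipeline would compute sequence log-probabilities from the policy and a frozen reference model.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_preferred: float,
    policy_logp_rejected: float,
    ref_logp_preferred: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; the strength of the implicit KL constraint
) -> float:
    """Standard DPO loss for one (prompt, preferred, rejected) triple.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    preferred_margin = policy_logp_preferred - ref_logp_preferred
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    # The loss falls as the preferred margin grows relative to the rejected one.
    return -log_sigmoid(beta * (preferred_margin - rejected_margin))


# Hypothetical log-probabilities: the policy has shifted toward the
# empathetic answer and away from the overly technical one.
loss = dpo_loss(
    policy_logp_preferred=-8.0,
    policy_logp_rejected=-14.0,
    ref_logp_preferred=-10.0,
    ref_logp_rejected=-12.0,
)
print(f"DPO loss for this pair: {loss:.4f}")
```

When the policy and reference assign identical log-probabilities, the loss equals log 2; it drops below that value only when the policy favors the preferred response more strongly than the reference does.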

Paper Structure

This paper contains 21 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the proposed DPO fine-tuning workflow for healthcare language models. Web-scraped Alzheimer’s resources are transformed into paired question–answer examples, where preferred (A$^{+}$) responses emphasize empathy and factual accuracy, and rejected (A$^{-}$) responses contain undesirable traits. Direct Preference Optimization (DPO) aligns the base model toward the preferred behavior, producing a fine-tuned LLM capable of compassionate and reliable caregiving dialogue.
  • Figure 2: Overview of the Modified BERT-Score evaluation process. Instead of measuring lexical overlap, this version prompts a large language model to assess factual consistency between a candidate and reference answer. The evaluator ignores style or grammar and focuses on factual accuracy, completeness, and contradictions, assigning a score from 0.0 to 1.0 in 0.1 steps. Scores $\geq 0.9$ indicate full factual alignment. This prompt-based metric complements G-Eval Correctness and NLI Consistency to capture fine-grained factual reliability in caregiver-oriented healthcare QA.
  • Figure 3: Benchmark of Llama 3.1-8B and its adaptations across metrics for each evaluation type. For visual consistency, semantic similarity (SS:E) and G-Eval scores, originally ranging from 0–1, were multiplied by 10 to align with the scale of the readability metric (FK-GL). Higher SS:E scores indicate stronger alignment between generated answers and their corresponding reference responses. Higher G-Eval scores indicate greater correctness. Lower FK-GL scores indicate responses that are more linguistically accessible.
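
To make the readability axis in Figure 3 concrete, the sketch below computes a Flesch-Kincaid Grade Level (FK-GL) and applies the ×10 rescaling the caption describes for 0–1 metrics. It uses the standard FK-GL formula, but the syllable counter is a crude vowel-group heuristic and the function names are hypothetical; the paper's own tooling is not specified here, so scores from this sketch would likely differ from the reported ones.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (heuristic, not exact)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level: lower means easier to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (
        0.39 * (len(words) / len(sentences))
        + 11.8 * (syllables / len(words))
        - 15.59
    )


def to_plot_scale(score_0_to_1: float) -> float:
    """Rescale a 0-1 metric (e.g. SS:E or G-Eval) to 0-10, as in the caption."""
    return score_0_to_1 * 10


# Illustrative sentences, not drawn from the paper's dataset.
plain = "Take a slow walk. Rest when you need to."
dense = "Pharmacological interventions necessitate comprehensive multidisciplinary evaluation."
print(f"plain FK-GL: {flesch_kincaid_grade(plain):.2f}")
print(f"dense FK-GL: {flesch_kincaid_grade(dense):.2f}")
```

The plain-language sentences score a much lower grade level than the jargon-heavy one, which is the direction the caption treats as more accessible for caregivers.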