Multi-Objective Alignment of Language Models for Personalized Psychotherapy

Mehrab Beikzadeh, Yasaman Asadollah Salmanpour, Ashima Suvarna, Sriram Sankararaman, Matteo Malgaroli, Majid Sarrafzadeh, Saadia Gabriel

TL;DR

The authors develop a multi-objective alignment framework based on direct preference optimization (DPO) that targets six therapeutic criteria: empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy. In blinded clinician evaluation, the multi-objective DPO model (MODPO) is consistently preferred.

Abstract

Mental health disorders affect over 1 billion people worldwide, yet access to care remains limited by workforce shortages and cost constraints. While AI systems show therapeutic promise, current alignment approaches optimize objectives independently, failing to balance patient preferences with clinical safety. We survey 335 individuals with lived mental health experience to collect preference rankings across therapeutic dimensions, then develop a multi-objective alignment framework using direct preference optimization. We train reward models for six criteria -- empathy, safety, active listening, self-motivated change, trust/rapport, and patient autonomy -- and systematically compare multi-objective approaches against single-objective optimization, supervised fine-tuning, and parameter merging. Multi-objective DPO (MODPO) achieves superior balance (77.6% empathy, 62.6% safety) compared to single-objective optimization (93.6% empathy, 47.8% safety), and therapeutic criteria outperform general communication principles by 17.2%. Blinded clinician evaluation confirms MODPO is consistently preferred, with LLM-evaluator agreement comparable to inter-clinician reliability.
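
For orientation, the standard single-objective DPO loss that this line of work builds on can be written as below. This is the commonly cited general form, not a formula taken from this paper; the symbols ($\pi_\theta$ for the policy being trained, $\pi_{\mathrm{ref}}$ for the frozen reference model, $\beta$ for the regularization strength, $y_w$ and $y_l$ for the preferred and rejected responses to prompt $x$) are assumptions about notation, and the paper's own two equations may differ.

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

Optimizing this loss for a single criterion (for example, empathy alone) is the single-objective baseline the abstract contrasts with MODPO, which instead balances several criteria at once.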

Paper Structure (19 sections, 2 equations, 21 figures, 1 table)

Figures (21)

  • Figure 1: Primary therapeutic preference distribution across persona sets.
  • Figure 2: Safety-empathy performance across training approaches. Each point shows a model's average win rate computed across all pairwise head-to-head comparisons.
  • Figure 3: Statistical significance matrices from pairwise head-to-head comparisons using McNemar's test ($\alpha = 0.05$, $n = 600$). $\triangle$ = row wins, $\nabla$ = column wins, $=$ = no significant difference.
  • Figure 4: Phase 2 safety-overall preference trade-off. All trained models substantially outperform baseline on overall preference while maintaining or improving safety. Therapeutic-specific criteria (MODPO_Survey, MODPO_Survey4) achieve higher overall preference than general principles (MODPO_Maxim).
  • Figure 5: Human validation results. Win rates by criterion: bars show the fraction of questions where the clinician majority selected MODPO_Survey, base_model, or tie. Non-tie (binary) agreement with human reliability context: leave-one-out human vs. consensus is compared with LLM vs. consensus on the same non-tie subset, reporting accuracy, Cohen's $\kappa$, and Gwet's AC1; Fleiss' $\kappa$ provides overall inter-rater reliability context. Agreement levels (3-class, all questions including ties): exact-match agreement is computed for human-human (pairwise across clinician pairs), individual human-LLM, and LLM-human majority. 3-class confusion matrices for safety and for overall preference (LLM vs. clinician majority): rows are clinician-majority labels and columns are LLM predictions. LLM accuracy stratified by clinician consensus strength (non-tie questions only).
  • ...and 16 more figures
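
The significance symbols described for Figure 3 can be illustrated with a minimal sketch of an exact two-sided McNemar test. The function names `mcnemar_exact` and `significance_symbol` are hypothetical, and the sketch assumes that each head-to-head comparison is reduced to two discordant counts: `b`, the number of prompts on which only the row model wins, and `c`, the number on which only the column model wins. How the paper handles ties, and whether it uses the exact or the chi-squared form of the test, is not stated in this summary.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant counts.

    Under the null hypothesis, each discordant pair is equally likely
    to favor either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # double the smaller tail, capped at 1


def significance_symbol(b: int, c: int, alpha: float = 0.05) -> str:
    """Map a comparison to the Figure 3 notation (assumed encoding)."""
    if mcnemar_exact(b, c) >= alpha:
        return "="  # no significant difference
    return "△" if b > c else "▽"  # row wins / column wins


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise each branch.
    for b, c in [(10, 0), (0, 10), (5, 5)]:
        print(b, c, mcnemar_exact(b, c), significance_symbol(b, c))
```

The exact binomial form is used here because it needs only the standard library and stays valid when discordant counts are small; with $n = 600$ prompts the chi-squared approximation would usually give a similar verdict.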