PSYCHE: A Multi-faceted Patient Simulation Framework for Evaluation of Psychiatric Assessment Conversational Agents

Jingoo Lee, Kyungho Lim, Young-Chul Jung, Byung-Hoon Kim

TL;DR

PSYCHE introduces a construct-grounded evaluation framework for psychiatric assessment conversational agents (PACAs). It simulates patients (PSYCHE-SP) from a multi-faceted construct (MFC) and scores PACA performance against ground-truth constructs (Construct-SP) via the PSYCHE RUBRIC to yield the PSYCHE SCORE. The approach emphasizes clinical relevance, ethical safety, cost efficiency, and quantitative measurability, and is validated with 10 board-certified psychiatrists across seven disorders. Results show high conformity of PSYCHE-SP utterances (85–97%, average 93%) and a strong correlation between PSYCHE SCORE and expert scores (r = 0.8486), with moderate convergent validity with PIQSCA (r = 0.6367). The work demonstrates the framework’s robustness to weight settings, supports safer, scalable PACA benchmarking, and offers a pathway to extend construct-grounded evaluation to other psychiatric or medical assessment domains.

Abstract

Recent advances in large language models (LLMs) have accelerated the development of conversational agents capable of generating human-like responses. Since psychiatric assessments typically involve complex conversational interactions between psychiatrists and patients, there is growing interest in developing LLM-based psychiatric assessment conversational agents (PACAs) that aim to simulate the role of psychiatrists in clinical evaluations. However, standardized methods for benchmarking the clinical appropriateness of PACAs' interactions with patients remain underexplored. Here, we propose PSYCHE, a novel framework designed to enable the 1) clinically relevant, 2) ethically safe, 3) cost-efficient, and 4) quantitative evaluation of PACAs. This is achieved by simulating psychiatric patients based on a multi-faceted psychiatric construct that defines the simulated patients' profiles, histories, and behaviors, which PACAs are expected to assess. We validate the effectiveness of PSYCHE through a study with 10 board-certified psychiatrists, supported by an in-depth analysis of the simulated patient utterances.

Paper Structure (8 sections, 12 figures, 15 tables)


Figures (12)

  • Figure 1: PSYCHE: Our proposed framework for evaluating psychiatric assessment conversational agents (PACAs). The multi-faceted construct (MFC) is generated to create simulated patients of PSYCHE (PSYCHE-SPs), implementing construct-grounded patient utterance simulation. Parts of the MFC are also utilized to evaluate the PACA. With this, PSYCHE enables clinically relevant, ethically safe, cost-efficient, and quantitative evaluation. The figure illustrates (a) and (b) conventional approaches compared to (c) the proposed PSYCHE approach, highlighting how PSYCHE addresses the limitations of conventional methods.
  • Figure 2: A schematic illustration of the PSYCHE framework. The process flows through four stages: (a) user input of desired diagnosis/age/sex for psychiatric assessment conversational agent (PACA) evaluation, (b) stepwise multi-faceted construct (MFC) generation of profile, history, and behavior for simulated patient (SP), (c) utterance simulation between PSYCHE's SP (PSYCHE-SP) fed with the MFC and PACA, and (d) evaluation session conducting construct-grounded evaluation.
  • Figure 3: Heatmap of conformity scores (%) for PSYCHE-SP simulating each of the seven target disorders across 24 elements within the multi-faceted construct (MFC). The elements corresponding to the x-axis labels belong to either categories of MFC-Profile (e.g., chief complaint, present illness) or MFC-Behavior. The heatmap displays conformity percentages, with color gradients indicating the degree of conformity (low: light blue, high: dark blue).
  • Figure 4: Scatter plots of PSYCHE SCORE versus expert score or PIQSCA, and correlation heatmaps for weight-correlation analysis. (a) Scatter plot showing strong correlation between PSYCHE and expert scores ($r = 0.8486, p < 0.0001$) across four PACA types with five evaluations each ($n=20$), with 'guided prompt' versions consistently receiving higher evaluations than 'basic prompt' versions. (b) Scatter plot illustrating moderate positive correlation between PSYCHE and PIQSCA scores ($r = 0.6367, p = 0.0025$) for the same set of evaluations, validating PSYCHE's alignment with established interview quality metrics. Both scatter plots differentiate between model types and prompts: GPT and Claude models with either basic (circles) or guided (stars) prompts, with regression lines and 95% confidence intervals shown in blue. (c) Correlation heatmap between PSYCHE and expert scores under varying importance weights ($w_{\text{Impulsivity}}$ and $w_{\text{Behavior}}$), showing robust correlations ranging from 0.78 to 0.94. The purple square ($\blacksquare$) indicates the selected weights ($w_{\text{Impulsivity}} = 5, w_{\text{Behavior}} = 2, w_{\text{Subjective}} = 1$). (d) Correlation heatmap with expert score weights fixed at ($w_{\text{Impulsivity}} = 5, w_{\text{Behavior}} = 2, w_{\text{Subjective}} = 1$), demonstrating that the chosen PSYCHE weights fall within an optimal range.
  • Figure 5: Ablation study results comparing NoMFC (simply instructed to simulate the target disorder), NoMFCBehavior (PSYCHE-SP without MFC-Behavior), and PSYCHE-SP (our proposed simulated patient model) across three categories: Speech Characteristics and Thought Process, Mood, and Affect. Error bars represent standard deviation. Asterisks (*) indicate statistical significance ($p < 0.05$).
  • ...and 7 more figures
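
The Figure 4 caption reports Pearson correlations with their p-values for $n = 20$ evaluations. As a minimal sketch of how such figures relate to each other, the Python below computes a Pearson coefficient and converts a reported $r$ and $n$ into a two-sided p-value through the standard $t$ statistic, $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n-2$ degrees of freedom. The function names are hypothetical, the paper's own analysis code is not shown in this excerpt, and the test it used is assumed here to be the usual two-sided $t$-test for a correlation. The integration is a plain Simpson rule using only the standard library, not a library-grade routine.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing a Pearson r against zero correlation."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1.0 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(r, n, steps=2000):
    """Two-sided p-value for r with n samples.

    Integrates the t density over [0, |t|] with Simpson's rule
    (steps must be even), then takes the mass outside [-|t|, |t|].
    """
    df = n - 2
    t = abs(t_statistic(r, n))
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_half = total * h / 3.0
    return max(0.0, 1.0 - 2.0 * central_half)


if __name__ == "__main__":
    # Values of r and n are taken from the Figure 4 caption.
    for label, r in [("expert score", 0.8486), ("PIQSCA", 0.6367)]:
        print(label, "t =", round(t_statistic(r, 20), 3), "p =", two_sided_p(r, 20))
```

Under these assumptions, $r = 0.6367$ with $n = 20$ gives $t \approx 3.50$ and a p-value close to the reported $0.0025$, and $r = 0.8486$ gives a p-value well below $0.0001$, so the caption's numbers are internally consistent with a two-sided test on 18 degrees of freedom.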