TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots

Fangrui Huang, Souhad Chbeir, Arpandeep Khatua, Sheng Wang, Sijun Tan, Kenan Ye, Lily Bailey, Merryn Daniel, Ryan Louie, Sanmi Koyejo, Ehsan Adeli

Abstract

Large language models (LLMs) are increasingly used for mental-health support; yet prevailing evaluation methods--fluency metrics, preference tests, and generic dialogue benchmarks--fail to capture the clinically critical dimensions of psychotherapy. We introduce THERAPYGYM, a framework that evaluates and improves therapy chatbots along two clinical pillars: fidelity and safety. Fidelity is measured using the Cognitive Therapy Rating Scale (CTRS), implemented as an automated pipeline that scores adherence to CBT techniques over multi-turn sessions. Safety is assessed using a multi-label annotation scheme, covering therapy-specific risks (e.g., failing to address harm or abuse). To mitigate bias and unreliability in LLM-based judges, we further release THERAPYJUDGEBENCH, a validation set of 116 dialogues with 1,270 expert ratings for auditing and calibration against licensed clinicians. THERAPYGYM also serves as a training harness: CTRS and safety-based rewards drive RL with configurable patient simulations spanning diverse symptom profiles. Models trained in THERAPYGYM improve on expert ratings, with average CTRS rising from 0.10 to 0.60 (and 0.16 to 0.59 under LLM judges). Our work enables scalable development of therapy chatbots that are faithful to evidence-based practice and safer in high-stakes use.

Paper Structure (31 sections, 3 equations, 7 figures, 16 tables)

Figures (7)

  • Figure 1: Illustration of the TherapyGym workflow. (a) Judge benchmark panel (left): TherapyJudgeBench, a dialogue bank with expert annotations for judge validation. (b) Judge panel (middle): the TherapyJudge evaluates conversations, with its judgments validated against TherapyJudgeBench. (c) RL finetuning panel (right): the LLM therapist is finetuned via reinforcement learning using feedback from the TherapyJudge within the conversation environment.
  • Figure 2: Illustration of conversation labeling. Left: sample dialogue between a simulated patient and an LLM therapist (10 turns; some turns omitted for clarity). Right: dialogue-level annotations from both human and LLM raters. We score the 11 CBT--CTRS aspects on a 0--6 scale (0 = poor, 3 = satisfactory, 6 = excellent; aspect-specific anchors follow the official CTRS rubric), and mark four safety aspects as binary ticks (present/absent). Human and LLM raters use the same CTRS scales, and inter-rater agreement between them is calculated on the dialogue-level labels.
  • Figure 3: Mean normalized scores (0–1) on nine CTRS CBT skills for the Base model and the same model fine-tuned with GRPO. Outward shifts indicate higher competency across skills.
  • Figure 4: Qualitative comparison of conversations before and after training. Left: conversation before training. Right: conversation after training. Patient utterances are abbreviated for readability; therapist responses are excerpted from the original dialogue. The post-training conversation shows more of the indicators associated with higher CTRS scores. Detailed indicator definitions and additional matched dialogue examples are provided in the appendix on qualitative analysis.
  • Figure 5: Human labeling website. Left: a conversation between the simulated patient and the LLM therapist. Right: scoring of the CTRS aspects (0 to 6) and the safety aspects (true or false).
  • ...and 2 more figures
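
The dialogue-level labels described in the Figure 2 caption (CTRS aspects scored 0 to 6, safety aspects marked present or absent) can be pictured as a small record. The sketch below is illustrative only and is not taken from the paper: the class and function names, the example aspect and flag names, the divide-by-6 normalization to the 0–1 range, and the use of mean absolute difference as a stand-in agreement measure are all assumptions.

```python
from dataclasses import dataclass
from statistics import mean

CTRS_MAX = 6  # top of the 0-6 CTRS scale given in the Figure 2 caption


@dataclass
class DialogueLabel:
    """Hypothetical dialogue-level annotation from one rater (human or LLM)."""
    ctrs: dict[str, int]     # aspect name -> integer score in 0..6
    safety: dict[str, bool]  # safety aspect name -> True if the risk is present

    def __post_init__(self) -> None:
        # Reject scores that fall outside the CTRS range.
        for name, score in self.ctrs.items():
            if not 0 <= score <= CTRS_MAX:
                raise ValueError(f"{name}: score {score} outside 0..{CTRS_MAX}")


def normalized_ctrs(label: DialogueLabel) -> float:
    """Mean CTRS score rescaled to 0-1 (assumed normalization: score / 6)."""
    return mean(score / CTRS_MAX for score in label.ctrs.values())


def mean_abs_difference(a: DialogueLabel, b: DialogueLabel) -> float:
    """Mean absolute score gap over aspects both raters scored.

    A simple stand-in for an agreement statistic; the paper's actual
    inter-rater metric is not stated in this section.
    """
    shared = a.ctrs.keys() & b.ctrs.keys()
    if not shared:
        raise ValueError("raters share no CTRS aspects")
    return mean(abs(a.ctrs[k] - b.ctrs[k]) for k in shared)


# Example with two aspects and one made-up safety flag name.
human = DialogueLabel({"agenda": 3, "feedback": 6}, {"missed_risk": False})
judge = DialogueLabel({"agenda": 4, "feedback": 6}, {"missed_risk": False})
print(normalized_ctrs(human), mean_abs_difference(human, judge))
```

Keeping the human and LLM labels in the same structure makes it straightforward to compare them aspect by aspect, which is the kind of dialogue-level comparison the caption describes.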