A Computational Framework for Behavioral Assessment of LLM Therapists

Yu Ying Chiu, Ashish Sharma, Inna Wanyin Lin, Tim Althoff

TL;DR

This paper proposes BOLT, a proof-of-concept computational framework to systematically assess the conversational behavior of LLM therapists. The analysis reveals that LLMs often resemble behaviors more commonly exhibited in low-quality therapy than in high-quality therapy but, unlike low-quality therapy, LLMs reflect significantly more upon clients' needs and strengths.

Abstract

The emergence of large language models (LLMs) like ChatGPT has increased interest in their use as therapists to address mental health challenges and the widespread lack of access to care. However, experts have emphasized the critical need for systematic evaluation of LLM-based mental health interventions to accurately assess their capabilities and limitations. Here, we propose BOLT, a proof-of-concept computational framework to systematically assess the conversational behavior of LLM therapists. We quantitatively measure LLM behavior across 13 psychotherapeutic approaches with in-context learning methods. Then, we compare the behavior of LLMs against high- and low-quality human therapy. Our analysis based on Motivational Interviewing therapy reveals that LLMs often resemble behaviors more commonly exhibited in low-quality therapy rather than high-quality therapy, such as offering a higher degree of problem-solving advice when clients share emotions. However, unlike low-quality therapy, LLMs reflect significantly more upon clients' needs and strengths. Our findings caution that LLM therapists still require further research for consistent, high-quality care.

Paper Structure

This paper contains 37 sections, 7 figures, and 30 tables.

Figures (7)

  • Figure 1: Overview of BOLT, a computational framework that enables systematic assessment of the behavior of LLM therapists and compares it to high- and low-quality human therapy.
  • Figure 2: Difference in the frequency of conversational behaviors exhibited by LLM therapists (GPT-4, GPT-3.5-turbo, Llama2-70b, Llama2-13b), relative to average-, low-, and high-quality human therapy. A: average-quality, Low: low-quality, and High: high-quality therapy. The direction of the arrow on the x-axis indicates the direction in which the frequency is increasing (we flip the axis if low-quality is more frequent than high-quality, such that low-quality is visualized below the average-quality marker). Values colored in blue indicate desirable behaviors (significantly closer to high-quality than low-quality), whereas values colored in orange indicate undesirable behaviors (significantly closer to low-quality than high-quality). Values in gray are not statistically significantly different from average-quality at $p = \frac{0.05}{m}$ using a two-sided Student’s t-test, following Bonferroni correction (m: number of intents tested = 13). Error bars indicate 95% bootstrapped confidence intervals. A key insight is that LLMs respond with significantly higher Problem-Solving (subfigure (a)), similar to low-quality human therapy. On the other hand, LLMs respond with significantly higher Reflections on Strengths (subfigure (l)), similar to high-quality therapy, but with a frequency that significantly exceeds high-quality therapy.
  • Figure 3: Difference in the temporal order of conversational behaviors, operationalized as the turn numbers in which behaviors are first exhibited in a conversation by LLM therapists (GPT-4, GPT-3.5-turbo, Llama2-70b, Llama2-13b), relative to average-, low-, and high-quality human therapy. A: average-quality, Low: low-quality, and High: high-quality therapy. The direction of the arrow on the x-axis indicates the direction in which the order is increasing (we flip the axis if low-quality is exhibited later than high-quality, such that low-quality is visualized below the average-quality marker). Values colored in blue indicate desirable behaviors (significantly closer to high-quality than low-quality), whereas values colored in orange indicate undesirable behaviors (significantly closer to low-quality than high-quality). Values in gray are not statistically significantly different from average-quality at $p = \frac{0.05}{m}$ using a two-sided Student’s t-test, following Bonferroni correction (m: number of intents tested = 13). Error bars indicate 95% bootstrapped confidence intervals. Most LLM therapists start providing Planning (subfigure (b)) earlier in the conversations but provide Normalizing (subfigure (j)) later, against common recommendations [cochran2015heart].
  • Figure 4: Difference between the frequency of conversational behaviors observed in LLM therapists (GPT-4, GPT-3.5-turbo, Llama2-70b, Llama2-13b) or low-quality human therapy in response to specific client behaviors (Adaptability), relative to average-, low-, and high-quality human therapy. A: average-quality, Low: low-quality, and High: high-quality therapy. The direction of the arrow on the x-axis indicates the direction in which the frequency is increasing (we flip the axis if low-quality is more frequent than high-quality, such that low-quality is visualized below the average-quality marker). Values colored in blue indicate desirable behaviors (significantly closer to high-quality than low-quality), whereas values colored in orange indicate undesirable behaviors (significantly closer to low-quality than high-quality). Values in gray are not statistically significantly different from average-quality at $p = \frac{0.05}{m}$ using a two-sided Student’s t-test, following Bonferroni correction (m: number of (client, therapist) intent pairs tested = $13 \times 6 = 78$). Error bars indicate 95% bootstrapped confidence intervals. Here, a key finding is that LLMs respond with significantly lower Questions on Emotions when clients express Sustaining Unhealthy Behavior (subfigure (b)), similar to low-quality human therapy.
  • Figure 5: We incorporate simple prompts that aim to calibrate LLM therapists, specifically to (a) increase Questions on Experiences, (b) decrease Problem-Solving, and (c) decrease Normalizing. Subfigures show changes in the frequency of conversational behaviors based on changes in prompts to different LLM therapists (GPT-4, GPT-3.5-turbo, Llama2-70b, Llama2-13b). Changes to individual LLMs are shown in pairs (left: original prompt; right: modulated prompt). The corresponding high-quality human therapy behavior frequency is shown as green dashed lines. For instance, GPT-4 increases the frequency of Questions on Experiences from 29.9% to 57.0% with the modulated prompt. In general, we find that only GPT-4 consistently shifts behavior frequency in the desired direction by a statistically and practically significant amount, whereas the modulation is inconsistent for other models. Error bars indicate 95% bootstrapped confidence intervals.
  • ...and 2 more figures
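
The significance criterion and error bars named in the captions for Figures 2 to 4 can be illustrated with a minimal sketch. It computes the Bonferroni-corrected threshold $\frac{0.05}{m}$ for m = 13 and m = 78, and a 95% percentile-bootstrap confidence interval for a mean. The function names, the sample data, the number of resamples, and the choice of the percentile method are assumptions made for illustration; the captions do not specify how the authors implemented these steps.

```python
import random
import statistics


def bonferroni_alpha(alpha, m):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / m


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    n_resamples and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lo_idx = int(tail * n_resamples)
    hi_idx = int((1 - tail) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Thresholds matching the captions: 13 therapist intents (Figures 2 and 3)
# and 13 * 6 = 78 (client, therapist) intent pairs (Figure 4).
alpha_intents = bonferroni_alpha(0.05, 13)
alpha_pairs = bonferroni_alpha(0.05, 13 * 6)

# Hypothetical per-conversation frequencies of one behavior (made-up data).
freqs = [0.10, 0.25, 0.30, 0.15, 0.40, 0.20, 0.35, 0.05]
ci_low, ci_high = bootstrap_ci(freqs)

print(f"alpha (m=13): {alpha_intents:.5f}")
print(f"alpha (m=78): {alpha_pairs:.5f}")
print(f"95% bootstrap CI for the mean: ({ci_low:.3f}, {ci_high:.3f})")
```

Under this criterion, a p-value from a two-sided t-test would count as significant only if it falls below the corrected threshold, which is stricter for the 78 intent pairs than for the 13 intents.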