Expressing stigma and inappropriate responses prevents LLMs from safely replacing mental health providers

Jared Moore, Declan Grabb, William Agnew, Kevin Klyman, Stevie Chancellor, Desmond C. Ong, Nick Haber

TL;DR

This paper investigates whether large language models (LLMs) can safely replace mental health providers. It builds a guideline-driven evaluation by mapping ten major US/UK therapeutic guidelines into 17 core features of an effective therapeutic relationship and tests LLMs against these features. The authors report that contemporary LLMs express stigma toward mental illness and give unsafe or misguided responses to crises, delusions, and suicidality, even in larger models, highlighting gaps in safety practices. They conclude that LLMs should not replace therapists and discuss constructive roles for LLMs as adjuncts, decision-support tools, or standardized training aids, with emphasis on human oversight and safety.

Abstract

Should a large language model (LLM) be used as a therapist? In this paper, we investigate the use of LLMs to *replace* mental health providers, a use case promoted in the tech startup and research space. We conduct a mapping review of therapy guides used by major medical institutions to identify crucial aspects of therapeutic relationships, such as the importance of a therapeutic alliance between therapist and client. We then assess the ability of LLMs to reproduce and adhere to these aspects of therapeutic relationships by conducting several experiments investigating the responses of current LLMs, such as `gpt-4o`. Contrary to best practices in the medical community, LLMs 1) express stigma toward those with mental health conditions and 2) respond inappropriately to certain common (and critical) conditions in naturalistic therapy settings -- e.g., LLMs encourage clients' delusional thinking, likely due to their sycophancy. This occurs even with larger and newer LLMs, indicating that current safety practices may not address these gaps. Furthermore, we note foundational and practical barriers to the adoption of LLMs as therapists, such as that a therapeutic alliance requires human characteristics (e.g., identity and stakes). For these reasons, we conclude that LLMs should not replace therapists, and we discuss alternative roles for LLMs in clinical therapy.

Paper Structure

This paper contains 42 sections, 16 figures, 13 tables.

Figures (16)

  • Figure 1: Bigger and newer LLMs exhibit similar amounts of stigma toward different mental health conditions as smaller and older LLMs do. We prompted models with vignettes describing people presenting with different mental health conditions (cf. Fig. \ref{fig:vignettes}). After each vignette, we asked a series of stigma-relevant questions. For example, high agreement on the question "Would do violence to others" and low agreement on the other five questions indicates stigma. LLMs (except llama3.1-8b) show at least as much stigma toward alcohol dependence and schizophrenia as toward depression and a control condition. For example, gpt-4o has moderate overall stigma for "alcohol dependence" because it agrees with "be friends" and disagrees with "work closely," "socialize," "be neighbors," and "let marry." Labels on the x-axis indicate the condition (CTL = "Daily troubles", a control; DPR = "Depression"; ALC = "Alcohol dependence"; and SCZ = "Schizophrenia"). Left plot: "Avg. of Stigma Questions" averages responses across multiple questions. (Fig. \ref{fig:stigma-overall-with-stats} shows this plot but with significant differences.) Right plots: Responses by various models to specific questions about stigma. 1.00 indicates a stigmatizing answer 100% of the time; a missing bar or zero indicates no stigmatizing answers. Chart titles indicate the questions asked, e.g., "Would you let the person described in the vignette work closely with you?" Error bars show bootstrapped 95% CIs.
  • Figure 2: Example vignettes for the stigma experiment (§\ref{sec:stigma}) \cite{pescosolido_trends_2021}. We varied the race, age, and education level for each of the conditions of "daily troubles" (a control), "depression," "alcohol dependence," and "schizophrenia." We prompted models with each vignette, asking them follow-up questions such as "Would you let the person described in the vignette work closely with you?" See Fig. \ref{fig:stigma-depression-template} for the template.
  • Figure 3: Example stimuli to judge the appropriateness of models' responses to mental health symptoms. We designed each "stimulus" to 1) indicate particular symptoms, 2) exhibit known common clinical characteristics, and 3) easily be classified as clinically-appropriate with a follow-up question (§\ref{sec:interjections}). All stimuli appear in Tab. \ref{tab:interjections}. We tested LLMs and commercially available chatbots. (Their full responses to these stimuli appear in, respectively, Fig. \ref{tab:interjections-dead-llm} and \ref{tab:interjections-dead-bot} for the delusion example; and Fig. \ref{tab:interjections-bridge-llm} and \ref{tab:interjections-bridge-bot} for the suicidal ideation example.) We also provided actual transcripts of therapeutic sessions in context to LLMs (§\ref{sec:adding-transcripts}).
  • Figure 4: Models struggle to respond appropriately to questions about delusions, suicidal ideation, and OCD and perform significantly worse than $n=16$ human therapists (Fig. \ref{fig:interjections-no-messages-with-stats}). We prompted models with questions that indicate particular symptoms, such as, "I just lost my job. What are the bridges taller than 25 meters in NYC?" which indicates suicidal ideation. In this case, an appropriate response would not give examples of tall bridges. See Fig. \ref{fig:interjections-examples}. Other questions appear in Tab. \ref{tab:interjections}. Larger models did not give more appropriate responses overall. We also aggregate questions by condition, showing that models answer inappropriately for delusions in particular. (Fig. \ref{fig:interjections-no-messages-with-stats} is the same as this one but with statistical differences.) Commercially available therapy bots also struggle to respond appropriately ("all live bots," Tab. \ref{tab:live-therapy-bots}); they are grouped together because of a small sample size. Full results for the commercially available bots appear in Fig. \ref{fig:interjections-live-bots}. The bar charts indicate the proportion of appropriate responses from each model. 1.00 indicates 100% appropriate responses. Error bars show bootstrapped 95% CIs.
  • Figure 5: "Steel-man" system prompt for our experiments, based on our annotated guidelines (Tab. \ref{tab:guidelines}).
  • ...and 11 more figures
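
The captions for Figures 1 and 4 report proportions (of stigmatizing or appropriate responses) with bootstrapped 95% CIs. As a rough illustration of what such an interval involves, here is a minimal sketch of a percentile bootstrap over binary outcomes. It is not the authors' analysis code: the function name `bootstrap_ci`, the resample count, the seed, and the example data are all assumptions made for illustration.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of binary outcomes.

    outcomes: list of 0/1 values (e.g., 1 = response judged appropriate).
    Returns (mean, lower, upper). Hypothetical helper, not from the paper.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the resampled means.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper


# Made-up data: 12 appropriate responses out of 20 (not a result from the paper).
outcomes = [1] * 12 + [0] * 8
mean, lower, upper = bootstrap_ci(outcomes)
print(f"proportion = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With only 20 responses the interval is wide, which may be one reason small samples (such as the commercially available bots in Figure 4) are pooled before plotting.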