Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs

Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Elvira Perez Vallejos, Nuria Oliver

TL;DR

The paper systematically benchmarks state-of-the-art general-purpose LLMs on mental health crisis handling by establishing a unified six-category crisis taxonomy, curating a clinically informed evaluation dataset, and designing an LLM-based judging protocol for crisis detection and response safety. It demonstrates substantial model-to-model variability in detection and safety: some models provide strongly appropriate responses, while others generate harmful or unsafe content, especially for self-harm and suicidal ideation inputs. The authors propose technical, operational, and governance recommendations, including proactive safety prompts, global and context-aware resource localization, uninterrupted crisis access, and FATEN-aligned stewardship, to advance safer, more accountable AI-assisted mental health support. The work provides publicly useful resources (taxonomy, dataset, and evaluation framework) that can accelerate ongoing research and inform responsible deployment of LLMs in high-risk mental health settings.

Abstract

Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their support capabilities, it remains unclear whether they can safely detect and respond to crises such as suicidal ideation and self-harm, an assessment hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this by creating: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs from 12 mental health datasets, classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinically informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs, then tested three LLMs for automatic classification. In addition, we evaluated five models for the appropriateness of their responses to a user's crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks still exist. Many outputs, especially in self-harm and suicidal categories, are inappropriate or unsafe. Performance varies across models; some, like gpt-5-nano and deepseek-v3.2-exp, have low harm rates, but others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, beyond scale, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing AI mental health research, aiming to reduce harm and protect vulnerable users.

Paper Structure

This paper contains 23 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Methodology. 1. Dataset Curation (left): From an aggregation of $n{\approx}239$k user textual inputs from 12 publicly available datasets for mental health research, 206 and 2,046 examples are selected as validation and test set examples, respectively. 2. Crisis Category Classification Validation: The validation set ($n{=}206$) is labeled by three state-of-the-art LLMs and four domain experts according to a taxonomy with six mental health crisis categories (suicidal ideation, self-harm, anxiety crisis, violent thoughts, substance abuse or withdrawal, and risk-taking behaviors) and a no-crisis label. The agreement between human annotators and the LLMs is quantified using Cohen's Kappa. As a result of this process, the LLM with the highest agreement with humans (gpt-4o-mini) is selected to annotate the test set. 3. Automatic Crisis Category Classification: Each entry in the test set ($n{=}2{,}046$) is automatically labeled according to the taxonomy by the best performing LLM, namely gpt-4o-mini. 4. LLM responses to User inputs: Five state-of-the-art LLMs (gpt-4o-mini, gpt-5-nano, llama-4-scout, deepseek-v3.2, and grok-4-fast) are probed three times to generate responses for each entry in the test set. 5. Crisis Response Evaluation: The appropriateness of each of the responses of the LLMs is evaluated by an LLM following a psychologist-designed protocol. Responses are rated on a 1-5 scale, ranging from harmful (1) to fully appropriate (5).
  • Figure 2: Left: Pipeline applied to each user input and LLM. The Crisis Category Classification module leverages the LLM-as-a-judge technique to assign a mental health crisis category to the user input. In parallel, the evaluated LLM provides a Response to the same user input. Each response is scored for appropriateness (according to a 5-point Likert scale) by the Crisis Response Evaluation module, using the LLM-as-a-judge technique that follows the evaluation protocol designed by domain experts. Right: Conversation example. An example of a user input labeled in the category of suicidal ideation and the corresponding LLM response, rated as harmful.
  • Figure 3: Crisis category classification pipeline. Left: In the validation stage, three LLMs (each run three times) and four human experts independently labeled the validation set of $206$ user inputs. Agreement between each pair of LLM and human annotations was quantified using Cohen's Kappa, and the model with the highest mean agreement was selected for the second stage. In the second stage, the best-performing model (gpt-4o-mini) was used to label the full dataset ($2{,}046$ samples).
  • Figure 4: Aggregate evaluation results for the three runs per LLM ($n=6{,}132$ per LLM). We report (a) the mean evaluation score with its $95\%$ CI; (b) the average self-agreement of the evaluator LLM (mean standard deviation) with its $95\%$ CI; and (c) the probability (%) of receiving a score within the lowest bins: $[1, 2.3]$ and $(2.3, 3.6]$. The bar segment labeled Score $= 1$ (hatched) is overlaid on the $[1, 2.3]$ bin to specifically highlight the probability (%) of receiving a maximally harmful score.
  • Figure 5: Distribution of Low Safety Scores ($\le 3.6$) per LLM and Mental Health Crisis Category. Bars show the combined percentage of responses scoring between $1$ and $3.6$. The overall low-score distribution is split into: Score = 1 (hatched area), [1, 2.3] (orange), and (2.3, 3.6] (blue).
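
The captions above rely on two small computations: Cohen's Kappa between a pair of annotators (Figures 1 and 3) and the assignment of a rating to the low-score bins $[1, 2.3]$ and $(2.3, 3.6]$ (Figures 4 and 5). The following is a minimal Python sketch of both, not the authors' code. The function names `cohen_kappa` and `score_bin`, the label strings, and the `(3.6, 5]` catch-all bin are hypothetical. The sketch also assumes that the binned value is the mean of several 1-5 ratings of one response, which the non-integer bin edges suggest but the captions do not state.

```python
from collections import Counter
from statistics import mean


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def score_bin(score):
    """Map a (mean) 1-5 rating to the bins named in Figures 4 and 5.

    The "(3.6, 5]" label is an assumption; the captions only name
    the two low-score bins.
    """
    if not 1 <= score <= 5:
        raise ValueError("score must lie in [1, 5]")
    if score <= 2.3:
        return "[1, 2.3]"
    if score <= 3.6:
        return "(2.3, 3.6]"
    return "(3.6, 5]"


# Made-up labels for four inputs, for illustration only.
human = ["suicidal ideation", "no crisis", "suicidal ideation", "no crisis"]
model = ["suicidal ideation", "no crisis", "no crisis", "no crisis"]
kappa = cohen_kappa(human, model)

# Made-up ratings of one response, averaged before binning.
low_bin = score_bin(mean([1, 2, 3]))
```

In this toy example the two annotators agree on three of four items, with a chance agreement of 0.5, so `kappa` is 0.5. The mean rating of 2 falls into the `[1, 2.3]` bin.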