Between Help and Harm: An Evaluation of Mental Health Crisis Handling by LLMs
Adrian Arnaiz-Rodriguez, Miguel Baidal, Erik Derner, Jenn Layton Annable, Mark Ball, Mark Ince, Elvira Perez Vallejos, Nuria Oliver
TL;DR
The paper systematically benchmarks state-of-the-art general-purpose LLMs on mental health crisis handling by establishing a unified six-category crisis taxonomy, curating a clinically informed evaluation dataset, and designing an LLM-based judging protocol for crisis detection and response safety. It demonstrates substantial model-to-model variability in detection and safety, with some models providing strongly appropriate responses while others generate harmful or unsafe content, especially for self-harm and suicidal ideation inputs. The authors propose technical, operational, and governance recommendations, including proactive safety prompts, global and context-aware resource localization, uninterrupted crisis access, and FATEN-aligned stewardship, to advance safer, more accountable AI-assisted mental health support. The work provides publicly useful resources—taxonomy, dataset, and evaluation framework—that can accelerate ongoing research and inform responsible deployment of LLMs in high-risk mental health settings.
Abstract
Large language model-powered chatbots have transformed how people seek information, especially in high-stakes contexts like mental health. Despite their support capabilities, safe detection and response to crises such as suicidal ideation and self-harm are still unclear, hindered by the lack of unified crisis taxonomies and clinical evaluation standards. We address this by creating: (1) a taxonomy of six crisis categories; (2) a dataset of over 2,000 inputs from 12 mental health datasets, classified into these categories; and (3) a clinical response assessment protocol. We also use LLMs to identify crisis inputs and audit five models for response safety and appropriateness. First, we built a clinical-informed crisis taxonomy and evaluation protocol. Next, we curated 2,252 relevant examples from over 239,000 user inputs, then tested three LLMs for automatic classification. In addition, we evaluated five models for the appropriateness of their responses to a user's crisis, graded on a 5-point Likert scale from harmful (1) to appropriate (5). While some models respond reliably to explicit crises, risks still exist. Many outputs, especially in self-harm and suicidal categories, are inappropriate or unsafe. Different models perform variably; some, like gpt-5-nano and deepseek-v3.2-exp, have low harm rates, but others, such as gpt-4o-mini and grok-4-fast, generate more unsafe responses. All models struggle with indirect signals, default replies, and context misalignment. These results highlight the urgent need for better safeguards, crisis detection, and context-aware responses in LLMs. They also show that alignment and safety practices, beyond scale, are crucial for reliable crisis support. Our taxonomy, datasets, and evaluation methods support ongoing AI mental health research, aiming to reduce harm and protect vulnerable users.
