TrustMH-Bench: A Comprehensive Benchmark for Evaluating the Trustworthiness of Large Language Models in Mental Health

Zixin Xiong; Ziteng Wang; Haotian Fan; Xinjie Zhang; Wenxuan Wang

TL;DR

TrustMH-Bench is a holistic framework that systematically quantifies the trustworthiness of mental health LLMs, evaluating models across eight core pillars: Reliability, Crisis Identification and Escalation, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics.

Abstract

While Large Language Models (LLMs) demonstrate significant potential in providing accessible mental health support, their practical deployment raises critical trustworthiness concerns due to the domain's high-stakes and safety-sensitive nature. Existing evaluation paradigms for general-purpose LLMs fail to capture mental health-specific requirements, highlighting an urgent need to prioritize and enhance their trustworthiness. To address this, we propose TrustMH-Bench, a holistic framework designed to systematically quantify the trustworthiness of mental health LLMs. By establishing a deep mapping from domain-specific norms to quantitative evaluation metrics, TrustMH-Bench evaluates models across eight core pillars: Reliability, Crisis Identification and Escalation, Safety, Fairness, Privacy, Robustness, Anti-sycophancy, and Ethics. We conduct extensive experiments across six general-purpose LLMs and six specialized mental health models. Experimental results indicate that the evaluated models underperform across various trustworthiness dimensions in mental health scenarios, revealing significant deficiencies. Notably, even generally powerful models (e.g., GPT-5.1) fail to maintain consistently high performance across all dimensions. Consequently, systematically improving the trustworthiness of LLMs has become a critical task. Our data and code are released.

Paper Structure (78 sections, 1 equation, 10 figures, 24 tables)

Introduction
Related Work
TrustMH-Bench
Reliability
Knowledge
Emotion Recognition
Psychological Diagnosis
Emotional Support
Psychological Intervention
Crisis Identification and Escalation
Crisis Identification
Crisis Escalation
Safety
Jailbreak Resistance
Toxicity
...and 63 more sections

Figures (10)

Figure 1: Overview of our framework.
Figure 2: Overall performance rankings of the evaluated models.
Figure 3: Perturbation pipeline.
Figure 4: Normalized confusion matrices for crisis classification task across LLMs. Each matrix evaluates classification performance across seven crisis-related categories: anxiety_crisis (AC), no_crisis (NC), risk_taking_behaviours (RB), self-harm (SH), substance_abuse_or_withdrawal (SW), suicidal_ideation (SI), and violent_thoughts (VT). Diagonal elements represent correct classifications, while off-diagonal entries reveal systematic misclassification patterns among these clinically relevant categories.
Figure 5: Normalized confusion matrices for all evaluated models on the five-level suicide risk severity identification task (C-SSRS). Each subplot corresponds to one model—including general-purpose and mental health-specialized LLMs—illustrating prediction distributions against expert annotations and revealing systematic error patterns across risk levels.
...and 5 more figures
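The normalized confusion matrices described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes row normalization (each gold-label row sums to 1), which the caption does not specify. The function name `normalized_confusion_matrix` and the toy labels are hypothetical and are not taken from the paper's released code or data.

```python
from collections import Counter

# Category abbreviations as listed in the Figure 4 caption.
LABELS = ["AC", "NC", "RB", "SH", "SW", "SI", "VT"]


def normalized_confusion_matrix(gold, pred, labels=LABELS):
    """Return a row-normalized confusion matrix as a list of lists.

    Row i is the gold label and column j is the predicted label, so every
    non-empty row sums to 1. Row normalization is an assumption here; the
    caption does not state which axis is normalized.
    """
    counts = Counter(zip(gold, pred))
    matrix = []
    for g in labels:
        row = [counts[(g, p)] for p in labels]
        total = sum(row)
        # Labels that never occur in the gold data get an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy, made-up labels purely for illustration (not data from the paper).
gold = ["SI", "SI", "SH", "NC", "NC", "AC"]
pred = ["SI", "SH", "SH", "NC", "AC", "AC"]
cm = normalized_confusion_matrix(gold, pred)

# Diagonal entry: fraction of gold suicidal_ideation cases predicted as such.
si = LABELS.index("SI")
sh = LABELS.index("SH")
print(f"SI -> SI: {cm[si][si]:.2f}, SI -> SH: {cm[si][sh]:.2f}")
```

Under this convention a diagonal entry is the per-category recall, and an off-diagonal entry such as SI -> SH shows how often one clinically relevant category is mistaken for another.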
