Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge

Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan

TL;DR

This work tackles the challenge of evaluating safety alignment for LLM responses in high-risk Chinese mental health dialogues without gold-standard references. It introduces PsyCrisis, a reference-free evaluation framework that uses an expert-grounded LLM-as-Judge with in-context chain-of-thought reasoning to score five binary safety dimensions and produce interpretable rationales. The authors curate a Chinese crisis dataset of 608 real utterances covering suicidal ideation, non-suicidal self-injury, and existential distress, and show that their method achieves higher agreement with human experts than baselines, along with better explanation quality. The framework and dataset are publicly released, offering a practical, scalable approach to safety evaluation in sensitive mental health NLP and a foundation for responsible AI in high-stakes crisis contexts.

Abstract

Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult due to missing gold-standard answers and the ethically sensitive nature of these interactions. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether the model responses align with the safety principles defined by experts. Specifically designed for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales compared to existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.
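To make the binary point-wise scoring concrete, the following is a minimal sketch, not the authors' implementation. It assumes five dimensions (the count given in the TL;DR) with placeholder names `dim_1` to `dim_5`, since the dimension labels are not stated here. The `Judgment` class and `agreement` function are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass

# Placeholder labels (assumption): the source states only that there are
# five binary safety dimensions, not what they are called.
DIMENSIONS = ("dim_1", "dim_2", "dim_3", "dim_4", "dim_5")


@dataclass(frozen=True)
class Judgment:
    """One evaluator verdict on one response (hypothetical structure)."""
    scores: dict      # dimension name -> 0 or 1
    rationales: dict  # dimension name -> short explanation text

    def __post_init__(self):
        # Reject missing dimensions and non-binary scores early.
        for dim in DIMENSIONS:
            if self.scores.get(dim) not in (0, 1):
                raise ValueError(f"score for {dim} must be 0 or 1")

    def total(self) -> int:
        """Number of dimensions the response satisfies."""
        return sum(self.scores[dim] for dim in DIMENSIONS)


def agreement(judge: Judgment, expert: Judgment) -> float:
    """Fraction of dimensions on which judge and expert give the same score."""
    matches = sum(judge.scores[d] == expert.scores[d] for d in DIMENSIONS)
    return matches / len(DIMENSIONS)


judge = Judgment(
    scores={"dim_1": 1, "dim_2": 1, "dim_3": 0, "dim_4": 1, "dim_5": 0},
    rationales={d: "placeholder rationale" for d in DIMENSIONS},
)
expert = Judgment(
    scores={"dim_1": 1, "dim_2": 0, "dim_3": 0, "dim_4": 1, "dim_5": 0},
    rationales={d: "placeholder rationale" for d in DIMENSIONS},
)
print(judge.total(), agreement(judge, expert))
```

Keeping one score and one rationale per dimension is what makes each verdict traceable: a disagreement with an expert can be located at a specific dimension rather than in a single opaque overall rating.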

Paper Structure

This paper contains 39 sections, 1 equation, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Overall Framework of PsyCrisis. (1) Dataset Curation: Real-user utterances are collected, filtered, and categorized by risk type, focusing on high-risk scenarios such as suicidal ideation and self-harm. (2) Dialogue Task: The LLM assistant generates open-ended responses to user utterances expressing acute emotional distress. (3) Evaluation: Using another LLM as the evaluator, responses are assessed against multiple expert-defined safety dimensions with binary point-wise scoring, producing interpretable and traceable evaluation results without gold-standard answers as reference.
  • Figure 2: Agreement between model-generated and expert safety ratings. Models include Gemma-3, LLaMA-3.2, GPT-4o-2024-08-06, Claude 4, and Qwen3. GPT-4o shows the highest alignment across all safety dimensions.
  • Figure 3: Distribution of scoring bias between our LLM-based evaluations and expert annotations. Positive values on the horizontal axis indicate model over-alignment; negative values indicate under-alignment.
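The scoring bias described for Figure 3 can be illustrated with a short sketch. The exact definition used in the paper is not given here, so this assumes bias is the evaluator's total score minus the expert's total score for the same response; `scoring_bias` and `bias_distribution` are hypothetical helper names, and the numbers are made-up examples, not results from the paper.

```python
from collections import Counter


def scoring_bias(judge_totals, expert_totals):
    """Per-response bias, assumed here to be judge total minus expert total.

    Positive values mean the evaluator scored higher than the expert;
    negative values mean it scored lower.
    """
    if len(judge_totals) != len(expert_totals):
        raise ValueError("both lists must cover the same responses")
    return [j - e for j, e in zip(judge_totals, expert_totals)]


def bias_distribution(biases):
    """Count how often each bias value occurs, ordered from low to high."""
    return dict(sorted(Counter(biases).items()))


# Illustrative totals only (each is a sum of binary dimension scores).
judge_totals = [3, 5, 2, 4]
expert_totals = [3, 4, 4, 4]

biases = scoring_bias(judge_totals, expert_totals)
print(bias_distribution(biases))
```

A distribution concentrated at zero would indicate that the evaluator rarely departs from expert totals, while a skew toward positive or negative values would indicate systematic over- or under-scoring.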