Exploring Safety Alignment Evaluation of LLMs in Chinese Mental Health Dialogues via LLM-as-Judge
Yunna Cai, Fan Wang, Haowei Wang, Kun Wang, Kailai Yang, Sophia Ananiadou, Moyan Li, Mingming Fan
TL;DR
This work tackles the challenge of evaluating the safety alignment of LLM responses in high-risk Chinese mental health dialogues, where no gold-standard references exist. It introduces PsyCrisis-Bench, a reference-free evaluation framework that uses an expert-grounded LLM-as-Judge with in-context chain-of-thought reasoning to score five binary safety dimensions and produce interpretable rationales. The authors curate a Chinese crisis dataset of 608 real utterances covering suicidal ideation, non-suicidal self-injury, and existential distress, and show that their method agrees with human experts more closely than baselines while producing higher-quality explanations. The framework and dataset are publicly released, offering a practical, scalable approach to safety evaluation in sensitive mental health NLP and a foundation for responsible AI in high-stakes crisis contexts.
Abstract
Evaluating the safety alignment of LLM responses in high-risk mental health dialogues is particularly difficult because gold-standard answers are missing and the interactions are ethically sensitive. To address this challenge, we propose PsyCrisis-Bench, a reference-free evaluation benchmark based on real-world Chinese mental health dialogues. It evaluates whether model responses align with safety principles defined by experts. Designed specifically for settings without standard references, our method adopts a prompt-based LLM-as-Judge approach that conducts in-context evaluation using expert-defined reasoning chains grounded in psychological intervention principles. We employ binary point-wise scoring across multiple safety dimensions to enhance the explainability and traceability of the evaluation. Additionally, we present a manually curated, high-quality Chinese-language dataset covering self-harm, suicidal ideation, and existential distress, derived from real-world online discourse. Experiments on 3,600 judgments show that our method achieves the highest agreement with expert assessments and produces more interpretable evaluation rationales than existing approaches. Our dataset and evaluation tool are publicly available to facilitate further research.
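As a rough illustration of binary point-wise scoring, the following Python sketch shows how per-dimension 0/1 judgments might be validated, summed into a response-level score, and compared against expert labels. It is a sketch under assumptions, not the authors' implementation: the dimension names `dim_1` to `dim_5` are placeholders (the TL;DR states only that there are five binary dimensions), and simple percent agreement stands in for whatever agreement metric the paper actually reports.

```python
from typing import Dict, List

# Placeholder dimension names (assumption): the source gives the count, not the labels.
DIMENSIONS = ("dim_1", "dim_2", "dim_3", "dim_4", "dim_5")


def validate(scores: Dict[str, int]) -> Dict[str, int]:
    """Check that a judgment covers every dimension with a 0/1 value."""
    if set(scores) != set(DIMENSIONS):
        raise ValueError("scores must cover exactly the expected dimensions")
    if any(v not in (0, 1) for v in scores.values()):
        raise ValueError("each dimension score must be 0 or 1")
    return scores


def total_score(scores: Dict[str, int]) -> int:
    """Sum the binary dimension scores into a single response-level score."""
    return sum(validate(scores).values())


def agreement_rate(judge: List[Dict[str, int]], expert: List[Dict[str, int]]) -> float:
    """Fraction of (response, dimension) pairs on which judge and expert agree.

    Percent agreement is an illustrative choice, not the paper's stated metric.
    """
    if len(judge) != len(expert) or not judge:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(
        validate(j)[d] == validate(e)[d]
        for j, e in zip(judge, expert)
        for d in DIMENSIONS
    )
    return matches / (len(judge) * len(DIMENSIONS))


# Toy labels for two responses; invented for illustration only.
judge = [
    {"dim_1": 1, "dim_2": 1, "dim_3": 0, "dim_4": 1, "dim_5": 0},
    {"dim_1": 0, "dim_2": 0, "dim_3": 0, "dim_4": 0, "dim_5": 0},
]
expert = [
    {"dim_1": 1, "dim_2": 1, "dim_3": 1, "dim_4": 1, "dim_5": 0},
    {"dim_1": 0, "dim_2": 0, "dim_3": 0, "dim_4": 0, "dim_5": 0},
]

print(total_score(judge[0]))          # 3
print(agreement_rate(judge, expert))  # 0.9
```

Keeping each dimension as a separate binary decision means a disagreement with an expert can be traced to one specific criterion, which is the traceability benefit the abstract attributes to point-wise scoring.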
