Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis

May Lynn Reese, Markela Zeneli, Mindy Ng, Jacob Haimes, Andreea Damien, Elizabeth Stade

Abstract

General-purpose Large Language Models (LLMs) are increasingly adopted for mental health support. Yet emerging evidence suggests significant risks associated with high-frequency use, particularly for individuals experiencing psychosis, as LLMs may reinforce delusions and hallucinations. Existing evaluations of LLMs in mental health contexts are limited by a lack of clinical validation and by assessment methods that do not scale. To address these issues, this research focuses on psychosis as a critical condition for LLM safety evaluation by (1) developing and validating seven clinician-informed safety criteria, (2) constructing a human-consensus dataset, and (3) testing automated assessment using a single LLM as an evaluator (LLM-as-a-Judge) or the majority vote of several LLM judges (LLM-as-a-Jury). Results indicate that LLM-as-a-Judge aligns closely with the human consensus (Cohen's $\kappa_{\text{human} \times \text{gemini}} = 0.75$, $\kappa_{\text{human} \times \text{qwen}} = 0.68$, $\kappa_{\text{human} \times \text{kimi}} = 0.56$) and that the best single judge slightly outperforms LLM-as-a-Jury (Cohen's $\kappa_{\text{human} \times \text{jury}} = 0.74$). Overall, these findings have promising implications for clinically grounded, scalable LLM safety evaluations in mental health contexts.
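To make the two evaluation setups concrete, the sketch below aggregates three judges' binary criterion labels by majority vote (LLM-as-a-Jury) and scores each judge and the jury against the human consensus with Cohen's kappa via scikit-learn. This is illustrative only, not the authors' code: the judge names are reused from the abstract, and the labels are invented.

```python
# Minimal sketch (not the paper's pipeline): LLM-as-a-Jury majority vote
# and Cohen's kappa agreement with the human consensus.
from sklearn.metrics import cohen_kappa_score

# One binary label per (transcript, criterion) pair; 1 = criterion met.
# All labels below are invented for illustration.
human_consensus = [1, 0, 1, 1, 0, 0, 1, 0]
judge_labels = {
    "gemini": [1, 0, 1, 1, 0, 1, 1, 0],
    "qwen":   [1, 0, 0, 1, 0, 1, 1, 0],
    "kimi":   [1, 1, 0, 1, 0, 0, 0, 0],
}

def jury_vote(labels_by_judge):
    """Item-wise majority vote across judges (odd jury size assumed)."""
    per_item = zip(*labels_by_judge.values())
    return [1 if sum(votes) > len(labels_by_judge) / 2 else 0
            for votes in per_item]

jury = jury_vote(judge_labels)

for name, labels in judge_labels.items():
    print(f"kappa(human, {name}) = {cohen_kappa_score(human_consensus, labels):.2f}")
print(f"kappa(human, jury)  = {cohen_kappa_score(human_consensus, jury):.2f}")
```

With these invented labels the best single judge scores above the jury, mirroring the qualitative pattern reported in the abstract; a majority vote can be dragged down when two of three judges share the same error.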


Figures (2)

  • Figure 1: Criterion-specific reliability (Cohen's $\kappa$) between the human consensus and Gemini, Qwen, and Kimi. Criteria are abbreviated as follows: Criterion 1 = "Stigmatizes", Criterion 2 = "Validates Delusion", Criterion 3 = "Embellishes", Criterion 4 = "Challenges", Criterion 5 = "No Referral", Criterion 6 = "Provides Non-Referral Advice", and Criterion 7 = "Continues Conversation".
  • Figure 2: Criterion-specific reliability (Cohen's $\kappa$) between the human consensus and the jury of three models. Criteria are abbreviated as in Figure 1.
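The criterion-specific reliabilities plotted in both figures amount to stratifying the human/judge label pairs by criterion before scoring. A minimal sketch of that computation, assuming a simple record layout and invented labels (neither is from the paper's actual data):

```python
# Minimal sketch (assumed data layout, not the authors' pipeline):
# per-criterion Cohen's kappa, as plotted in Figures 1 and 2.
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

CRITERIA = ["Stigmatizes", "Validates Delusion", "Embellishes", "Challenges",
            "No Referral", "Provides Non-Referral Advice", "Continues Conversation"]

# Each record: (criterion index, human consensus label, judge label).
# Labels are invented for illustration.
records = [
    (0, 1, 1), (0, 0, 0), (1, 1, 0), (1, 1, 1),
    (2, 0, 0), (2, 1, 1), (3, 0, 1), (3, 0, 0),
]

# Collect (human, judge) label lists separately for each criterion.
by_criterion = defaultdict(lambda: ([], []))
for crit, human, judge in records:
    by_criterion[crit][0].append(human)
    by_criterion[crit][1].append(judge)

for crit, (human, judge) in sorted(by_criterion.items()):
    kappa = cohen_kappa_score(human, judge)
    print(f"Criterion {crit + 1} ({CRITERIA[crit]}): kappa = {kappa:.2f}")
```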