CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Yahan Li, Jifan Yao, John Bosco S. Bunyi, Adam C. Frank, Angel Hwang, Ruishan Liu

TL;DR

CounselBench addresses the challenge of evaluating LLMs on open-ended mental health QA with two components: COUNSELBENCH-EVAL, a clinician-annotated set of 2,000 expert evaluations of answers to real patient questions, each rated across six clinical dimensions, and COUNSELBENCH-ADV, an adversarial set of 120 clinician-authored questions designed to trigger specific model weaknesses. Drawing on 100 licensed mental health professionals, the study finds that although models such as LLaMA-3.3 achieve high scores on several dimensions, safety concerns such as unauthorized medical advice persist. It also shows that LLM judges systematically overrate model outputs and miss safety issues that human experts flag, so human evaluation remains more reliable than automated judging in this high-stakes domain. Adversarial probing uncovers model-family-specific failure patterns (e.g., speculation about symptoms, apathy, or judgmental tones) and shows that few-shot prompts offer limited mitigation. By releasing the COUNSELBENCH-EVAL data and COUNSELBENCH-ADV prompts, the authors provide resources for improving alignment, building safety detectors, and auditing mental health QA systems in real-world deployments.

Abstract

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts. To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Evaluation of 3,240 responses from nine LLMs reveals consistent, model-specific failure patterns. Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.

Paper Structure

This paper contains 31 sections, 6 figures, and 14 tables.

Figures (6)

  • Figure 1: Overview of the CounselBench benchmark. COUNSELBENCH-EVAL (left) includes expert evaluation of LLM and human responses to real counseling questions. COUNSELBENCH-ADV (right) includes adversarial questions authored by clinicians to target identified LLM failure modes. See the appendix on annotator demographics for license/degree types and specialization areas.
  • Figure 2: Distribution of (A) credential types and (B) counseling experience among the 100 annotators.
  • Figure 3: Average evaluation scores across six dimensions (subplots) for counseling responses generated by GPT-4, LLaMA-3.3, Gemini-1.5-Pro, and online human therapists (x-axis in each subplot). Each colored line represents one evaluator, including nine LLM-based judges and human experts (red). Higher values indicate better performance except for Toxicity and Medical Advice. See the table of LLM-judge numeric scores for full numerical results.
  • Figure 4: Survey interface: annotators read a user post and one response (left) and rate the response on criteria with Likert scales and text-evidence boxes (right).
  • Figure 5: Ethnicity (left) and gender distribution (right) of the 100 mental-health professional annotators.
  • ...and 1 more figure