Do Large Language Models Align with Core Mental Health Counseling Competencies?
Viet Cuong Nguyen, Mohammad Taher, Dongwan Hong, Vinicius Konkolics Possobom, Vibha Thirunellayi Gopalakrishnan, Ekta Raj, Zihang Li, Heather J. Soled, Michael L. Birnbaum, Srijan Kumar, Munmun De Choudhury
TL;DR
This paper introduces CounselingBench, a large-scale benchmark based on the National Clinical Mental Health Counseling Examination (NCMHCE) and designed to evaluate large language models (LLMs) against five core mental health counseling competencies. It systematically benchmarks 22 LLMs (generalist and medical-specialized) under zero-shot, few-shot, and chain-of-thought prompting, and finds that frontier models can exceed the average passing threshold ($63\%$) but fall well short of expert-level performance (roughly $90\%$). Medical LLMs rarely outperform their generalist counterparts in accuracy; they sometimes produce marginally better justifications but commit more context-related errors, which points to a mismatch between biomedical fine-tuning and broad counseling tasks. The results highlight persistent challenges in encoding empathy and nuanced clinical reasoning into AI, and argue for specialized, carefully fine-tuned models with robust human oversight before real-world deployment. CounselingBench and its data and code are available at https://github.com/cuongnguyenx/CounselingBench to facilitate ongoing research and standardization in AI-assisted mental health practice.
Abstract
The rapid evolution of Large Language Models (LLMs) presents a promising solution to the global shortage of mental health professionals. However, their alignment with essential counseling competencies remains underexplored. We introduce CounselingBench, a novel benchmark based on the National Clinical Mental Health Counseling Examination (NCMHCE) that evaluates 22 general-purpose and medically fine-tuned LLMs across five key competencies. While frontier models surpass minimum aptitude thresholds, they fall short of expert-level performance, excelling in Intake, Assessment & Diagnosis but struggling with Core Counseling Attributes and Professional Practice & Ethics. Surprisingly, medical LLMs do not outperform generalist models in accuracy, though they provide slightly better justifications while making more context-related errors. These findings highlight the challenges of developing AI for mental health counseling, particularly in competencies requiring empathy and nuanced reasoning. Our results underscore the need for specialized, fine-tuned models aligned with core mental health counseling competencies and supported by human oversight before real-world deployment. Code and data associated with this manuscript can be found at: https://github.com/cuongnguyenx/CounselingBench
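
As an illustration of how per-competency results could be compared against the reference points quoted above, the following minimal sketch computes accuracy per competency and places it relative to the $63\%$ passing threshold and the roughly $90\%$ expert level. It is not taken from the CounselingBench repository: the record format `(competency, predicted, gold)` and the function names `accuracy_by_competency` and `band` are assumptions made only for this example.

```python
from collections import defaultdict

# Reference points quoted in the TL;DR; the expert level is approximate.
PASSING_THRESHOLD = 0.63
EXPERT_LEVEL = 0.90


def accuracy_by_competency(records):
    """Return accuracy per competency.

    `records` is a hypothetical iterable of (competency, predicted, gold)
    tuples, one per multiple-choice question.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for competency, predicted, gold in records:
        total[competency] += 1
        correct[competency] += int(predicted == gold)
    return {c: correct[c] / total[c] for c in total}


def band(accuracy):
    """Place an accuracy value relative to the two reference points."""
    if accuracy >= EXPERT_LEVEL:
        return "expert-level"
    if accuracy >= PASSING_THRESHOLD:
        return "passing"
    return "below passing"


if __name__ == "__main__":
    # Toy answers, invented purely to exercise the functions.
    toy_records = [
        ("Intake, Assessment & Diagnosis", "A", "A"),
        ("Intake, Assessment & Diagnosis", "B", "B"),
        ("Core Counseling Attributes", "A", "C"),
        ("Core Counseling Attributes", "D", "D"),
    ]
    for name, acc in accuracy_by_competency(toy_records).items():
        print(f"{name}: {acc:.2f} ({band(acc)})")
```

Reporting accuracy per competency rather than as one aggregate number matches the way the findings are stated: a model can clear the passing threshold overall while still lagging on a specific competency.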
