CARE-Bench: A Benchmark of Diverse Client Simulations Guided by Expert Principles for Evaluating LLMs in Psychological Counseling

Bichen Wang, Yixin Sun, Junzhe Wang, Hao Yang, Xing Fu, Yanyan Zhao, Si Wei, Shijin Wang, Bing Qin

TL;DR

CARE-Bench targets the global gap in psychological counseling by providing a rigorous benchmark for evaluating LLMs in clinical-like dialogues. It combines diverse, real-world client profiles with expert-guided, multi-turn simulations and evaluates performance across three established scales: therapeutic relationship (WAI), empathic understanding (BLRI), and counseling skills (CCS-R). The study demonstrates that current models share common weaknesses, especially with certain personality traits and demographics, and it analyzes the root causes with expert commentary to guide future developments. While grounded in a Chinese context, CARE-Bench offers a scalable framework for broader adoption and cross-cultural adaptation with appropriate safeguards.

Abstract

The mismatch between the growing demand for psychological counseling and the limited availability of services has motivated research into the application of Large Language Models (LLMs) in this domain. Consequently, there is a need for a robust and unified benchmark to assess the counseling competence of various LLMs. Existing works, however, are limited by unprofessional client simulation, static question-and-answer evaluation formats, and unidimensional metrics. These limitations hinder their effectiveness in assessing a model's comprehensive ability to handle diverse and complex clients. To address this gap, we introduce CARE-Bench, a dynamic and interactive automated benchmark. It is built upon diverse client profiles derived from real-world counseling cases and simulated according to expert guidelines. CARE-Bench provides a multidimensional performance evaluation grounded in established psychological scales. Using CARE-Bench, we evaluate several general-purpose LLMs and specialized counseling models, revealing their current limitations. In collaboration with psychologists, we conduct a detailed analysis of the reasons for LLMs' failures when interacting with clients of different types, which provides directions for developing more comprehensive, universal, and effective counseling models.

Paper Structure

This paper contains 33 sections, 7 figures, 8 tables.

Figures (7)

  • Figure 1: A comparison between CARE-Bench and previous benchmarks. CARE-Bench features more diverse client profiles and employs an expert-guided client simulation that engages in dynamic multi-turn interactions with counselor models. It adopts a multidimensional evaluation by selecting scales across therapeutic relationship, empathic understanding, and counseling skills.
  • Figure 2: Topic distribution of CARE-Bench. The number for each topic represents the case count.
  • Figure 3: The generalizability results for representative models of three different types, which are grouped by the Big Five Personality Traits categorized as "High" or "Low". The letters on the x-axis correspond to the five dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. The y-axis indicates the models' average counseling score per item. Significance tests are conducted for all results (*: p < 0.05, **: p < 0.01, ***: p < 0.001).
  • Figure 4: The interface for collecting simulation principles includes three main sections: the left panel displays the client profile, usage instructions, and collected principles; the center panel presents the dialogue between the psychologist and the simulated client; and the right panel provides a space for the psychologist to give feedback.
  • Figure 5: The generalizability results grouped by Counseling Topic. "Rel." stands for "Relationship".
  • ...and 2 more figures
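
The grouping described in the Figure 3 caption (average counseling score per item, split by a "High" or "Low" level of a personality trait) can be illustrated with a minimal sketch. The record layout, field names, and scores below are hypothetical and are not taken from the paper; the sketch shows only the per-group averaging, not the significance tests the caption mentions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records, one per simulated counseling session. The field names
# and the scores are illustrative assumptions, not data from CARE-Bench.
sessions = [
    {"trait": "Extraversion", "level": "High", "item_scores": [4, 5, 4]},
    {"trait": "Extraversion", "level": "High", "item_scores": [5, 4, 3]},
    {"trait": "Extraversion", "level": "Low", "item_scores": [3, 2, 4]},
    {"trait": "Extraversion", "level": "Low", "item_scores": [2, 3, 4]},
]


def average_score_per_item(records):
    """Return {(trait, level): mean of the per-session average item scores}."""
    grouped = defaultdict(list)
    for record in records:
        # Average over scale items first, so that sessions rated on scales
        # with different item counts stay comparable.
        key = (record["trait"], record["level"])
        grouped[key].append(mean(record["item_scores"]))
    return {key: mean(values) for key, values in grouped.items()}


group_means = average_score_per_item(sessions)
for (trait, level), value in sorted(group_means.items()):
    print(f"{trait} ({level}): {value:.2f}")
```

Averaging within a session before averaging across sessions is one possible design choice; a different aggregation order would give different numbers when sessions contain unequal item counts.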