ConceptPsy: A Benchmark Suite with Conceptual Comprehensiveness in Psychology

Junlei Zhang, Hongliang He, Nirui Song, Zhanchao Zhou, Shuyuan He, Shuai Zhang, Huachuan Qiu, Anqi Li, Yong Dai, Lizhi Ma, Zhenzhong Lan

TL;DR

ConceptPsy introduces a concept-driven, Chinese psychology benchmark to address concept bias in existing MMLU-style tests. It builds 12 psychology subjects with 1383 concepts, generates 4573 questions via GPT-4, and has them reviewed by professional psychologists with chapter-level annotations. The results reveal substantial variation across chapters and concepts, with GPT-4 outperforming humans on average but Chinese models showing math and reasoning weaknesses in psychology contexts. The benchmark provides a practical tool to diagnose weaknesses and guide development of Chinese psychology-focused knowledge and reasoning abilities in large-scale language models.

Abstract

The critical field of psychology necessitates a comprehensive benchmark to enhance the evaluation and development of domain-specific Large Language Models (LLMs). Existing MMLU-type benchmarks, such as C-EVAL and CMMLU, include psychology-related subjects, but their limited number of questions and lack of a systematic concept sampling strategy mean they cannot cover the concepts required in psychology. Consequently, despite their broad subject coverage, these benchmarks lack the necessary depth in the psychology domain, making them inadequate as a psychology-specific evaluation suite. To address this issue, this paper presents ConceptPsy, designed to evaluate Chinese-language complex reasoning and knowledge abilities in psychology. ConceptPsy includes 12 core subjects and 1383 manually collected concepts. Specifically, we prompt GPT-4 to generate questions for each concept using carefully designed, diverse prompts and hire professional psychologists to review these questions. To help understand fine-grained performance and address weaknesses, we annotate each question with a chapter label and provide chapter-wise accuracy. Based on ConceptPsy, we evaluate a broad range of LLMs. We observe that, although some LLMs achieve similar overall accuracy, they exhibit significant performance variations across different psychology concepts, even when they are models from the same series. We hope our work can facilitate the development of LLMs in the field of psychology.
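
The chapter-wise accuracy mentioned in the abstract can be illustrated with a minimal sketch. The record layout `(chapter, predicted, gold)`, the function name `chapter_accuracy`, and the toy chapter names and answers are assumptions made for illustration; they are not the benchmark's actual data format or results.

```python
from collections import defaultdict


def chapter_accuracy(records):
    """Return accuracy per chapter label.

    `records` is an iterable of (chapter, predicted, gold) tuples.
    This layout is a hypothetical one chosen for the sketch.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for chapter, predicted, gold in records:
        totals[chapter] += 1
        if predicted == gold:
            correct[chapter] += 1
    return {chapter: correct[chapter] / totals[chapter] for chapter in totals}


# Toy, made-up records: each question carries a chapter label.
records = [
    ("Hypothesis testing", "A", "A"),
    ("Hypothesis testing", "B", "C"),
    ("Descriptive statistics", "D", "D"),
    ("Descriptive statistics", "A", "A"),
]

per_chapter = chapter_accuracy(records)
overall = sum(pred == gold for _, pred, gold in records) / len(records)
```

In this toy data the overall accuracy hides the gap between the two chapters, which is the kind of difference a chapter-level breakdown is meant to expose.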

Paper Structure (31 sections, 15 figures, 12 tables)

Figures (15)

  • Figure 1: GPT-3.5-Turbo's concept-wise performance on Psychological Statistics. The x-axis represents the sequence of concepts, arranged in the order they appear in the textbook. The dashed circles represent sampled question sets. Depending on which concepts are sampled, a question set can give a misleading picture of a model's ability.
  • Figure 2: Diagram overview of concepts in ConceptPsy. We sample questions based on the requirements of the National Postgraduate Entrance Examination in China. Each question is tagged with a modified chapter name, which serves as the chapter-level concept and is used to report chapter-level accuracy.
  • Figure 3: Examples of concepts. We define a "concept" as a fundamental unit of understanding that encapsulates specific knowledge within a broader field of study.
  • Figure 4: Overview of Our Concept-Driven Framework. We collect relevant concepts based on the requirements of corresponding examinations. To diversify the types of questions, we summarize three question patterns from these exams and design specific prompts for each type. Questions are then generated using GPT-4. Subsequently, we hire professional psychological counselors to review the questions for accuracy and relevance.
  • Figure 5: An example of an annotator assigning a suitable prompt to a concept. For the concept "random error", we collect multiple descriptions. The appropriate prompt is assigned based on the type of description provided.
  • ...and 10 more figures
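
The sampling issue described in the Figure 1 caption can be sketched with toy numbers. The per-concept outcomes below are hypothetical values invented for this example, not results from the paper; the point is only that question sets drawn from different parts of the concept sequence can yield very different accuracy estimates for the same model.

```python
def accuracy(correct_flags):
    """Fraction of correct answers in a list of 0/1 outcomes."""
    return sum(correct_flags) / len(correct_flags)


# Hypothetical per-concept outcomes in textbook order (1 = correct, 0 = wrong).
concept_results = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1]

# Two question sets that each cover only a narrow window of concepts.
early_sample = concept_results[0:4]
late_sample = concept_results[6:10]

early_acc = accuracy(early_sample)
late_acc = accuracy(late_sample)
full_acc = accuracy(concept_results)
```

Here the early window suggests a perfect model and the late window a weak one, while accuracy over all concepts falls in between.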