CPsyExam: A Chinese Benchmark for Evaluating Psychology using Examinations

Jiahao Zhao; Jingwei Zhu; Minghuan Tan; Min Yang; Renhao Li; Di Yang; Chenhao Zhang; Guancheng Ye; Chengming Li; Xiping Hu; Derek F. Wong

TL;DR

CPsyExam introduces a comprehensive Chinese psychology benchmark derived from four national exam systems, partitioned into Knowledge (KG) and Case Analysis (CA) to simultaneously assess theoretical understanding and real-world application. Built from 22k questions and distilled to 4k for evaluation and SFT, the dataset emphasizes balanced subject coverage and diverse formats (SCQ, MAQ, QA) across KG and CA tasks. Across open-source, psychology-focused, and proprietary LLMs, the results show limited gains from domain-specific fine-tuning on psychological content, with GPT-4 and certain CA-focused prompts leading performance in knowledge and case reasoning respectively. CPsyExam demonstrates strong utility for evaluating and guiding improvements in LLM psychology understanding and reasoning, and provides SFT data to enhance model competence in Chinese psychology.

Abstract

In this paper, we introduce a novel psychological benchmark, CPsyExam, constructed from questions sourced from Chinese language examinations. CPsyExam is designed to prioritize psychological knowledge and case analysis separately, recognizing the significance of applying psychological knowledge to real-world scenarios. From a pool of 22k questions, we use 4k to create a benchmark that offers balanced coverage of subjects and incorporates a diverse range of case analysis techniques. Furthermore, we evaluate a range of existing large language models (LLMs), spanning from open-source to API-based models. Our experiments and analysis demonstrate that CPsyExam serves as an effective benchmark for enhancing the understanding of psychology within LLMs and enables the comparison of LLMs across various granularities.

Paper Structure (43 sections, 8 figures, 5 tables)

Introduction
Related Work
Psychology examination for humans
Benchmarks of Large Language Models
CPsyExam Benchmark
Design Principles
Comprehensive and Balanced
Assessing Multi-capability
Diverse Question Formats
Data Preparation
The Chinese Examination System Including Psychology Subjects
Data Collection
Data preprocessing
Taxonomy of CPsyExam
CPsyExam-KG task
...and 28 more sections

Figures (8)

Figure 1: Overview of dataset constructing pipeline.
Figure 2: Examples for questions on CPsyExam-SCQ and CPsyExam-MAQ.
Figure 3: Performance over SCQ from different perspectives for all LLMs.
Figure 4: Comparison of ChatGLM-Turbo and GPT-4 across different subjects. The bars are sorted in ascending order based on GPT-4's performance on each subject.
Figure 5: Prompt used for evaluation (expert).
...and 3 more figures
