Evaluating Large Language Models with Psychometrics

Yuan Li, Yue Huang, Hongyi Wang, Ying Cheng, Xiangliang Zhang, James Zou, Lichao Sun

TL;DR

The paper presents a comprehensive psychometric benchmark for quantifying five psychological constructs in large language models using 13 diverse datasets. It articulates a three-part framework of dimension identification, dataset design, and results validation, and evaluates nine popular LLMs on tasks covering personality, values, emotional intelligence, theory of mind, and self-efficacy. Key findings reveal inconsistencies between models' self-reported traits and their responses in realistic scenarios, variable reliability across item types, and notable sensitivity to prompts and adversarial inputs. The work advances reliable evaluation of AI systems in social-science contexts and informs responsible deployment of LLM-based assistants. It also outlines practical implications for AI safety, trust, and interdisciplinary research in psychology and the social sciences.

Abstract

Large Language Models (LLMs) have demonstrated exceptional capabilities in solving various tasks, progressively evolving into general-purpose assistants. The increasing integration of LLMs into society has sparked interest in whether they exhibit psychological patterns, and whether these patterns remain consistent across different contexts -- questions that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a comprehensive benchmark for quantifying psychological constructs of LLMs, encompassing psychological dimension identification, assessment dataset design, and assessment with results validation. Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets featuring diverse scenarios and item types. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors. Our findings also show that some preference-based tests, originally designed for humans, could not solicit reliable responses from LLMs. This paper offers a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and social sciences.
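The results-validation stage mentioned above includes checking whether a model's scale responses are reliable at all. The sketch below is not the authors' code; it is a minimal, hypothetical illustration of one standard reliability check, Cronbach's alpha, computed over repeated runs of a model answering Likert-style items (the response matrix is made up for illustration).

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (runs x items) matrix of Likert-scale scores."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)      # per-item variance across runs
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of summed scale scores
    return (n_items / (n_items - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical data: 5 repeated runs of one model answering 4 items on a 1-5 scale.
responses = np.array([
    [3, 3, 3, 3],
    [4, 4, 4, 4],
    [5, 5, 5, 5],
    [4, 4, 5, 4],
    [3, 3, 3, 4],
])
# Values near 1 indicate the model answers the scale's items consistently;
# low values flag the kind of unreliable responding the paper reports for
# some preference-based tests.
print(f"Cronbach's alpha: {cronbach_alpha(responses):.2f}")
```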

Paper Structure

This paper contains 35 sections, 11 equations, 8 figures, and 34 tables.

Figures (8)

  • Figure 1: Overview of Our Psychometrics Benchmark for Large Language Models.
  • Figure 2: BFI and vignette test scores of Mixtral-8x7B under naive prompts (left) and role-playing prompts (right). The responses on the Neuroticism aspect are shown in the text boxes.
  • Figure 3: Heatmaps of averaged personality scores for the BFI and vignette test under different prompts. $\text{P}^2$ means personality prompts, $\neg \text{P}^2$ means reverse personality prompts.
  • Figure 4: Results of Human-Centered Values survey, including regular and adversarial versions.
  • Figure 5: The confidence level in the LLM Self-Efficacy questionnaire and HoneSet dataset for GPT-4 (left) and Mixtral-8x7B (right).
  • ...and 3 more figures