Evaluating Large Language Models with Psychometrics
Yuan Li, Yue Huang, Hongyi Wang, Ying Cheng, Xiangliang Zhang, James Zou, Lichao Sun
TL;DR
The paper presents a comprehensive psychometric benchmark for quantifying five psychological constructs in large language models using 13 diverse datasets. It articulates a three-part framework -- psychological dimension identification, assessment dataset design, and results validation -- and evaluates nine popular LLMs on tasks covering personality, values, emotional intelligence, theory of mind, and self-efficacy. Key findings reveal inconsistencies between self-reported traits and responses in real-world-like scenarios, variable reliability across item types, and notable sensitivity to prompts and adversarial inputs. The work advances reliable evaluation of AI systems for social science contexts and informs the responsible deployment of LLM-based assistants, with practical implications for AI safety, trust, and interdisciplinary research in psychology and the social sciences.
Abstract
Large Language Models (LLMs) have demonstrated exceptional capabilities in solving various tasks, progressively evolving into general-purpose assistants. The increasing integration of LLMs into society has sparked interest in whether they exhibit psychological patterns, and whether these patterns remain consistent across different contexts -- questions that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a comprehensive benchmark for quantifying psychological constructs of LLMs, encompassing psychological dimension identification, assessment dataset design, and assessment with results validation. Our work identifies five key psychological constructs -- personality, values, emotional intelligence, theory of mind, and self-efficacy -- assessed through a suite of 13 datasets featuring diverse scenarios and item types. We uncover significant discrepancies between LLMs' self-reported traits and their response patterns in real-world scenarios, revealing complexities in their behaviors. Our findings also show that some preference-based tests, originally designed for humans, could not elicit reliable responses from LLMs. This paper offers a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and the social sciences.
