Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality

Jana Jung, Marlene Lutz, Indira Sen, Markus Strohmaier

TL;DR

The paper interrogates whether human psychometric tests meaningfully assess large language models by examining reliability and validity across sexism, racism, and morality. It implements a systematic framework incorporating alternate forms, option-order variations, and downstream-task ecological validity, finding moderate reliability but consistently low ecological validity, with test scores often failing to predict real-world model behavior. Convergent validity shows theory-consistent inter-test relationships, yet ecological correlations are weak or negative, sometimes increasing with model size. The results argue that psychometric tests designed for humans cannot be directly transferred to LLMs without adaptation, highlighting the need for LLM-specific assessment tools and careful validation practices. The study provides open-source materials and a roadmap for developing behavior-focused, theory-grounded evaluation methods for LLMs.

Abstract

Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests -- originally developed for humans -- yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluation of psychometric tests is essential before interpreting their scores. They also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.

Paper Structure

This paper contains 43 sections, 4 equations, 14 figures, 16 tables.

Figures (14)

  • Figure 1: Validating Psychometric Tests for LLMs. We investigate the reliability and validity of psychometric tests for LLMs, including ecological validity, i.e., the alignment between an LLM's responses to test items (e.g., for sexism) and its behavior on a real-world downstream task (e.g., writing recommendation letters).
  • Figure 2: Reliability evaluation. We report answer consistency (i.e., the proportion of unchanged responses) across prompt variations including: (a) alternate forms, (b) reversed answer option order, and (c) changed end-of-sentence. In (a), reliability is considered acceptable if consistency across all seeds falls within the human distribution. We find that most models achieve consistency comparable to humans, with only some falling outside this range for the SR2K, indicating satisfactory reliability. In (b) and (c), higher consistency is better. We observe that the consistency for reversed answer option order (b) is notably lower than 1.0 for most LLMs, indicating low reliability. In contrast, the consistency for changed end-of-sentence (c) is mostly above 0.75 and stable across tests and seeds.
  • Figure 3: Ecological Validity Evaluation. We show Spearman’s rank correlation between psychometric test results and downstream task behavior for sexism (a), racism (b), and the moral foundation purity (c). For each model, we calculate the mean test score and the mean downstream task score across all seeds. Models are then ranked by their scores, with those exhibiting higher levels of the construct (e.g., sexism) ranked at the top, i.e., a model with rank 1 in Figure 3a is more sexist than the model at rank 2. We find negative or weak positive correlations for all constructs, indicating that test scores do not reflect actual LLM behavior.
  • Figure 4: Prompt template. Instruction, item, answer options, and end-of-sentence (EOS; ":" vs. "?") are filled with the corresponding content depending on test, item, and prompt variation.
  • Figure 5: ASI score distribution. The bars represent the variation across the five random seeds.
  • ...and 9 more figures
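
The two quantities named in the captions for Figures 2 and 3 can be illustrated with a minimal sketch. This is not the authors' code: the function names, model names, and all numbers below are hypothetical and chosen only for illustration. It assumes answer consistency is the share of items with an identical answer under two prompt variants, and that ecological validity is Spearman's rank correlation between per-model mean test scores and per-model mean downstream task scores, with the highest score receiving rank 1.

```python
from statistics import mean


def answer_consistency(baseline, variant):
    """Proportion of items whose answer is unchanged between two prompt variants."""
    if len(baseline) != len(variant) or not baseline:
        raise ValueError("need two non-empty answer lists of equal length")
    return sum(a == b for a, b in zip(baseline, variant)) / len(baseline)


def average_ranks(values):
    """1-based ranks where the highest value gets rank 1; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-seed scores for three made-up models (two seeds each).
test_scores = {"model_a": [2.1, 2.3], "model_b": [3.0, 3.2], "model_c": [1.0, 1.2]}
task_scores = {"model_a": [0.5, 0.7], "model_b": [0.1, 0.3], "model_c": [0.8, 1.0]}

models = sorted(test_scores)
mean_test = [mean(test_scores[m]) for m in models]  # mean over seeds, per model
mean_task = [mean(task_scores[m]) for m in models]

consistency = answer_consistency([1, 2, 3, 4], [1, 2, 4, 4])
rho = spearman(mean_test, mean_task)
print(f"consistency={consistency:.2f}, rho={rho:.2f}")
```

In this toy data the model that scores highest on the test scores lowest on the downstream task, so the rank correlation is negative. It mirrors the kind of mismatch the Figure 3 caption describes and says nothing about the paper's actual values.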