Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models

Joy He-Yueya; Wanjing Anya Ma; Kanishk Gandhi; Benjamin W. Domingue; Emma Brunskill; Noah D. Goodman

TL;DR

This work tackles the challenge of using language models to mimic human knowledge distributions rather than simply producing correct answers. It introduces psychometric alignment, a metric based on Item Response Theory that quantifies how well LM-derived item difficulties align with human item difficulties by correlating the difficulty ($b$) parameters estimated for the two populations. The approach is validated via psychometric simulations and applied to three real-world domains (Eedi, WordBank, Duolingo), revealing substantial misalignment that can be mitigated through persona-based prompting and domain-specific fine-tuning, with smaller models sometimes outperforming larger ones. The results highlight the importance of distribution-focused evaluation for LM-based population simulations and provide practical guidance for improving alignment in educational and policy-relevant applications. The metric can also help diagnose representation gaps and inform data collection and model adaptation to better reflect human populations.

Abstract

Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making, such as in developing educational materials and designing public policies. The objective of these simulations is for LMs to capture the variations in human responses, rather than merely providing the expected correct answers. Prior work has shown that LMs often generate unrealistically accurate responses, but there are no established metrics to quantify how closely the knowledge distribution of LMs aligns with that of humans. To address this, we introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution. Assessing this alignment involves collecting responses from both LMs and humans to the same set of test items and using Item Response Theory to analyze the differences in item functioning between the groups. We demonstrate that our metric can capture important variations in populations that traditional metrics, like differences in accuracy, fail to capture. We apply this metric to assess existing LMs for their alignment with human knowledge distributions across three real-world domains. We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment. Interestingly, smaller LMs tend to achieve greater psychometric alignment than larger LMs. Further, training LMs on human response data from the target distribution enhances their psychometric alignment on unseen test items, but the effectiveness of such training varies across domains.
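The procedure described above (collect responses from both populations on the same items, fit an IRT model to each, compare item functioning) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a 1PL (Rasch) model, fits it with a simple regularized joint maximum likelihood loop, and reports the Pearson correlation between the two sets of difficulty estimates. The names `simulate_responses`, `fit_1pl_difficulty`, and `psychometric_alignment`, as well as the learning rate and regularization values, are hypothetical choices made for this example.

```python
import numpy as np


def simulate_responses(theta, b, rng):
    """Hypothetical helper: draw binary responses from a 1PL model,
    where P(correct) = sigmoid(theta_i - b_j)."""
    p_correct = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    return (rng.random(p_correct.shape) < p_correct).astype(float)


def fit_1pl_difficulty(responses, n_iter=1000, lr=2.0, l2=0.01):
    """Estimate 1PL item difficulties from a persons-by-items 0/1 matrix.

    Assumption: regularized joint maximum likelihood via gradient ascent,
    chosen for brevity; a dedicated IRT package would normally be used.
    """
    responses = np.asarray(responses, dtype=float)
    n_persons, n_items = responses.shape
    theta = np.zeros(n_persons)  # person abilities
    b = np.zeros(n_items)        # item difficulties
    for _ in range(n_iter):
        p_correct = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = responses - p_correct  # d(log-likelihood)/d(logit)
        # Averaged gradients keep the step size independent of matrix shape;
        # the small L2 penalty keeps all-correct or all-wrong rows finite.
        theta += lr * (resid.mean(axis=1) - l2 * theta)
        b += lr * (-resid.mean(axis=0) - l2 * b)
        # Fix the location of the scale: center difficulties at zero and
        # shift abilities by the same amount so the logits are unchanged.
        shift = b.mean()
        b -= shift
        theta -= shift
    return b


def psychometric_alignment(human_responses, lm_responses):
    """Pearson correlation between human and LM item difficulties."""
    b_human = fit_1pl_difficulty(human_responses)
    b_lm = fit_1pl_difficulty(lm_responses)
    return float(np.corrcoef(b_human, b_lm)[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b_true = np.linspace(-2.0, 2.0, 20)
    human = simulate_responses(rng.normal(size=300), b_true, rng)
    lm_same = simulate_responses(rng.normal(size=300), b_true, rng)
    lm_reversed = simulate_responses(rng.normal(size=300), b_true[::-1], rng)
    print("same item difficulties:    ", psychometric_alignment(human, lm_same))
    print("reversed item difficulties:", psychometric_alignment(human, lm_reversed))
```

In this toy setup, the two simulated LM populations have the same ability distribution, so their overall accuracy is about the same, yet the population whose items are hard where the human items are easy receives a strongly negative score. That is the kind of difference an accuracy comparison alone would miss.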

Paper Structure (22 sections, 2 equations, 11 figures, 3 tables)

Introduction
Related work
Measuring psychometric alignment
Item Response Theory
Psychometric alignment metric
Datasets
The importance of psychometric alignment
Prompting-based ensemble
Control conditions
Human (positive control)
Random (negative control)
Ensembling different LMs
Persona-based prompting
Fine-tuning LMs on student response data
Limitations
...and 7 more sections

Figures (11)

Figure 1: An example of a question from the Eedi dataset.
Figure 2: Ensembling different LMs does not generate an LM population that captures the distribution of knowledge in the human population from the Eedi dataset. The error bars indicate the standard deviation.
Figure 3: Person accuracy distribution and item accuracy distribution of the Eedi data (panel `fig:human_dist`). We generate a synthetic population by randomly shuffling responses within each person (panel `fig:synthetic_match_person_dist`).
Figure 4: Some items that are easy for humans are hard for the Synthetic population, indicating that even when two populations show similar overall score distributions, they might possess distinct latent abilities and respond differently to the same questions. There is no significant correlation between the difficulty (1PL) parameters of the two populations (Pearson $r=0.07, p > 0.05$).
Figure 5: Example training data.
...and 6 more figures
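
The synthetic population mentioned in the Figure 3 caption is built by shuffling responses within each person. A minimal sketch of that operation, assuming a persons-by-items 0/1 response matrix, is given below; the function name `shuffle_within_person` and the `seed` parameter are hypothetical and not taken from the paper.

```python
import numpy as np


def shuffle_within_person(responses, seed=0):
    """Independently permute each person's (row's) responses across items.

    Each person keeps the same number of correct answers, but which items
    they answered correctly is randomized.
    """
    responses = np.asarray(responses)
    rng = np.random.default_rng(seed)
    # Sorting random keys row by row yields an independent permutation per row.
    order = rng.random(responses.shape).argsort(axis=1)
    return np.take_along_axis(responses, order, axis=1)


if __name__ == "__main__":
    human = np.array([
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 0],
    ])
    synthetic = shuffle_within_person(human, seed=1)
    print(human.sum(axis=1), synthetic.sum(axis=1))  # per-person scores match
```

Because only the positions of the responses change, every person's score is preserved while the link between a response and its item is broken. This is consistent with the Figure 4 caption, which reports similar overall score distributions but no significant correlation between the 1PL difficulty parameters of the two populations.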
