Psychometric Alignment: Capturing Human Knowledge Distributions via Language Models
Joy He-Yueya, Wanjing Anya Ma, Kanishk Gandhi, Benjamin W. Domingue, Emma Brunskill, Noah D. Goodman
TL;DR
This work addresses the challenge of using language models (LMs) to mimic human knowledge distributions rather than simply produce correct answers. It introduces psychometric alignment, a metric based on Item Response Theory (IRT) that quantifies how well LM-derived item difficulties match human item difficulties by correlating the difficulty ($b$) parameters estimated from each population's responses to the same test items. The approach is validated via psychometric simulations and applied to three real-world domains (Eedi, WordBank, Duolingo). It reveals substantial misalignment, which can be reduced through persona-based prompting and, to a degree that varies by domain, through fine-tuning on human response data from the target distribution; smaller models tend to achieve higher alignment than larger ones. The results highlight the importance of distribution-focused evaluation for LM-based population simulations and offer practical guidance for improving alignment in educational and policy-relevant applications. The metric can also help diagnose representation gaps and inform data collection and model adaptation so that LMs better reflect human populations.
Abstract
Language models (LMs) are increasingly used to simulate human-like responses in scenarios where accurately mimicking a population's behavior can guide decision-making, such as in developing educational materials and designing public policies. The objective of these simulations is for LMs to capture the variations in human responses, rather than merely providing the expected correct answers. Prior work has shown that LMs often generate unrealistically accurate responses, but there are no established metrics to quantify how closely the knowledge distribution of LMs aligns with that of humans. To address this, we introduce "psychometric alignment," a metric that measures the extent to which LMs reflect human knowledge distribution. Assessing this alignment involves collecting responses from both LMs and humans to the same set of test items and using Item Response Theory to analyze the differences in item functioning between the groups. We demonstrate that our metric can capture important variations in populations that traditional metrics, like differences in accuracy, fail to capture. We apply this metric to assess existing LMs for their alignment with human knowledge distributions across three real-world domains. We find significant misalignment between LMs and human populations, though using persona-based prompts can improve alignment. Interestingly, smaller LMs tend to achieve greater psychometric alignment than larger LMs. Further, training LMs on human response data from the target distribution enhances their psychometric alignment on unseen test items, but the effectiveness of such training varies across domains.
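To make the idea concrete, the following is a minimal sketch of an alignment score computed from two binary response matrices (respondents by items), one from humans and one from LM-simulated respondents. It is not the authors' implementation: instead of fitting an IRT model, it assumes a crude difficulty proxy (the negative logit of each item's smoothed proportion correct) and then takes the Pearson correlation of the two difficulty vectors. The function names and the toy data are hypothetical and chosen only for illustration.

```python
import math


def item_difficulty_proxy(responses):
    """Estimate one difficulty per item from a 0/1 response matrix.

    ASSUMPTION: this is a stand-in for IRT difficulty estimation, not a
    fitted IRT model. Difficulty is the negative logit of the smoothed
    proportion of respondents who answered the item correctly.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    difficulties = []
    for j in range(n_items):
        correct = sum(row[j] for row in responses)
        # Add-one-half smoothing keeps the logit finite at 0% or 100% correct.
        p = (correct + 0.5) / (n_people + 1.0)
        difficulties.append(-math.log(p / (1.0 - p)))
    return difficulties


def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def alignment_proxy(human_responses, lm_responses):
    """Correlate human and LM item difficulties on the same items."""
    return pearson(
        item_difficulty_proxy(human_responses),
        item_difficulty_proxy(lm_responses),
    )


def mean_accuracy(responses):
    """Overall fraction of correct responses across all respondents and items."""
    total = sum(sum(row) for row in responses)
    return total / (len(responses) * len(responses[0]))


# Hypothetical toy data: 4 respondents x 3 items, 1 = correct.
human = [[1, 1, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]]
# Same overall accuracy as `human`, but the easy and hard items are swapped.
lm_reversed = [[0, 1, 1], [0, 0, 1], [0, 1, 1], [1, 0, 1]]

print(mean_accuracy(human), mean_accuracy(lm_reversed))
print(alignment_proxy(human, human))
print(alignment_proxy(human, lm_reversed))
```

In this toy case the two populations have identical overall accuracy, yet the difficulty correlation is negative because they disagree on which items are hard. A faithful version would replace `item_difficulty_proxy` with difficulty parameters from a fitted IRT model.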
