Quantifying Data Contamination in Psychometric Evaluations of LLMs
Jongwook Han, Woojung Song, Jonggeun Lee, Yohan Jo
TL;DR
The paper investigates how reliable psychometric evaluations are when applied to Large Language Models (LLMs), given that the inventories may have appeared in training data. It introduces a framework that quantifies contamination along three facets (item memorization, evaluation memorization, and target score matching) and applies it to 21 models and four inventories (BFI-44, PVQ-40, MFQ, SD-3). The study provides systematic evidence that LLMs not only memorize inventory items and scoring rules but can also adjust their responses to reach target scores; contamination is stronger for PVQ-40 and BFI-44 and for larger models. These findings underscore the need for contamination-aware evaluation protocols and bear on how psychometric assessments of LLMs are interpreted and how future benchmarks in computational psychology are designed. The framework uses $AED$, $F1$, and $MAE$ to quantify verbatim memorization, item-dimension mapping accuracy, and target-score manipulation, respectively, across inventories and models.
Abstract
Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.
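As a rough illustration of the three metrics named in the TL;DR, the sketch below computes each one on toy inputs. It is not the authors' implementation. Several details are assumptions not stated in the text: that $AED$ denotes an average character-level edit distance between a model's reproduction of an item and the original wording, that $F1$ is macro-averaged over dimensions, and that $MAE$ compares target scores with the scores a model actually achieves. All function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def average_edit_distance(generated: list[str], reference: list[str]) -> float:
    """Assumed reading of AED: mean edit distance over paired items (lower = more verbatim)."""
    return sum(edit_distance(g, r) for g, r in zip(generated, reference)) / len(reference)


def macro_f1(true_dims: list[str], pred_dims: list[str]) -> float:
    """Macro-averaged F1 for item-to-dimension mapping (macro averaging is an assumption)."""
    labels = set(true_dims) | set(pred_dims)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(true_dims, pred_dims))
        fp = sum(t != label and p == label for t, p in zip(true_dims, pred_dims))
        fn = sum(t == label and p != label for t, p in zip(true_dims, pred_dims))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_absolute_error(targets: list[float], achieved: list[float]) -> float:
    """MAE between requested target scores and the scores the responses produce."""
    return sum(abs(t - a) for t, a in zip(targets, achieved)) / len(targets)


# Toy inputs for illustration only; none of these values come from the paper.
aed = average_edit_distance(["abc", "abd"], ["abc", "abc"])
f1 = macro_f1(
    ["openness", "conscientiousness", "openness", "extraversion"],
    ["openness", "conscientiousness", "conscientiousness", "extraversion"],
)
mae = mean_absolute_error([3.0, 4.0], [3.5, 4.0])
```

Under these assumed definitions, a low $AED$ and a high $F1$ would point to memorized items and scoring rules, and a low $MAE$ would indicate that a model can steer its answers toward a requested score.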
