Prompt Engineering for Scale Development in Generative Psychometrics

Lara Lee Russell-Lasalandra, Hudson Golino

Abstract

This Monte Carlo simulation examines how prompt engineering strategies shape the quality of large language model (LLM)-generated personality assessment items within the AI-GENIE framework for generative psychometrics. Item pools targeting the Big Five traits were generated using multiple prompting designs (zero-shot, few-shot, persona-based, and adaptive), model temperatures, and LLMs, then evaluated and reduced using network psychometric methods. Across all conditions, AI-GENIE reliably improved structural validity following reduction, with the magnitude of its incremental contribution inversely related to the quality of the incoming item pool. Prompt design exerted a substantial influence on both pre- and post-reduction item quality. Adaptive prompting consistently outperformed non-adaptive strategies by sharply reducing semantic redundancy, elevating pre-reduction structural validity, and preserving substantially larger item pools, particularly when paired with newer, higher-capacity models. These gains were robust across temperature settings for most models, indicating that adaptive prompting mitigates common trade-offs between creativity and psychometric coherence. An exception was observed for the GPT-4o model at high temperatures, suggesting model-specific sensitivity to adaptive constraints at elevated stochasticity. Overall, the findings demonstrate that adaptive prompting is the strongest approach in this context, and that its benefits scale with model capability, motivating continued investigation of model-prompt interactions in generative psychometric pipelines.

Paper Structure (24 sections, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Average NMI before and after AI-GENIE reduction across prompting conditions, models, and temperatures. AI-GENIE reliably improved NMI across virtually all conditions. Adaptive prompting substantially raised the pre-reduction NMI floor for the newest models, with GPT-4o at temperature 1.5 as the sole condition where post-reduction NMI fell below the non-adaptive baseline.
  • Figure 2: The average number of items removed during the UVA step of AI-GENIE relative to the basic prompt condition. The redundancy reduction is most notable for the adaptive prompting condition, which shows substantial gains over the baseline.
  • Figure 3: Average number of items removed at the UVA redundancy step across prompting conditions, models, and temperatures. Adaptive prompting (PER+FS+Adaptive) produced dramatically fewer removals than all other conditions, with reductions most pronounced for the newest models.
  • Figure 4: The average NMI before AI-GENIE reduction relative to the basic prompt condition. Adaptive prompting improved accuracy over the baseline for GPT-5.1 and GPT-OSS-120B. GPT-OSS-20B showed only very modest gains, while GPT-4o showed a dip in initial NMI at the high and default temperatures.
  • Figure 5: The average NMI after AI-GENIE reduction relative to the basic prompt condition. Adaptive prompting improved accuracy over the baseline for GPT-5.1, GPT-OSS-120B, and GPT-OSS-20B. GPT-4o at high temperature was a notable exception: there, adaptive prompting produced a dip in final NMI.
  • ...and 2 more figures
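
The captions above use NMI as the accuracy measure for an item pool. As a rough illustration only, the sketch below assumes NMI denotes normalized mutual information between the trait each item was written to measure and the community that item is assigned to by a network-based dimensionality estimate. The normalization used here (arithmetic mean of the two entropies) and the function names are assumptions for illustration; the paper's exact variant is not stated in this section.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length labelings."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (c / n) * log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def nmi(intended, estimated):
    """NMI normalized by the arithmetic mean of the two entropies (assumed variant)."""
    if len(intended) != len(estimated):
        raise ValueError("labelings must have the same length")
    h_a, h_b = entropy(intended), entropy(estimated)
    if h_a == 0 and h_b == 0:
        # Both labelings put every item in a single group: treat as perfect agreement.
        return 1.0
    return mutual_information(intended, estimated) / ((h_a + h_b) / 2)


# Hypothetical toy pool: six items written for three traits.
intended = ["O", "O", "C", "C", "E", "E"]
recovered = [2, 2, 0, 0, 1, 1]   # same grouping, different community ids
blurred = [0, 0, 0, 1, 1, 1]     # one community mixes two traits

perfect_score = nmi(intended, recovered)
blurred_score = nmi(intended, blurred)
```

NMI depends only on how items are grouped, not on the community labels themselves, so `recovered` scores as full agreement while `blurred` scores strictly between 0 and 1.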