The creative psychometric item generator: a framework for item generation and validation using large language models

Antonio Laverghetta, Simone Luchini, Averie Linell, Roni Reiter-Palmon, Roger Beaty

TL;DR

This work develops the creative psychometric item generator (CPIG), a psychometrically inspired framework for creating test items for a classic free-response creativity test, the creative problem-solving (CPS) task, and finds strong empirical evidence that CPIG generates valid and reliable items.

Abstract

Increasingly, large language models (LLMs) are being used to automate workplace processes requiring a high degree of creativity. While much prior work has examined the creativity of LLMs, there has been little research on whether they can generate valid creativity assessments for humans despite the increasingly central role of creativity in modern economies. We develop a psychometrically inspired framework for creating test items (questions) for a classic free-response creativity test: the creative problem-solving (CPS) task. Our framework, the creative psychometric item generator (CPIG), uses a mixture of LLM-based item generators and evaluators to iteratively develop new prompts for writing CPS items, such that items from later iterations will elicit more creative responses from test takers. We find strong empirical evidence that CPIG generates valid and reliable items and that this effect is not attributable to known biases in the evaluation process. Our findings have implications for employing LLMs to automatically generate valid and reliable creativity tests for humans and AI.
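
The iterative procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `cpig_rounds`, `generate_items`, `respond`, and `score_originality`, the `n_rounds` and `n_shots` defaults, the choice to average originality over responders, the top-k selection rule, and the toy stand-ins that replace real LLM calls are all hypothetical.

```python
from typing import Callable, List, Tuple


def cpig_rounds(
    base_instruction: str,
    generate_items: Callable[[str, List[str]], List[str]],
    respond: Callable[[str, str], str],
    score_originality: Callable[[str], float],
    profiles: List[str],
    n_rounds: int = 5,   # assumed default, not taken from the paper
    n_shots: int = 2,    # assumed default, not taken from the paper
) -> List[List[Tuple[str, float]]]:
    """Run a CPIG-style loop and return (item, mean originality) pairs per round."""
    shots: List[str] = []  # items carried into the next round's prompt
    history: List[List[Tuple[str, float]]] = []
    for _ in range(n_rounds):
        items = generate_items(base_instruction, shots)
        scored = []
        for item in items:
            # Each responder profile answers the item; responses are scored.
            scores = [score_originality(respond(item, p)) for p in profiles]
            scored.append((item, sum(scores) / len(scores)))
        # Assumed selection rule: keep the items with the highest mean score.
        scored.sort(key=lambda pair: pair[1], reverse=True)
        shots = [item for item, _ in scored[:n_shots]]
        history.append(scored)
    return history


# Toy stand-ins so the sketch runs without any LLM (purely illustrative).
def toy_generate(instruction: str, shots: List[str]) -> List[str]:
    return [f"{instruction} (variant {i}, {len(shots)} shots)" for i in range(3)]


def toy_respond(item: str, profile: str) -> str:
    return f"As a {profile}, I would handle '{item}' by ..."


def toy_score(response: str) -> float:
    return float(len(response))  # placeholder for an originality model


history = cpig_rounds(
    "Write a workplace dilemma",
    toy_generate,
    toy_respond,
    toy_score,
    profiles=["teacher", "engineer"],
    n_rounds=3,
)
for round_no, scored in enumerate(history, start=1):
    print(round_no, [round(score, 1) for _, score in scored])
```

In a real run, the three callables would wrap the item-generating LLM, the response-generating LLMs, and the originality scorer; the loop structure itself stays the same.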

Paper Structure (18 sections, 2 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Overview of CPIG. From a base instruction, we prompt an LLM to generate CPS items, which are, in turn, completed by other LLMs. We give each LLM response generator a distinct profile to increase variability in the originality of their solutions. These responses are scored with an originality model developed by Luchini et al. (2023), and a subset of the generated items with highly original responses is selected for inclusion in the prompt for the next round of item generation. This figure was designed using images from Flaticon.com.
  • Figure 2: Mean originality scores from each item generator on the first and last rounds, for all trials that did not use random shot selection. Error bars are standard deviations in scores. Higher values indicate more original item responses, on average.
  • Figure 3: Pearson correlation between item response length and originality score. Length is calculated using the NLTK word tokenizer.
  • Figure 4: Joint histogram of originality and similarity scores for round five items. The highest quality items are those in the bottom right region. Note that we have dropped all items whose cosine similarity to any other item was greater than $0.95$.
  • Figure 5: Distributions of originality (a) and similarity (b) scores, broken down by prompt types and shot selection strategy.
  • ...and 1 more figure
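
Two of the analyses named in the captions above, the length and originality correlation (Figure 3) and the near-duplicate filter (Figure 4), can be illustrated with a short standard-library sketch. Several points are assumptions made for the example: `str.split` stands in for the NLTK word tokenizer, items are represented by small hypothetical embedding vectors (the captions do not say how similarity vectors are obtained), and the filter drops every item that has any neighbour above the threshold, which is one literal reading of the Figure 4 caption.

```python
import math
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def drop_near_duplicates(
    embeddings: List[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Indices of items whose similarity to every other item is <= threshold."""
    keep = []
    for i, u in enumerate(embeddings):
        others = (v for j, v in enumerate(embeddings) if j != i)
        if all(cosine(u, v) <= threshold for v in others):
            keep.append(i)
    return keep


# Hypothetical responses and originality scores, for illustration only.
responses = ["talk to her", "set up a shared rota", "swap shifts and tell the boss"]
originality = [1.0, 2.5, 3.0]
lengths = [len(r.split()) for r in responses]  # whitespace split, not NLTK
length_originality_r = pearson(lengths, originality)

# Hypothetical 2-d item embeddings; the first two are near-duplicates.
item_embeddings = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]
kept = drop_near_duplicates(item_embeddings)
print(length_originality_r, kept)
```

A deduplication rule that keeps one representative of each near-duplicate cluster would be an equally plausible reading; the source does not specify which variant was used.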