Leveraging LLM-Respondents for Item Evaluation: a Psychometric Analysis

Yunting Liu; Shreya Bhandari; Zachary A. Pardos

TL;DR

Some LLMs have comparable or higher proficiency in College Algebra than college students. No single LLM mimics human respondents, because each has a narrow proficiency distribution, but an ensemble of LLMs better resembles college students' ability distribution.

Abstract

Effective educational measurement relies heavily on the curation of well-designed item pools (i.e., pools whose items possess the right psychometric properties). However, item calibration is time-consuming and costly, because it requires a sufficient number of respondents for the response process. We explore using six different LLMs (GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus), and various combinations of them obtained with sampling methods, to produce responses with psychometric properties similar to those of human answers. Results show that some LLMs have comparable or higher proficiency in College Algebra than college students. No single LLM mimics human respondents due to narrow proficiency distributions, but an ensemble of LLMs can better resemble college students' ability distribution. The item parameters calibrated by LLM-Respondents correlate highly (e.g., > 0.8 for GPT-3.5) with their human-calibrated counterparts, and closely resemble the parameters of the human subset (e.g., 0.02 Spearman correlation difference). Several augmentation strategies are evaluated for their relative performance; resampling methods prove most effective, raising the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).

Paper Structure (15 sections, 1 equation, 3 figures, 2 tables)

Introduction
Related Work
Simulated Data in Educational Measurement and Educational Data Mining
Data Augmentation
OER and automation
Methods
Model Selection
Selection of Items and Prompt Engineering
Augmentation Procedure
IRT analysis
Results
LLM-Respondent Simulation
Data Augmentation using LLM-Respondent
Discussion and Conclusions
Limitations and Future Work

Figures (3)

Figure 1: Item parameters calibrated by human respondents
Figure 2: Proficiency distribution by Generating Models
Figure 3: Proficiency distribution by Augmentation Experiments
