AutoIRT: Calibrating Item Response Theory Models with Automated Machine Learning

James Sharpnack; Phoebe Mulcaire; Klinton Bicknell; Geoff LaFlair; Kevin Yancey

TL;DR

A multistage fitting procedure compatible with out-of-the-box Automated Machine Learning (AutoML) tools is proposed. It uses a Monte Carlo EM outer loop with a two-stage inner loop: a non-parametric AutoML grade model is first trained on item features, and an item-specific parametric model is then fit.

Abstract

Item response theory (IRT) is a class of interpretable factor models that are widely used in computerized adaptive tests (CATs), such as language proficiency tests. Traditionally, these models are fit as parametric mixed-effects models of the probability that a test taker answers a test item (i.e., question) correctly. Neural net extensions of these models, such as BERT-IRT, require specialized architectures and parameter tuning. We propose a multistage fitting procedure that is compatible with out-of-the-box Automated Machine Learning (AutoML) tools. It is based on a Monte Carlo EM (MCEM) outer loop with a two-stage inner loop, which trains a non-parametric AutoML grade model using item features and then fits an item-specific parametric model. This procedure greatly accelerates the modeling workflow for scoring tests. We demonstrate its effectiveness by applying it to the Duolingo English Test, a high-stakes, online English proficiency test. We show that the resulting model is typically better calibrated, achieves better predictive performance, and yields more accurate scores than existing methods (non-explanatory IRT models and explanatory IRT models like BERT-IRT). Along the way, we provide a brief survey of machine learning methods for calibration of item parameters for CATs.

Paper Structure (16 sections, 16 equations, 8 figures, 5 tables, 1 algorithm)

Introduction
Machine Learning Approaches to IRT
Item Response Theory and Test Scoring
Feature-based IRT Models and BERT-IRT
Online English Proficiency Testing
Method
Item Features
Monte Carlo Expectation Maximization
AutoIRT M-step
Evaluating Calibrated Item Parameters
Experimental Results
Simulation study
Offline calibration analysis for DET
Offline and online reliability analysis for DET
Conclusions
...and 1 more section

Figures (8)

Figure 1: Example item response functions (IRFs) for two Y/N Vocab words fitted using AutoIRT; $\theta$ is on a $(0,10)$ scale. The item "ganderby" has $a = 0.666, d = 3.969, c = 0.25$, and "landlord" has $a=1.021, d=2.572, c=0.25$.
Figure 2: Simulation of item parameters, discrimination ($a$, top) and difficulty ($d$, bottom), as a function of the item features. Each dot represents an item.
Figure 3: Warm-start evaluation on simulated held-out sessions: the correlation between the average grade across items and the average predicted grades (top) and the correlation between the score and latent ability $\theta$ (bottom).
Figure 4: Cold-start evaluation on simulated held-out items: the correlation between the average grade across items and the average predicted grades (top) and the test loss for the held-out items and sessions (bottom).
Figure 5: Calibration comparison of three methods for Y/N Vocab in the Jump-start 20 setting; the mean grades and mean predicted probabilities are averages grouped by $\theta$ percentile. These two means should match for a well-calibrated model.
...and 3 more figures
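
The captions for Figures 1 and 5 can be made concrete with a small sketch. It assumes the common three-parameter logistic form $p(\theta) = c + (1 - c)\,\sigma(a(\theta - d))$; the paper's exact parameterization is not shown in this section, so that form is an assumption, as are the function names `irf` and `calibration_by_theta_bin`. Only the item parameters are taken from the Figure 1 caption. The second function illustrates one plausible way to compare mean grades with mean predicted probabilities per $\theta$-percentile group, in the spirit of Figure 5.

```python
import math

# Item parameters quoted in the Figure 1 caption.
ITEMS = {
    "ganderby": {"a": 0.666, "d": 3.969, "c": 0.25},
    "landlord": {"a": 1.021, "d": 2.572, "c": 0.25},
}


def irf(theta, a, d, c):
    """Probability of a correct response under an ASSUMED 3PL form:
    c + (1 - c) * sigmoid(a * (theta - d))."""
    z = a * (theta - d)
    return c + (1.0 - c) / (1.0 + math.exp(-z))


def calibration_by_theta_bin(thetas, grades, probs, n_bins=10):
    """Split responses into equal-count bins by theta rank and return, per bin,
    (mean observed grade, mean predicted probability).
    Hypothetical helper; the two means should be close for a calibrated model."""
    order = sorted(range(len(thetas)), key=lambda i: thetas[i])
    rows = []
    for b in range(n_bins):
        lo = b * len(order) // n_bins
        hi = (b + 1) * len(order) // n_bins
        idx = order[lo:hi]
        if not idx:  # skip empty bins when there are fewer responses than bins
            continue
        mean_grade = sum(grades[i] for i in idx) / len(idx)
        mean_prob = sum(probs[i] for i in idx) / len(idx)
        rows.append((mean_grade, mean_prob))
    return rows


if __name__ == "__main__":
    # Evaluate both example items on a coarse grid over the (0, 10) scale.
    for name, params in ITEMS.items():
        curve = [round(irf(t, **params), 3) for t in range(0, 11, 2)]
        print(name, curve)
```

Under this assumed form, $c$ is the lower asymptote (the chance of a correct response at very low ability), $d$ is the ability at which the curve is halfway between $c$ and 1, and $a$ controls how steeply it rises there.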
