AutoIRT: Calibrating Item Response Theory Models with Automated Machine Learning

James Sharpnack, Phoebe Mulcaire, Klinton Bicknell, Geoff LaFlair, Kevin Yancey

TL;DR

We propose a multistage fitting procedure that is compatible with out-of-the-box Automated Machine Learning (AutoML) tools. It uses a Monte Carlo EM outer loop with a two-stage inner loop, which trains a non-parametric AutoML grade model on item features and then fits an item-specific parametric model.

Abstract

Item response theory (IRT) is a class of interpretable factor models that are widely used in computerized adaptive tests (CATs), such as language proficiency tests. Traditionally, these are fit using parametric mixed effects models on the probability of a test taker getting the correct answer to a test item (i.e., question). Neural net extensions of these models, such as BERT-IRT, require specialized architectures and parameter tuning. We propose a multistage fitting procedure that is compatible with out-of-the-box Automated Machine Learning (AutoML) tools. It is based on a Monte Carlo EM (MCEM) outer loop with a two-stage inner loop, which trains a non-parametric AutoML grade model using item features followed by an item-specific parametric model. This greatly accelerates the modeling workflow for scoring tests. We demonstrate its effectiveness by applying it to the Duolingo English Test, a high-stakes, online English proficiency test. We show that the resulting model is typically better calibrated, achieves better predictive performance, and yields more accurate scores than existing methods (non-explanatory IRT models and explanatory IRT models like BERT-IRT). Along the way, we provide a brief survey of machine learning methods for calibration of item parameters for CATs.
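To make the "probability of a test taker getting the correct answer" concrete, the following minimal sketch evaluates an item response function for the two items whose parameters appear in the Figure 1 caption. The parameterization is not stated in this section, so the code assumes a common three-parameter logistic (3PL) form, $p(\theta) = c + (1 - c)\,\sigma(a(\theta - d))$, with discrimination $a$, difficulty $d$, and guessing floor $c$. The function name `irf_3pl` is hypothetical and not taken from the paper.

```python
import math

def irf_3pl(theta, a, d, c):
    """Probability of a correct response under an ASSUMED 3PL form:
    c + (1 - c) * sigmoid(a * (theta - d)). The paper's exact
    parameterization may differ (e.g. an intercept form a * theta + d)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - d)))

# Item parameters quoted in the Figure 1 caption.
items = {
    "ganderby": {"a": 0.666, "d": 3.969, "c": 0.25},
    "landlord": {"a": 1.021, "d": 2.572, "c": 0.25},
}

# Evaluate each curve on a coarse grid over the (0, 10) ability scale.
for name, params in items.items():
    curve = [round(irf_3pl(theta, **params), 3) for theta in range(0, 11, 2)]
    print(name, curve)
```

Under this assumed form, every curve rises monotonically from the floor $c$ toward 1 and passes through $c + (1 - c)/2$ at $\theta = d$. If the paper uses a different parameterization, only the expression inside `irf_3pl` would need to change.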

Paper Structure (16 sections, 16 equations, 8 figures, 5 tables, 1 algorithm)

Figures (8)

  • Figure 1: Example item response functions (IRFs) for two Y/N Vocab words fitted using AutoIRT; $\theta$ is on a $(0,10)$ scale. The item "ganderby" has $a = 0.666$, $d = 3.969$, $c = 0.25$, and "landlord" has $a = 1.021$, $d = 2.572$, $c = 0.25$.
  • Figure 2: Simulation of item parameters, discrimination ($a$, top) and difficulty ($d$, bottom), as a function of the item features. Each dot represents an item.
  • Figure 3: Warm-start evaluation on simulated held-out sessions: the correlation between the average grade across items and the average predicted grades (top) and the correlation between the score and latent ability $\theta$ (bottom).
  • Figure 4: Cold-start evaluation on simulated held-out items: the correlation between the average grade across items and the average predicted grades (top) and the test loss for the held-out items and sessions (bottom).
  • Figure 5: Calibration comparison of three methods for Y/N Vocab in the Jump-start 20 setting; the mean grades and predicted probabilities are averages grouped by $\theta$ percentile. These two means should match for a well-calibrated model.
  • ...and 3 more figures
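
The calibration check described in the Figure 5 caption (average the observed grades and the predicted probabilities within groups defined by $\theta$ percentile, then compare the two means) can be sketched as follows. This is an illustrative sketch only: the function name `calibration_table`, the number of bins, and the synthetic responses are assumptions made for the example, not the authors' code or data.

```python
import math
import random

def calibration_table(theta, grades, pred_prob, n_bins=10):
    """Group responses into theta-percentile bins and return, per bin,
    (bin index, mean observed grade, mean predicted probability)."""
    n = len(theta)
    order = sorted(range(n), key=lambda i: theta[i])  # indices sorted by theta
    sums = [[0, 0.0, 0.0] for _ in range(n_bins)]     # count, grade sum, prob sum
    for rank, i in enumerate(order):
        b = rank * n_bins // n                        # percentile bin for this rank
        sums[b][0] += 1
        sums[b][1] += grades[i]
        sums[b][2] += pred_prob[i]
    return [(b, g / k, p / k) for b, (k, g, p) in enumerate(sums) if k > 0]

# Synthetic, purely illustrative data: grades are drawn from the same
# probabilities that serve as predictions, so the two means should agree
# up to sampling noise.
random.seed(0)
theta = [random.uniform(0, 10) for _ in range(5000)]
pred = [0.25 + 0.75 / (1 + math.exp(-(t - 4.0))) for t in theta]
grades = [1 if random.random() < p else 0 for p in pred]

for b, mean_grade, mean_pred in calibration_table(theta, grades, pred):
    print(f"bin {b}: mean grade {mean_grade:.3f}, mean predicted {mean_pred:.3f}")
```

A model whose predicted probabilities sit consistently above or below the mean grades in some ability range is miscalibrated there, which is the pattern such a comparison is meant to expose.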