Scalable Learning of Item Response Theory Models

Susanne Frick, Amer Krivošija, Alexander Munteanu

TL;DR

This work tackles scalable learning for Item Response Theory (IRT) models when both the number of examinees $n$ and the number of items $m$ are very large, by introducing coreset-based data summarization within the standard alternating optimization framework. It leverages the close link between 2PL IRT subproblems and logistic regression, constructing provably small coresets via sensitivity sampling and leverage-score techniques, and extends the approach to the more challenging 3PL model. The authors provide concrete sublinear coreset bounds for both 2PL and 3PL, along with an algorithmic pipeline that remains constant across iterations, yielding substantial computational savings while preserving statistical accuracy. Empirical results on synthetic data and real-world datasets (SHARE, NEPS) show significant speedups and memory reductions with only minor degradation in parameter estimates, demonstrating the practicality of scalable IRT learning for large-scale assessments and ML benchmarks. This work thus enables large-scale psychometrics and model-based evaluation tasks that were previously computationally prohibitive, and lays groundwork for applying coresets to broader IRT families and future solver improvements.
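To make the idea of a weighted coreset obtained by importance sampling more concrete, the following is a minimal Python sketch. It is not the authors' construction and does not reproduce their bounds: the sampling distribution (leverage scores mixed with a uniform term), the function names, and the choice of sampling with replacement are all assumptions made for illustration.

```python
import numpy as np

def leverage_scores(X):
    # Leverage score of row i = squared norm of row i of an
    # orthonormal basis Q for the column space of X.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

def sample_coreset(X, k, rng):
    # Assumed sampling distribution: leverage scores plus a uniform
    # term 1/n, normalized to sum to one (illustrative choice only).
    n = X.shape[0]
    scores = leverage_scores(X) + 1.0 / n
    p = scores / scores.sum()
    idx = rng.choice(n, size=k, replace=True, p=p)
    # Weights 1/(k * p_i) make the weighted loss an unbiased
    # estimate of the loss on the full data.
    weights = 1.0 / (k * p[idx])
    return idx, weights

def logistic_loss(Z, beta, weights=None):
    # Sum of log(1 + exp(z_i . beta)), optionally weighted per row.
    losses = np.logaddexp(0.0, Z @ beta)
    return losses.sum() if weights is None else weights @ losses

# Hypothetical demo data: 20000 rows, 3 columns.
rng = np.random.default_rng(0)
Z = rng.normal(size=(20000, 3))
beta = np.array([0.5, -1.0, 0.25])
idx, w = sample_coreset(Z, 2000, rng)
print(logistic_loss(Z, beta), logistic_loss(Z[idx], beta, w))
```

The two printed values compare the loss on all rows with the weighted loss on the sampled rows for one fixed parameter vector. A coreset guarantee is stronger, since it requires the approximation to hold for all parameter vectors simultaneously.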

Abstract

Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data that indicates the quality of their corresponding answers. Classical psychometric assessments are based on a relatively small number of examinees and items, say a class of $200$ students solving an exam comprising $10$ problems. More recent global large-scale assessments such as PISA, or internet studies, may lead to significantly increased numbers of participants. Additionally, in the context of Machine Learning where algorithms take the role of examinees and data analysis problems take the role of items, both $n$ and $m$ may become very large, challenging the efficiency and scalability of computations. To learn the latent variables in IRT models from large data, we leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets. We develop coresets for use in alternating IRT training algorithms, facilitating scalable learning from large data.
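The similarity to logistic regression mentioned in the abstract can be checked numerically. The sketch below assumes the common 2PL parameterization $P(y_{ij}=1)=\sigma(a_j(\theta_i-b_j))$, which this summary does not spell out, and the function names are hypothetical. With the abilities $\theta$ held fixed, the negative log-likelihood of one item equals a logistic regression loss with features $(\theta_i, 1)$ and coefficients $(a_j, -a_j b_j)$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def nll_2pl_item(y, theta, a, b):
    # Negative log-likelihood of one item under the assumed 2PL
    # parameterization P(y=1) = sigmoid(a * (theta - b)).
    p = sigmoid(a * (theta - b))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def nll_logistic(y, X, beta):
    # Standard logistic regression loss with labels mapped to {-1, +1}.
    signs = 2 * y - 1
    return np.sum(np.logaddexp(0.0, -signs * (X @ beta)))

# Hypothetical data: 200 examinees answering a single item.
rng = np.random.default_rng(0)
theta = rng.normal(size=200)
a, b = 1.3, -0.4
y = (rng.random(200) < sigmoid(a * (theta - b))).astype(float)

# Fixed abilities act as features; the item parameters act as coefficients.
X = np.column_stack([theta, np.ones_like(theta)])
beta = np.array([a, -a * b])

print(nll_2pl_item(y, theta, a, b), nll_logistic(y, X, beta))
```

Under this assumed parameterization, the two printed values agree up to floating-point error. An analogous rewriting applies to the ability parameters when the item parameters are held fixed, which is what makes the alternating scheme a sequence of logistic-regression-like subproblems.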

Paper Structure

This paper contains 32 sections, 25 theorems, 55 equations, 13 figures, and 20 tables.

Key Result

Lemma 3.1

Suppose we are given a matrix $X\in \mathbb{R}^{m\times n}$ (for any $m,n\in \mathbb{N}$) and an arbitrary diagonal matrix $D=(d_{i j})_{i\in [m], j\in [m]}$, with $d_{i j}\in \lbrace -1,1\rbrace$ if $i=j$, and $d_{i j}=0$ otherwise. Then the leverage scores of $X$ are the same as the leverage score
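The lemma statement is cut off in this summary. Assuming it goes on to compare the leverage scores of $X$ with those of $DX$ (an assumption, since the remainder is not shown), the claim can be checked numerically with the short Python sketch below. The reasoning is that $(DX)^\top DX = X^\top D^2 X = X^\top X$, and flipping the sign of a row does not change its quadratic form.

```python
import numpy as np

def leverage_scores(X):
    # Squared row norms of an orthonormal basis for the column space of X.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

# Hypothetical example: 50 rows, 4 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Diagonal of D with entries in {-1, +1}; D @ X flips the sign of some rows.
d = rng.choice([-1.0, 1.0], size=50)
DX = d[:, None] * X

print(np.allclose(leverage_scores(X), leverage_scores(DX)))
```

In a logistic regression setting, such a sign matrix typically encodes the binary labels, so an invariance of this kind would mean the leverage scores can be computed without looking at the labels.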

Figures (13)

  • Figure 1: Item Characteristic Curve examples
  • Figure 2: 2PL Experiments on real world SHARE and NEPS data: Coreset sizes vs. relative error and mean absolute deviation (MAD), cf. \ref{tab:results_appendix3:b} and \ref{fig:param_exp_appendix_pareto}.
  • Figure 3: Parameter estimates for the coresets compared to the full data sets. The first row shows the bias for the item parameters $a,b$ (and $c$ for 3PL). The vertical axis is scaled to display $2\,{\mathrm{std.}}$ ($4\,{\mathrm{std.}}$ for 3PL) of the parameter estimate obtained from the full data set. The second row shows a kernel density estimate for the ability parameters $\theta$, standardized to zero mean and unit variance, with a LOESS regression line in dark green.
  • Figure 4: 2PL Experiments on synthetic data: Parameter estimates for the coresets compared to the full data sets. For each experiment the upper figure shows the bias for the item parameters $a$ and $b$. The lower figure shows a kernel density estimate for the ability parameters $\theta$ with a LOESS regression line in dark green. The ability parameters were standardized to zero mean and unit variance. In all rows, the vertical axis is scaled to display $2\,{\mathrm{std.}}$ of the corresponding parameter estimate obtained from the full data set.
  • Figure 5: 2PL Experiments on synthetic data: Parameter estimates for the coresets compared to the full data sets. For each experiment the upper figure shows the bias for the item parameters $a$ and $b$. The lower figure shows a kernel density estimate for the ability parameters $\theta$ with a LOESS regression line in dark green. The ability parameters were standardized to zero mean and unit variance. In all rows, the vertical axis is scaled to display $2\,{\mathrm{std.}}$ of the corresponding parameter estimate obtained from the full data set.
  • ...and 8 more figures

Theorems & Definitions (49)

  • Lemma 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Theorem 3.4: Informal version of \ref{thm:quality:coreset} in \ref{sec:quality:coreset}
  • Definition A.1: Coreset, cf. FeldmanSS20
  • Definition A.2: Sensitivity, LangbergS10
  • Definition A.3: Range space; VC dimension
  • Definition A.4: Induced range space
  • Theorem A.5: FeldmanSS20, Theorem 31
  • Definition A.6: Leverage scores, cf. DrineasMMW12
  • ...and 39 more