Scalable Learning of Item Response Theory Models

Susanne Frick; Amer Krivošija; Alexander Munteanu

TL;DR

This work tackles scalable learning for Item Response Theory (IRT) models in the regime of very large numbers of examinees $n$ and items $m$ by introducing coreset-based data summarization within the standard alternating optimization framework. It leverages the close link between 2PL IRT subproblems and logistic regression, constructing provably small coresets via sensitivity sampling and leverage-score techniques, and extends the approach to the more challenging 3PL model. The authors provide concrete sublinear coreset bounds for both 2PL and 3PL, along with an algorithmic pipeline that remains constant across iterations, yielding substantial computational savings while preserving statistical accuracy. Empirical results on synthetic data and real-world datasets (SHARE, NEPS) show significant speedups and memory reductions with only minor degradation in parameter estimates, demonstrating the practicality of scalable IRT learning for large-scale assessments and ML benchmarks. This work thus enables large-scale psychometrics and model-based evaluation tasks that were previously computationally prohibitive, and lays groundwork for applying coresets and sketching to broader IRT families and future solver improvements.
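To make the link between 2PL subproblems and logistic regression concrete, the following sketch writes out one common textbook parameterization of the 2PL model. The response coding $y_{ij}\in\{-1,+1\}$, the auxiliary symbols $\alpha_j,\beta_j$, and the parameterization itself are assumptions made for illustration; the paper's own notation may differ.

```latex
% Assumed standard 2PL form; responses coded y_{ij} \in \{-1, +1\}.
\begin{align*}
P(y_{ij} = 1 \mid \theta_i, a_j, b_j)
  &= \frac{1}{1 + \exp\bigl(-a_j(\theta_i - b_j)\bigr)}, \\
% Item subproblem: abilities \theta_1, \dots, \theta_n held fixed.
\min_{\alpha_j, \beta_j} \;
  & \sum_{i=1}^{n} \ln\Bigl(1 + \exp\bigl(-y_{ij}(\alpha_j \theta_i + \beta_j)\bigr)\Bigr),
  \qquad \alpha_j = a_j,\; \beta_j = -a_j b_j, \\
% Examinee subproblem: item parameters held fixed.
\min_{\theta_i} \;
  & \sum_{j=1}^{m} \ln\Bigl(1 + \exp\bigl(-y_{ij}(a_j \theta_i - a_j b_j)\bigr)\Bigr).
\end{align*}
```

Under this assumed form, each item subproblem is a two-parameter logistic regression over the $n$ points $(\theta_i, 1)$, and each examinee subproblem is a one-parameter logistic regression with offsets over $m$ points. These are the sums that a coreset would approximate with a small weighted subset.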

Abstract

Item Response Theory (IRT) models aim to assess latent abilities of $n$ examinees along with latent difficulty characteristics of $m$ test items from categorical data that indicates the quality of their corresponding answers. Classical psychometric assessments are based on a relatively small number of examinees and items, say a class of $200$ students solving an exam comprising $10$ problems. More recent global large scale assessments such as PISA, or internet studies, may lead to significantly increased numbers of participants. Additionally, in the context of Machine Learning where algorithms take the role of examinees and data analysis problems take the role of items, both $n$ and $m$ may become very large, challenging the efficiency and scalability of computations. To learn the latent variables in IRT models from large data, we leverage the similarity of these models to logistic regression, which can be approximated accurately using small weighted subsets called coresets. We develop coresets for their use in alternating IRT training algorithms, facilitating scalable learning from large data.
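As a rough illustration of approximating a logistic regression loss with a small weighted subset, the sketch below importance-samples rows by leverage score plus a uniform term and compares the weighted loss on the sample with the loss on the full data. It is not the paper's construction: the function names, the uniform mixing term, the sample size, and the simulated item parameters are all hypothetical choices made for this example.

```python
import numpy as np

def leverage_scores(Z):
    """Squared row norms of an orthonormal basis for the column space of Z."""
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    U = U[:, s > 1e-12 * s.max()]  # drop numerically null directions
    return np.sum(U**2, axis=1)

def sample_coreset(Z, k, rng):
    """Importance-sample k rows with replacement; returns (indices, weights)."""
    n = Z.shape[0]
    scores = leverage_scores(Z) + 1.0 / n  # uniform term is an assumption
    probs = scores / scores.sum()
    idx = rng.choice(n, size=k, replace=True, p=probs)
    weights = 1.0 / (k * probs[idx])  # makes the weighted loss unbiased
    return idx, weights

def logistic_loss(Z, beta, weights=None):
    """Sum of ln(1 + exp(-z_i . beta)), optionally weighted per row."""
    losses = np.logaddexp(0.0, -(Z @ beta))
    return float(losses.sum()) if weights is None else float(weights @ losses)

rng = np.random.default_rng(0)
n, k = 20_000, 2_000
theta = rng.standard_normal(n)        # abilities, held fixed
a_true, b_true = 1.5, 0.3             # hypothetical item parameters
p = 1.0 / (1.0 + np.exp(-a_true * (theta - b_true)))
y = np.where(rng.random(n) < p, 1.0, -1.0)
# Rows y_i * (theta_i, 1): the item subproblem as logistic regression.
Z = y[:, None] * np.column_stack([theta, np.ones(n)])

idx, w = sample_coreset(Z, k, rng)
beta = np.array([1.0, 0.0])           # arbitrary query point
full = logistic_loss(Z, beta)
approx = logistic_loss(Z[idx], beta, w)
rel_err = abs(full - approx) / full
print(f"full={full:.1f} coreset={approx:.1f} rel_err={rel_err:.4f}")
```

The reweighting by `1 / (k * probs[idx])` keeps the sampled loss an unbiased estimate of the full loss at any fixed query point. A coreset guarantee is stronger, since it requires the approximation to hold uniformly over all parameter values.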

Paper Structure (32 sections, 25 theorems, 55 equations, 13 figures, 20 tables)

INTRODUCTION
Our Contributions
Related Work
Development of IRT
IRT in Machine Learning
Coresets for Logistic Regression
PRELIMINARIES
IRT Models
Coresets for the IRT Framework
Constructing Coresets
CORESETS FOR IRT MODELS
2PL Models
3PL Models
EXPERIMENTS
Experimental Setup
...and 17 more sections

Key Result

Lemma 3.1

Suppose we are given a matrix $X\in \mathbb{R}^{m\times n}$ (for any $m,n\in \mathbb{N}$) and an arbitrary diagonal matrix $D=(d_{ij})_{i,j\in [m]}$, with $d_{ij}\in \{-1,1\}$ if $i=j$, and $d_{ij}=0$ otherwise. Then the leverage scores of $X$ are the same as the leverage score…
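The lemma statement is cut off in this listing. Assuming it concludes that the leverage scores of $X$ equal those of $DX$, the claim can be checked numerically with the sketch below. The reasoning under that assumption: the projection matrix of $DX$ is $D\,X(X^\top X)^{-1}X^\top D$, whose diagonal entries are $d_{ii}^2$ times those of the projection matrix of $X$, and $d_{ii}^2 = 1$. The helper name `leverage_scores` and the test dimensions are hypothetical.

```python
import numpy as np

def leverage_scores(X):
    """Squared row norms of an orthonormal basis for the column space of X."""
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    U = U[:, s > 1e-12 * s.max()]  # drop numerically null directions
    return np.sum(U**2, axis=1)

rng = np.random.default_rng(1)
m, n = 50, 4
X = rng.standard_normal((m, n))
d = rng.choice([-1.0, 1.0], size=m)   # diagonal of D, entries in {-1, 1}
DX = d[:, None] * X                   # same as np.diag(d) @ X

scores_X = leverage_scores(X)
scores_DX = leverage_scores(DX)
print(np.max(np.abs(scores_X - scores_DX)))
```

One practical reading, under the same assumption: multiplying each data row by its $\pm 1$ label, as is commonly done when writing the logistic loss, would not change the leverage-score sampling distribution.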

Figures (13)

Figure 1: Item Characteristic Curve examples
Figure 2: 2PL Experiments on real-world SHARE and NEPS data: Coreset sizes vs. relative error and mean absolute deviation (MAD), cf. \ref{tab:results_appendix3:b} and \ref{fig:param_exp_appendix_pareto}.
Figure 3: Parameter estimates for the coresets compared to the full data sets. The first row shows the bias for the item parameters $a,b$ (and $c$ for 3PL). The vertical axis is scaled to display $2\,{\mathrm{std.}}$ ($4\,{\mathrm{std.}}$ for 3PL) of the parameter estimate obtained from the full data set. The second row shows a kernel density estimate for the ability parameters $\theta$, standardized to zero mean and unit variance, with a LOESS regression line in dark green.
Figure 4: 2PL Experiments on synthetic data: Parameter estimates for the coresets compared to the full data sets. For each experiment the upper figure shows the bias for the item parameters $a$ and $b$. The lower figure shows a kernel density estimate for the ability parameters $\theta$ with a LOESS regression line in dark green. The ability parameters were standardized to zero mean and unit variance. In all rows, the vertical axis is scaled such as to display $2\,{\mathrm{std.}}$ of the corresponding parameter estimate obtained from the full data set.
Figure 5: 2PL Experiments on synthetic data: Parameter estimates for the coresets compared to the full data sets. For each experiment the upper figure shows the bias for the item parameters $a$ and $b$. The lower figure shows a kernel density estimate for the ability parameters $\theta$ with a LOESS regression line in dark green. The ability parameters were standardized to zero mean and unit variance. In all rows, the vertical axis is scaled such as to display $2\,{\mathrm{std.}}$ of the corresponding parameter estimate obtained from the full data set.
...and 8 more figures

Theorems & Definitions (49)

Lemma 3.1
Theorem 3.2
Theorem 3.3
Theorem 3.4: Informal version of \ref{thm:quality:coreset} in \ref{sec:quality:coreset}
Definition A.1: Coreset, cf. FeldmanSS20
Definition A.2: Sensitivity, LangbergS10
Definition A.3: Range space; VC dimension
Definition A.4: Induced range space
Theorem A.5: FeldmanSS20, Theorem 31
Definition A.6: Leverage scores, cf. DrineasMMW12
...and 39 more
