Bayesian information theoretic model-averaging stochastic item selection for computer adaptive testing
Tina Su, Edison Choe, Joshua C. Chang
TL;DR
The paper addresses item exposure in CAT by reframing next-item selection as Bayesian model averaging across candidate future-ability models. It derives a cross-entropy-based discrepancy $\Delta_t^{(i)}$ for each candidate item $i$ and assigns sampling weights $p^{(t+1)}(i) \propto \exp(-\Delta_t^{(i)})$ to realize optimal stochastic mixing, using mean-field variational Bayesian EM (VBEM) to approximate the marginal posterior $q_\theta(\theta)$. Empirically, it evaluates the approach on the Work Disability Functional Assessment Battery (WD-FAB), which comprises eight independent IRT models, and reports superior item exposure with no loss of accuracy or efficiency relative to traditional greedy and stochastic selectors. The framework provides a principled link between model averaging, information-theoretic criteria, and stochastic CAT, with practical benefits for robust ability estimation and fair item usage.
Abstract
Computer Adaptive Testing (CAT) aims to accurately estimate an individual's ability using only a subset of an Item Response Theory (IRT) instrument. For many applications of CAT, one also needs to ensure diverse item exposure across different testing sessions, preventing any single item from being over- or underutilized. In CAT, items are selected sequentially based on a running estimate of a respondent's ability. Prior methods almost universally view item selection through an optimization lens, motivating greedy item selection procedures. While efficient, these deterministic methods tend to have poor item exposure. Existing stochastic methods for item selection are ad hoc, with item sampling weights that lack theoretical justification. In this manuscript, we formulate stochastic CAT as a Bayesian model averaging problem. We seek item sampling probabilities, treated in the long-run frequentist sense, that perform optimal model averaging for the ability estimate in a Bayesian sense. In doing so, we derive a cross-entropy information criterion that yields optimal stochastic mixing. We tested our new method on the eight independent IRT models that comprise the Work Disability Functional Assessment Battery, comparing it to prior art. We found that our stochastic methodology had superior item exposure while not compromising test accuracy or efficiency.
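As an illustration of the sampling rule $p^{(t+1)}(i) \propto \exp(-\Delta_t^{(i)})$ stated in the TL;DR, the following minimal Python sketch converts a vector of per-item discrepancies into selection probabilities and draws the next item. It assumes the discrepancies have already been computed; the derivation of $\Delta_t^{(i)}$ and the VBEM approximation are not reproduced here. The function names `selection_probabilities` and `sample_next_item`, and the masking of already-administered items, are assumptions made for this example rather than part of the described method.

```python
import numpy as np


def selection_probabilities(discrepancies, administered=None):
    """Map discrepancies Delta_i to probabilities p(i) proportional to exp(-Delta_i).

    `administered` is an optional boolean mask (an assumption of this sketch):
    items flagged True receive probability zero.
    """
    logits = -np.asarray(discrepancies, dtype=float)
    if administered is not None:
        mask = np.asarray(administered, dtype=bool)
        logits = np.where(mask, -np.inf, logits)
    if not np.isfinite(logits).any():
        raise ValueError("no items are available for selection")
    # Subtract the maximum before exponentiating for numerical stability.
    weights = np.exp(logits - np.max(logits))
    return weights / weights.sum()


def sample_next_item(discrepancies, administered=None, rng=None):
    """Draw the index of the next item according to the probabilities above."""
    rng = np.random.default_rng() if rng is None else rng
    p = selection_probabilities(discrepancies, administered)
    return int(rng.choice(len(p), p=p))


# Hypothetical discrepancies for four items; item 1 was already administered.
delta = [0.2, 1.5, 0.9, 3.0]
seen = [False, True, False, False]
probs = selection_probabilities(delta, seen)
next_item = sample_next_item(delta, seen, rng=np.random.default_rng(0))
```

Items with smaller discrepancy are drawn more often, yet every remaining item keeps a nonzero chance of selection, in contrast to a greedy rule that would always pick the item with the smallest discrepancy.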
