Bayesian information theoretic model-averaging stochastic item selection for computer adaptive testing

Tina Su, Edison Choe, Joshua C. Chang

TL;DR

The paper addresses item exposure in CAT by reframing next-item selection as Bayesian model averaging across candidate future-ability models. It derives a cross-entropy–based discrepancy $\Delta_t^{(i)}$ for each candidate item $i$ and assigns sampling weights $p^{(t+1)}(i) \propto \exp(-\Delta_t^{(i)})$ to realize optimal stochastic mixing, using mean-field variational Bayesian EM (VBEM) to approximate the marginal posterior $q_\theta(\theta)$. Empirically, it evaluates the approach on the eight independent IRT models of the WD-FAB, showing superior item exposure while preserving accuracy and efficiency compared to traditional greedy and stochastic selectors. The framework provides a principled link between model averaging, information-theoretic criteria, and stochastic CAT, with practical benefits for robust ability estimation and fair item usage.
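As a rough illustration of the sampling rule $p^{(t+1)}(i) \propto \exp(-\Delta_t^{(i)})$, the sketch below turns a list of per-item discrepancies into normalized sampling probabilities and draws the next item. It assumes the discrepancies have already been computed; the function names `sampling_weights` and `sample_next_item` and the numeric values are hypothetical and are not taken from the paper.

```python
import math
import random


def sampling_weights(delta):
    """Map per-item discrepancies to probabilities p(i) proportional to exp(-delta[i])."""
    # Shift by the minimum before exponentiating to avoid underflow;
    # the shift cancels in the normalization, so the probabilities are unchanged.
    m = min(delta)
    w = [math.exp(-(d - m)) for d in delta]
    total = sum(w)
    return [x / total for x in w]


def sample_next_item(delta, rng):
    """Draw one item index at random according to the weights above."""
    p = sampling_weights(delta)
    idx = rng.choices(range(len(p)), weights=p, k=1)[0]
    return idx, p


# Made-up discrepancies for four remaining items (illustrative only).
delta = [0.2, 0.5, 1.5, 3.0]
rng = random.Random(0)  # seeded for reproducibility
item, p = sample_next_item(delta, rng)
```

Items with smaller discrepancy receive more weight, but every remaining item keeps a nonzero chance of being selected, which is what spreads exposure across sessions compared with a greedy selector that always picks the minimum.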

Abstract

Computer Adaptive Testing (CAT) aims to accurately estimate an individual's ability using only a subset of an Item Response Theory (IRT) instrument. For many applications of CAT, one also needs to ensure diverse item exposure across different testing sessions, preventing any single item from being over- or underutilized. In CAT, items are selected sequentially based on a running estimate of a respondent's ability. Prior methods almost universally see item selection through an optimization lens, motivating greedy item selection procedures. While efficient, these deterministic methods tend to have poor item exposure. Existing stochastic methods for item selection are ad hoc, with item sampling weights that lack theoretical justification. In this manuscript, we formulate stochastic CAT as a Bayesian model averaging problem. We seek item sampling probabilities, treated in the long-run frequentist sense, that perform optimal model averaging for the ability estimate in a Bayesian sense. In doing so, we derive a cross-entropy information criterion that yields optimal stochastic mixing. We tested our new method on the eight independent IRT models that comprise the Work Disability Functional Assessment Battery (WD-FAB), comparing it to prior art. We found that our stochastic methodology had superior item exposure while not compromising test accuracy or efficiency.

Paper Structure

This paper contains 16 sections, 12 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Ability estimate discrepancy $\mathcal{D}(\pi(\theta|\mathbf{x}) \parallel \pi(\theta|\mathbf{x}_t))$ (mean and middle 80% interval) conditional on score $\theta$ used to generate response sets, by scale, item selection method, and test length $t$, for mental function scales of the WD-FAB. Lower is better.
  • Figure 2: Absolute error in means ($|\int\theta\pi(\theta|\mathbf{x}_t)\mathrm{d}\theta - \int\theta\pi(\theta|\mathbf{x})\mathrm{d}\theta|$) (mean and middle 80% interval) conditional on true score $\theta$ by scale, item selection method, and test length $t$, for mental function scales of the WD-FAB. Lower is better.
  • Figure 3: Standard deviation of ability estimates ($\sqrt{\textrm{Var}_{t}(\theta)}$) (mean and middle 80% interval) conditional on true score $\theta$ by scale and item selection method, for mental function scales of the WD-FAB. Used as the stopping criterion for CAT. Lower is better.
  • Figure 4: Item exposure statistics (mean and middle 80% interval), for each of the given item selection methods across a given number of CAT sessions, for mental function scales of the WD-FAB. The dashed line represents the maximum possible exposure per scale. Higher is better.
  • Figure S1: Model discrepancy $\mathcal{D}(\pi(\theta|\mathbf{x}) \parallel \pi(\theta|\mathbf{x}_t))$ (mean and middle 80% interval) conditional on score $\theta$ used to generate response sets, by scale, item selection method, and test length $t$, for physical function scales of the WD-FAB. Lower is better.
  • ...and 4 more figures