Cramming Contextual Bandits for On-policy Statistical Evaluation
Zeyang Jia, Kosuke Imai, Michael Lingzhi Li
TL;DR
This work addresses the problem of evaluating the final policy learned by a contextual bandit algorithm using the same adaptively collected data. It introduces cram, a general on-policy evaluation framework that telescopes policy-value improvements across successive updates and estimates each improvement with inverse probability weighting (IPW) on subsequent data, yielding a single, computationally efficient estimator of the final policy value. Under a mild stability condition on the learning algorithm, the crammed estimator is consistent and asymptotically normal, with a consistently estimable variance, which enables valid confidence intervals; the stability condition is shown to hold for standard linear contextual bandits, including $\epsilon$-greedy, Thompson Sampling, and UCB. Empirical results on synthetic and publicly available datasets show that cram reduces the evaluation standard error by about 40% relative to off-policy evaluation methods while maintaining unbiasedness and valid confidence interval coverage, suggesting substantial practical gains for evaluating deployed bandit policies.
Abstract
We introduce the cram method as a general statistical framework for evaluating the final learned policy from a multi-armed contextual bandit algorithm, using the dataset generated by the same bandit algorithm. The proposed on-policy evaluation methodology differs from most existing methods, which focus on off-policy performance evaluation of contextual bandit algorithms. Cramming uses the entire bandit sequence in a single pass over the data, leading to both statistically and computationally efficient evaluation. We prove that if a bandit algorithm satisfies a certain stability condition, the resulting crammed evaluation estimator is consistent and asymptotically normal under mild regularity conditions. Furthermore, we show that this stability condition holds for commonly used linear contextual bandit algorithms, including epsilon-greedy, Thompson Sampling, and Upper Confidence Bound algorithms. Using both synthetic and publicly available datasets, we compare the empirical performance of cramming with state-of-the-art methods. The results demonstrate that the proposed cram method reduces the evaluation standard error by approximately 40% relative to off-policy evaluation methods while preserving unbiasedness and valid confidence interval coverage.
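To make the telescoping idea concrete, the following is a minimal illustrative sketch, not the estimator as defined in the paper. It assumes one policy update per round, known logging propensities, and a table of the probabilities each intermediate policy assigns to the logged actions. The function name `cram_value_gain` and its array layout are hypothetical. Each improvement from one policy to the next is estimated by IPW on the rounds that follow that update; because the last update has no later data, the sketch only targets the gain up to the second-to-last policy. It is written for clarity with a nested pass over the data and does not attempt to reproduce the single-pass efficiency described above.

```python
import numpy as np


def cram_value_gain(policy_probs, logging_probs, rewards):
    """Illustrative telescoped IPW estimate of a policy-value gain.

    Assumed (hypothetical) inputs for T rounds:
      policy_probs[t, j]: probability that the policy after t updates
                          assigns to the action logged in round j (0-indexed),
                          for t = 0, ..., T-1.
      logging_probs[j]:   probability with which the logged action in
                          round j was actually chosen.
      rewards[j]:         observed reward in round j.
    Returns an estimate of V(pi_{T-1}) - V(pi_0).
    """
    policy_probs = np.asarray(policy_probs, dtype=float)
    logging_probs = np.asarray(logging_probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    T = rewards.shape[0]

    # Inverse-probability-weighted rewards, one per round.
    ipw_rewards = rewards / logging_probs

    gain = 0.0
    for t in range(1, T):
        # Change in action probability between consecutive policies,
        # evaluated only on rounds collected after update t.
        diff = policy_probs[t, t:] - policy_probs[t - 1, t:]
        # IPW estimate of V(pi_t) - V(pi_{t-1}) from that future data.
        gain += np.mean(diff * ipw_rewards[t:])
    return gain


# Toy usage with two rounds: only the step from pi_0 to pi_1 contributes,
# and it is evaluated on the second round alone.
probs = [[0.5, 0.5],
         [0.5, 0.8]]
estimate = cram_value_gain(probs, logging_probs=[0.5, 0.5], rewards=[0.0, 1.0])
```

If every intermediate policy is identical, all differences vanish and the estimated gain is zero, which is a quick sanity check on the telescoping structure.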
