Cramming Contextual Bandits for On-policy Statistical Evaluation

Zeyang Jia; Kosuke Imai; Michael Lingzhi Li

TL;DR

This work addresses the challenge of evaluating the final learned policy of a contextual bandit algorithm using the same adaptively collected data. It introduces cram, a general on-policy evaluation framework that telescopes policy-value improvements across successive updates and estimates each improvement by inverse probability weighting (IPW) on future data, yielding a single, computationally efficient estimator of the final policy value. Under a mild stability condition on the learning algorithm, the crammed estimator is consistent and asymptotically normal, with a consistently estimable variance, enabling valid confidence intervals; the stability condition is shown to hold for standard linear contextual bandits, including $\epsilon$-greedy, Thompson Sampling, and UCB. Empirical results on synthetic and real-world datasets show that cram reduces evaluation RMSE by about 40% relative to off-policy sample-splitting methods while maintaining unbiasedness and proper confidence-interval coverage, suggesting substantial practical gains for evaluating deployed bandit policies.

Abstract

We introduce the cram method as a general statistical framework for evaluating the final learned policy from a multi-armed contextual bandit algorithm, using the dataset generated by the same bandit algorithm. The proposed on-policy evaluation methodology differs from most existing methods that focus on off-policy performance evaluation of contextual bandit algorithms. Cramming utilizes an entire bandit sequence through a single pass of data, leading to both statistically and computationally efficient evaluation. We prove that if a bandit algorithm satisfies a certain stability condition, the resulting crammed evaluation estimator is consistent and asymptotically normal under mild regularity conditions. Furthermore, we show that this stability condition holds for commonly used linear contextual bandit algorithms, including epsilon-greedy, Thompson Sampling, and Upper Confidence Bound algorithms. Using both synthetic and publicly available datasets, we compare the empirical performance of cramming with the state-of-the-art methods. The results demonstrate that the proposed cram method reduces the evaluation standard error by approximately 40% relative to off-policy evaluation methods while preserving unbiasedness and valid confidence interval coverage.
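
The single-pass idea can be written as a telescoping sum. The identity below follows directly from the definition $\Delta(\pi, \pi') = V(\pi) - V(\pi')$; the IPW form of each estimated term is only a generic sketch, and the symbols $X_j$, $A_j$, $Y_j$ (context, chosen arm, reward of batch $j$) and $e_j$ (the probability with which the bandit algorithm chose $A_j$) are notation assumed here for illustration. The paper's own estimator may use different weights or normalization.

```latex
\begin{align*}
V(\hat{\pi}_{T-1})
  &= V(\hat{\pi}_1)
     + \sum_{t=2}^{T-1} \bigl\{ V(\hat{\pi}_t) - V(\hat{\pi}_{t-1}) \bigr\}
   = V(\hat{\pi}_1)
     + \sum_{t=2}^{T-1} \Delta(\hat{\pi}_t, \hat{\pi}_{t-1}), \\
% Generic IPW estimate of one term, using only batches collected after step t
% (assumed form, not necessarily the paper's exact definition):
\widehat{\Delta}(\hat{\pi}_t, \hat{\pi}_{t-1})
  &= \frac{1}{T-t} \sum_{j=t+1}^{T}
     \frac{Y_j \bigl\{ \hat{\pi}_t(A_j \mid X_j) - \hat{\pi}_{t-1}(A_j \mid X_j) \bigr\}}
          {e_j(A_j \mid X_j)}.
\end{align*}
```

Because $\hat{\pi}_t$ and $\hat{\pi}_{t-1}$ depend only on the first $t$ batches, each term is evaluated on data the two policies have not yet seen.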

Paper Structure (53 sections, 14 theorems, 254 equations, 6 figures, 5 algorithms)

Introduction
Problem formulation
The cram method
Statistical inference after cramming
Crammed policy evaluation estimator
Assumptions on a bandit algorithm
Consistency and asymptotic normality
Common linear bandit algorithms
Numerical experiments
Synthetic data
Real-world data
Concluding remarks
Proof of Theorem \ref{thm:L1Consistency}
Proof of Theorem \ref{thm:Asymptotic_Normality}
Proof of Condition \ref{cond:vt_lower_bound}
...and 38 more sections

Key Result

Theorem 1

Suppose that a sequence of learned policies $\{\hat{\pi}_t\}_{t=1}^T$ satisfies Assumption \ref{ass:learning_rate}. Then, under Assumptions \ref{ass:stationarity}, \ref{ass:clip_rate}, \ref{ass:bounded}, and \ref{ass:fourth_moment}, we have,

Figures (6)

Figure 1: A schematic illustration of the cram method. A contextual bandit algorithm uses the first batch of data to obtain the first learned policy $\hat{\pi}_1$, and we estimate the value of this policy, $V(\hat{\pi}_1)$, using the remaining $T-1$ batches. Next, the bandit algorithm updates this learned policy with the second batch, yielding the updated learned policy $\hat{\pi}_2$. We then estimate the value difference between these two policies, $\Delta(\hat{\pi}_2, \hat{\pi}_1)=V(\hat{\pi}_2)-V(\hat{\pi}_1)$, using the remaining $T-2$ batches of data. Repeating this update-and-test process leads to the final learned policy $\hat{\pi}_{T-1}$ and to the estimate of the final performance improvement $\Delta(\hat{\pi}_{T-1}, \hat{\pi}_{T-2})$, which is based on the last ($T$th) batch. Finally, summing these performance difference estimates yields the estimated value of the final learned policy.
Figure 2: Learning and evaluation performance of cram and sample splitting for synthetic data. For sample splitting, we use the non-contextual variance stabilization weights of \citet{zhan2021off}. The parameters are set to $T=100$ (sample size), $\eta=0.5$ (clipping rate), and $\beta=0.5$ (signal strength). See Appendix \ref{app:additional_simulation} for additional parameter settings. For sample splitting, we consider 80--20% and 60--40% train-and-test splits. The figure shows policy value relative to the oracle (a. top left), bias relative to true policy value (b. top right), RMSE relative to cram (c. bottom left), and empirical coverage of 95% confidence intervals (d. bottom right).
Figure 3: Learning and evaluation performance of cram and sample splitting for real-world data. For sample splitting, we use the non-contextual variance stabilization weights of \citet{zhan2021off} and consider an 80--20% train-and-test split. We use AWAIPW-NS weights as an example; the results for other weights are similar. The figure shows policy value based on sample splitting relative to cram (a. top left), RMSE of sample splitting relative to cram (b. top right), and empirical coverage of 95% confidence intervals for $\epsilon$-greedy (c. bottom left) and UCB (d. bottom right).
Figure S1: Evaluation performance as a function of bandit sequence length $T$. The data are collected with the Thompson Sampling algorithm. We compare cram evaluation vs. 80--20% sample splitting.
Figure S2: Evaluation performance for different bandit algorithms and signal sizes. The bar plot (left y-axis) shows the RMSE, and the line plot (right y-axis) shows the coverage of the 95% confidence interval. The decay rate $\eta$ is set to 0.
...and 1 more figure
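
The update-and-test scheme described in the Figure 1 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: each batch is taken to be a single observation, the estimator uses plain IPW terms, and the function name `crammed_ipw_estimate` and its array arguments are hypothetical.

```python
import numpy as np


def crammed_ipw_estimate(rewards, behavior_probs, policy_probs):
    """Sketch of a crammed IPW value estimate (assumed form, batch size 1).

    rewards[j]         : observed reward Y_j of observation j (0-indexed)
    behavior_probs[j]  : probability with which the bandit algorithm chose
                         the logged arm A_j at collection time
    policy_probs[t-1,j]: pi_hat_t(A_j | X_j), the probability that the t-th
                         learned policy (t = 1..T-1) assigns to the logged arm
    """
    rewards = np.asarray(rewards, dtype=float)
    behavior_probs = np.asarray(behavior_probs, dtype=float)
    policy_probs = np.asarray(policy_probs, dtype=float)
    T = rewards.shape[0]
    if policy_probs.shape != (T - 1, T):
        raise ValueError("policy_probs must have shape (T-1, T)")

    # Inverse-probability-weighted rewards Y_j / e_j(A_j | X_j).
    weighted = rewards / behavior_probs

    # Value of the first learned policy, estimated on observations 2..T.
    estimate = np.mean(weighted[1:] * policy_probs[0, 1:])

    # Add the estimated improvement of each update t = 2..T-1, using only
    # observations t+1..T, which neither policy has been trained on.
    for t in range(2, T):
        diff = policy_probs[t - 1, t:] - policy_probs[t - 2, t:]
        estimate += np.mean(weighted[t:] * diff)
    return float(estimate)


# Toy usage with T = 3 logged observations and two learned policies.
value = crammed_ipw_estimate(
    rewards=[1.0, 2.0, 3.0],
    behavior_probs=[0.5, 0.5, 0.5],
    policy_probs=[[0.5, 0.5, 0.5], [1.0, 1.0, 0.0]],
)
```

Passing the policy probabilities as a precomputed array keeps the sketch independent of any particular bandit algorithm; in practice they would be recorded as the algorithm updates.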

Theorems & Definitions (16)

Definition 1: The crammed IPW policy evaluation estimator
Theorem 1: Consistency
Theorem 2: Asymptotic normality
Definition 2: The crammed variance estimator
Theorem 3: Consistency of the crammed variance estimator
Corollary 1: Asymptotic confidence intervals
Corollary 2: Asymptotically negligible final policy difference
Theorem 4: Stability condition under linear contextual bandit algorithms
Lemma 1
Lemma 2
...and 6 more
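
Asymptotic normality together with a consistent variance estimate is what licenses normal-approximation confidence intervals of the kind named in Corollary 1. The sketch below shows only that generic construction. It assumes the point estimate and its standard error have already been computed (the crammed variance estimator of Definition 2 is not reproduced here, and any scaling by $T$ is assumed to be folded into `std_error`); the function name `wald_interval` is hypothetical.

```python
from statistics import NormalDist


def wald_interval(estimate, std_error, level=0.95):
    """Two-sided normal-approximation interval: estimate +/- z * std_error."""
    if not 0.0 < level < 1.0:
        raise ValueError("level must lie strictly between 0 and 1")
    if std_error < 0.0:
        raise ValueError("std_error must be non-negative")
    # Upper (1 + level) / 2 quantile of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


# Illustrative inputs only; these numbers are not results from the paper.
lower, upper = wald_interval(estimate=1.0, std_error=0.1)
```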
