Anytime-valid off-policy inference for contextual bandits

Ian Waudby-Smith; Lili Wu; Aaditya Ramdas; Nikos Karampatziakis; Paul Mineiro

TL;DR

This work tackles off-policy evaluation in contextual bandits under adaptive, sequential data collection by developing anytime-valid confidence sequences (CSs) that require only nonparametric assumptions and accommodate predictable (data-dependent) logging policies. It introduces doubly robust pseudo-outcomes to tighten CSs for fixed policy values, derives closed-form CSs and fixed-time CIs, and extends to time-varying policy values with both empirical Bernstein and iterated-logarithm CSs. The paper also constructs time-uniform, quantile-uniform bands for the off-policy CDF and connects the OPE framework to causal inference in adaptive experiments, including sequential testing with anytime p-values. Collectively, these results enable robust, nonparametric inference that remains valid at arbitrary stopping times, covering policy values and distributional properties in dynamically evolving contextual-bandit settings, with extensions to FDR control and privacy-preserving OPE. The techniques rely on martingale-based constructions that remain valid without knowledge of the maximal importance weight and adapt to empirical variance, making them practical for real-time decision making and gated deployment scenarios.

Abstract

Contextual bandit algorithms are ubiquitous tools for active sequential experimentation in healthcare and the tech industry. They involve online learning algorithms that adaptively learn policies over time to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data, a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes unnecessary conditions made in some past works, significantly improving on them both theoretically and empirically. Importantly, our methods can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions form a highly dependent time series (such as if they are drifting over time). More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire cumulative distribution function of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times, (b) only make nonparametric assumptions, (c) do not require importance weights to be uniformly bounded and, if they are, do not require knowledge of these bounds, and (d) adapt to the empirical variance of our estimators. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.

Paper Structure (47 sections, 13 theorems, 108 equations, 6 figures, 2 tables, 1 algorithm)

Introduction
Off-policy inference, confidence intervals, and confidence sequences
Desiderata for anytime-valid off-policy inference
Why allow for logging policies to be predictable?
Why not rely on knowledge of $w_\mathrm{max}$?
Outline and contributions
Related work
Notation: supermartingales, filtrations, and stopping times
Warmup: Off-policy inference for constant policy values
Tighter confidence sequences via doubly robust pseudo-outcomes
Tuning, truncating, and mirroring
Closed-form confidence sequences
Fixed-time confidence intervals
Confidence intervals for policy values
Inference for time-varying policy values
...and 32 more sections

Key Result

Proposition 1

Suppose $(X_t, A_t, R_t)_{t=1}^\infty$ are iid with $[0, 1]$-valued rewards $(R_t)_{t=1}^\infty$, and the logging policy $h$ is fixed. For each $\nu' \in [0, 1]$, let $(\lambda_t^L(\nu'))_{t=1}^\infty$ be any $[0, 1/\nu')$-valued predictable sequence. Then $(L_t^\mathrm{IW})_{t=1}^\infty$ forms a lower $(1-\alpha)$-CS for $\nu$, meaning $\mathbb P(\forall t \in \mathbb N,\ \nu \geq L_t^\mathrm{IW}) \geq 1-\alpha$.
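A minimal sketch of how a betting-style lower CS of this kind can be computed. It assumes importance-weighted pseudo-outcomes $\phi_t = w_t R_t$ with $w_t = \pi(A_t \mid X_t)/h(A_t \mid X_t)$ and a lower boundary of the form $L_t = \inf\{\nu' \in [0,1] : \prod_{i \le t}(1 + \lambda_i^L(\nu')(\phi_i - \nu')) < 1/\alpha\}$. These formulas, the constant bet `lam`, the grid over $\nu'$, and the two-action simulation are illustrative assumptions, not the paper's tuned construction (Remark 1 discusses tuning).

```python
import numpy as np


def simulate_pseudo_outcomes(T=2000, seed=0):
    """Hypothetical two-action bandit with a fixed logging policy (no contexts)."""
    rng = np.random.default_rng(seed)
    h = np.array([0.5, 0.5])            # logging policy (assumed)
    pi = np.array([0.9, 0.1])           # target policy (assumed)
    mean_reward = np.array([0.7, 0.2])  # Bernoulli reward means per action
    actions = rng.choice(2, size=T, p=h)
    rewards = rng.binomial(1, mean_reward[actions])
    weights = pi[actions] / h[actions]  # importance weights w_t
    nu = float(pi @ mean_reward)        # true policy value
    return weights * rewards, nu


def betting_lower_cs(phi, alpha=0.05, lam=0.2, grid_size=201):
    """Lower CS from nonnegative pseudo-outcomes whose conditional mean is nu.

    Uses a constant bet lam in [0, 1), which lies in [0, 1/nu') for every
    candidate nu' in [0, 1], so each wealth process stays positive.
    """
    assert 0.0 <= lam < 1.0
    phi = np.asarray(phi, dtype=float)
    grid = np.linspace(0.0, 1.0, grid_size)  # candidate values nu'
    # Log-wealth increments log(1 + lam * (phi_t - nu')) for every t and nu'.
    increments = np.log1p(lam * (phi[:, None] - grid[None, :]))
    log_wealth = np.cumsum(increments, axis=0)
    # A candidate is rejected once its wealth reaches 1/alpha (Ville's inequality).
    not_rejected = log_wealth < np.log(1.0 / alpha)
    first_kept = np.argmax(not_rejected, axis=1)
    lower = np.where(not_rejected.any(axis=1), grid[first_kept], 1.0)
    # The running maximum keeps the bound monotone without losing validity.
    return np.maximum.accumulate(lower)


if __name__ == "__main__":
    phi, nu = simulate_pseudo_outcomes()
    lower = betting_lower_cs(phi)
    print(f"true value {nu:.3f}, lower bound after {len(phi)} rounds: {lower[-1]:.3f}")
```

A constant bet keeps the sketch short but leaves the bound looser than a data-driven predictable sequence would; the bound stays valid because the wealth process at the true $\nu$ is a nonnegative martingale for any predictable bet in the allowed range.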

Figures (6)

Figure 1: Three confidence sequences for policies with values $\nu = 0.6$ and $\nu = 0.1$. The first CS is built from importance-weighted pseudo-outcomes ("IW"), and the other two are built from doubly robust pseudo-outcomes ("DR") with $k$ taking values 1 and 2, respectively. In these examples, the reward $R_t$ can be predicted easily, a property that only the doubly robust CSs can exploit. Notice that a larger value of $k$ allows the doubly robust CS to become narrower for large $t$, but it pays for this adaptivity with wider bounds at small $t$. Nevertheless, all three CSs are time-uniform and nonasymptotically valid in both simulation scenarios.
Figure 2: Betting-based (Theorem 1) and predictable plug-in (PrPl) (Proposition 2) CSs for $\nu$ with both importance-weighted (IW) and doubly robust (DR) variants. Notice that for both IW and DR CSs, the betting-based approach of Theorem 1 outperforms the PrPl CSs. Nevertheless, the closed-form PrPl CSs are simpler to implement, and can still benefit from doubly robust variance adaptation.
Figure 3: Fixed-time 90% confidence intervals for $\nu$ using three different methods: a betting-based CI (\ref{corollary:betting-ci}), a predictable plug-in (PrPl) CI (\ref{corollary:prpl-ci}), and those presented in a paper entitled "High-confidence off-policy evaluation" (HCOPE15) by Thomas et al. (2015). Notice that the betting-based CI outperforms the closed-form PrPl CI, which itself significantly outperforms the bounds of Thomas et al. (2015).
Figure 4: Various CSs for the time-varying policy value $\widetilde{\nu}_t$. The left-hand side plot illustrates that while the betting-style CS of Theorem 1 is tight when $\widetilde{\nu}_t$ remains fixed, it fails to cover when $\widetilde{\nu}_t$ changes (in this case, there is an abrupt change at $t = 1000$). The right-hand side plot illustrates how Theorem 2 and Proposition 3 compare, both using their importance-weighted (IW) and doubly robust (DR) variants. Notice that while LIL-IW and LIL-DR attain optimal rates of convergence, the empirical Bernstein CSs (EB-IW and EB-DR) are much tighter in practice. In both cases, the DR variant outperforms the IW variant due to the reward being easy to predict in this particular example.
Figure 5: An illustration of how the anytime $p$-value derived in \ref{proposition:testing} can be used to test the weak null $H_0: \forall t,\ \Delta_t \leq 0$. In the left-hand side plot, notice that $\delta_t$ ventures above 0 at several points prior to $t = 2037$, but the average policy value difference is positive for the first time at $t=2037$. In the right-hand side plot, we see that the anytime $p$-value dips below $\alpha$ shortly after $\Delta_t > 0$, at which point the weak null can be safely rejected, with no penalties for the $p$-value having been continuously monitored.
...and 1 more figure
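The gap between the IW and DR variants in Figures 1, 2, and 4 comes from variance: when the reward is easy to predict, doubly robust pseudo-outcomes fluctuate far less around the policy value. The sketch below illustrates this with the textbook doubly robust form $\phi_t^{\mathrm{DR}} = w_t (R_t - \widehat{r}(X_t, A_t)) + \sum_a \pi(a \mid X_t)\, \widehat{r}(X_t, a)$. This form, the oracle reward predictor, and the simulated bandit are assumptions made for illustration; the paper's version additionally involves the truncation level $k$ and a truncated predictor (Remarks 1 and 2), which are omitted here.

```python
import numpy as np


def iw_and_dr_pseudo_outcomes(T=5000, seed=0):
    """Simulate a hypothetical bandit and return IW and DR pseudo-outcomes."""
    rng = np.random.default_rng(seed)
    h = np.array([0.5, 0.5])   # logging policy (assumed, context-free)
    pi = np.array([0.9, 0.1])  # target policy (assumed, context-free)
    # Mean reward r[x, a] for contexts x in {0, 1} and actions a in {0, 1}.
    r = np.array([[0.8, 0.3],
                  [0.6, 0.1]])
    x = rng.integers(0, 2, size=T)
    a = rng.choice(2, size=T, p=h)
    # Rewards are nearly deterministic given (x, a), so they are easy to predict.
    reward = np.clip(r[x, a] + rng.uniform(-0.05, 0.05, size=T), 0.0, 1.0)
    w = pi[a] / h[a]           # importance weights
    r_hat = r                  # oracle predictor (an assumption for illustration)
    phi_iw = w * reward
    phi_dr = w * (reward - r_hat[x, a]) + r_hat[x] @ pi
    nu = float(r.mean(axis=0) @ pi)  # true policy value under uniform contexts
    return phi_iw, phi_dr, nu


if __name__ == "__main__":
    phi_iw, phi_dr, nu = iw_and_dr_pseudo_outcomes()
    print(f"true value:  {nu:.3f}")
    print(f"IW mean/var: {phi_iw.mean():.3f} / {phi_iw.var():.4f}")
    print(f"DR mean/var: {phi_dr.mean():.3f} / {phi_dr.var():.4f}")
```

Both pseudo-outcomes average to the same policy value, but the DR version has much smaller variance in this setup, which is what lets variance-adaptive CSs built on it shrink faster.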

Theorems & Definitions (26)

Proposition 1: Scalar betting off-policy CS (Karampatziakis et al., 2021)
Theorem 1: Doubly robust betting off-policy CS
Remark 1: Tuning $(\lambda_t^L)_{t=1}^\infty$ and $(k_t)_{t=1}^\infty$
Remark 2: Why truncate the reward predictor $\widehat{r}_t$?
Remark 3: Mirroring trick for upper CSs
Proposition 2: Closed-form predictable plug-in CS for $\nu$
Corollary 1
Corollary 2
Theorem 2: Empirical Bernstein confidence sequence for $\widetilde{\nu}_t$
Proposition 3: Variance-adaptive iterated logarithm confidence sequence for $\widetilde{\nu}_t$
...and 16 more
