Adaptive Off-Policy Inference for M-Estimators Under Model Misspecification

James Leiner, Robin Dunn, Aaditya Ramdas

TL;DR

This work develops a framework for valid off-policy inference in adaptive settings with potential model misspecification by targeting a fixed evaluation policy $\pi_e$ and introducing the MAIPWM estimator, which augments $M$-estimators with predictive models. A central limit theorem is proved under mild, verifiable conditions, using time-varying, data-driven variance estimators to stabilize the score's variance when action probabilities do not converge. The authors provide practical covariance-estimation strategies based on external data or sequential sample splitting and demonstrate nominal coverage on semi-synthetic Osteoarthritis Initiative data where traditional methods fail. The results hold even when the data-collecting treatment policy is unstable or non-convergent, highlighting the method's robustness and potential for reliable inference in real-world adaptive experiments.

Abstract

When data are collected adaptively, such as in bandit algorithms, classical statistical approaches such as ordinary least squares and $M$-estimation will often fail to achieve asymptotic normality. Although recent lines of work have modified the classical approaches to ensure valid inference on adaptively collected data, most of these works assume that the model is correctly specified. The misspecified setting poses unique challenges because the parameter of interest itself may not be well-defined over a non-stationary distribution of rewards. We therefore tackle the problem of off-policy inference in adaptive settings, where we uniquely define a projected solution over a stationary evaluation policy. Our method provides valid inference for $M$-estimators that use adaptively collected bandit data with a possibly misspecified working model. A key ingredient in our approach is the use of flexible approaches to stabilize the variance induced by adaptive data collection. A major novelty is that the procedure enables the construction of valid confidence sets even in settings where treatment policies are unstable and non-converging, such as when there is no unique optimal arm and standard bandit algorithms are used. Empirical results on semi-synthetic datasets constructed from the Osteoarthritis Initiative demonstrate that the method maintains type I error control, while existing methods for inference in adaptive settings do not cover in the misspecified case.
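To make the off-policy target concrete, the following is a minimal sketch, not the paper's MAIPWM estimator. It collects data with an epsilon-greedy bandit and then fits a misspecified linear working model by least squares, weighting each observation by the ratio of evaluation-policy to behavior-policy probabilities. Everything specific here is an assumption made for illustration: the quadratic mean reward `a**2`, the three-action space, the uniform evaluation policy `pi_e`, the exploration rate `eps`, and the function name `simulate_and_estimate`. The sketch produces only a point estimate; it does not implement the variance stabilization or the confidence sets described above.

```python
import random


def simulate_and_estimate(T=20000, eps=0.3, seed=0):
    """Importance-weighted least squares on epsilon-greedy bandit data.

    Hypothetical setup: actions {0, 1, 2}, mean reward a**2 (so a linear
    working model is misspecified), uniform evaluation policy pi_e.
    Returns the weighted least-squares estimate (theta0, theta1).
    """
    rng = random.Random(seed)
    actions = [0, 1, 2]
    pi_e = {a: 1.0 / len(actions) for a in actions}  # fixed evaluation policy

    reward_sums = {a: 0.0 for a in actions}
    counts = {a: 0 for a in actions}

    # Running weighted sums needed for the weighted normal equations.
    sw = swa = swy = swaa = sway = 0.0

    for _ in range(T):
        # Behavior policy depends on the history: epsilon-greedy on running means.
        means = {a: reward_sums[a] / counts[a] if counts[a] else 0.0 for a in actions}
        greedy = max(actions, key=lambda a: means[a])
        pi_t = {a: eps / len(actions) + (1.0 - eps) * (a == greedy) for a in actions}

        a = rng.choices(actions, weights=[pi_t[b] for b in actions])[0]
        y = a ** 2 + rng.gauss(0.0, 1.0)  # assumed reward model
        reward_sums[a] += y
        counts[a] += 1

        # Importance weight that reweights the sample toward pi_e.
        w = pi_e[a] / pi_t[a]
        sw += w
        swa += w * a
        swy += w * y
        swaa += w * a * a
        sway += w * a * y

    # Closed-form weighted least squares for y ~ theta0 + theta1 * a.
    a_bar, y_bar = swa / sw, swy / sw
    theta1 = (sway / sw - a_bar * y_bar) / (swaa / sw - a_bar ** 2)
    theta0 = y_bar - theta1 * a_bar
    return theta0, theta1


theta0_hat, theta1_hat = simulate_and_estimate()
print(theta0_hat, theta1_hat)
```

Under these assumptions the population least-squares projection with respect to the uniform evaluation policy is $(\theta_0^{\star}, \theta_1^{\star}) = (-1/3, 2)$, so the printed estimates should land near those values even though the behavior policy concentrates on the best arm.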

Paper Structure

This paper contains 40 sections, 10 theorems, 122 equations, 6 figures, 1 algorithm.

Key Result

Lemma 1

Assume $\mathcal{G}_{\Theta}=\{ g_{\theta}(X_t,A_{t},Y_{t}) : \theta \in \Theta \}$ is a class of functions such that, for any $\epsilon >0$, the bracketing number satisfies $N_{[]}(\epsilon, \mathcal{G}_{\Theta}, L_{2}(\mathcal{P},\pi_{e})) < \infty$. Define […]. Assume that there exists a constant $L$, not dependent on $t$, such that […] almost surely. Then under Assumptions assumption:potential_outcomes-assumptio…

Figures (6)

  • Figure 1: Illustration for Example 1. Under the misspecified linear model $m_{\theta} = -(Y_{t} - \theta_{0} - \theta_{1}A_{t})^{2}$, $(\theta_{0}^{\star},\theta_{1}^{\star}) = (-0.2, 4)$ for the first policy and $(\theta_{0}^{\star},\theta_{1}^{\star}) = (-5.3, 15)$ for the second policy. Intuitively, the misspecified model will try to estimate a secant line at different points of the quadratic function, resulting in very different interpretations for the target parameter. As such, the choice of evaluation policy is critical for properly interpreting the target parameter.
  • Figure 2: Confidence intervals constructed from the central limit theorem with ML-based estimates of the variance cover in all scenarios. GLMs using naive inverse propensity weighting often undercover, especially in situations where the assignment probabilities vary substantially over time. We note that both sample splitting and external data reuse for covariance estimation are valid, but sample splitting yields significantly wider confidence intervals. Reusing the same data for variance estimation and parameter estimation performs similarly to using external data, though we lack theoretical guarantees for this method.
  • Figure 3: Simulation results for scenario 1, where the model is correctly specified. In these settings, we see that all methods cover, but the naive approaches (IPW and SQ-IPW) often need significantly more samples to reach the nominal coverage rates. Thompson sampling undercovers for several methods when propensity scores are not clipped, consistent with theory.
  • Figure 4: Simulation results for scenario 2, where the model is correctly specified but there is not a unique optimal arm. The results are broadly similar to those in Figure 3, though we note there is no theoretical guarantee that IPW estimates should cover in this setting (though they do empirically).
  • Figure 5: Simulation results for scenario 3, when there is model misspecification but homoskedastic errors. IPW and SQ-IPW estimators now do not cover in every scenario, as their theoretical guarantees depend on correct specification of the working models.
  • ...and 1 more figure
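The dependence of the target parameter on the evaluation policy, as described in the Figure 1 caption, can be sketched numerically. The example below is hypothetical and does not reproduce the values reported in the caption: it assumes a mean reward of $a^2$ and two evaluation policies that are uniform over different pairs of actions, then computes the population least-squares projection $(\theta_0^{\star}, \theta_1^{\star})$ for each. The names `projection` and `quadratic_mean` are introduced here for illustration only.

```python
def quadratic_mean(a):
    """Assumed mean reward; chosen so that a linear working model is misspecified."""
    return a ** 2


def projection(policy, mean_reward):
    """Population least-squares (theta0, theta1) under an evaluation policy.

    `policy` maps each action to its probability under pi_e. The minimizer of
    E[(Y - theta0 - theta1 * A)^2] depends on the reward only through its mean.
    """
    e_a = sum(p * a for a, p in policy.items())
    e_y = sum(p * mean_reward(a) for a, p in policy.items())
    cov = sum(p * (a - e_a) * (mean_reward(a) - e_y) for a, p in policy.items())
    var = sum(p * (a - e_a) ** 2 for a, p in policy.items())
    theta1 = cov / var
    theta0 = e_y - theta1 * e_a
    return theta0, theta1


# Two hypothetical evaluation policies over different regions of the action space.
policy_low = {0: 0.5, 1: 0.5}
policy_high = {2: 0.5, 3: 0.5}

print(projection(policy_low, quadratic_mean))   # secant line through (0, 0) and (1, 1)
print(projection(policy_high, quadratic_mean))  # secant line through (2, 4) and (3, 9)
```

With these assumed inputs the first policy gives intercept 0 and slope 1, while the second gives intercept -6 and slope 5. The same working model therefore describes two different secant lines of the same quadratic, which is why the evaluation policy has to be fixed before the target parameter can be interpreted.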

Theorems & Definitions (20)

  • Example 1: Effect of $\pi_{e}$ on $\theta^{\star}$
  • Remark 1: Average Treatment Effect
  • Lemma 1
  • Theorem 1
  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Remark 2
  • ...and 10 more