Off-policy evaluation beyond overlap: partial identification through smoothness

Samir Khan, Martin Saveski, Johan Ugander

TL;DR

This work considers a strategy based on partial identification under non-parametric assumptions on the conditional mean function, focusing especially on Lipschitz smoothness, and forms a pair of linear programs whose optimal values upper and lower bound the contributions of the no-overlap region to the off-policy value.

Abstract

Off-policy evaluation (OPE) is the problem of estimating the value of a target policy using historical data collected under a different logging policy. OPE methods typically assume overlap between the target and logging policy, enabling solutions based on importance weighting and/or imputation. In this work, we approach OPE without assuming either overlap or a well-specified model by considering a strategy based on partial identification under non-parametric assumptions on the conditional mean function, focusing especially on Lipschitz smoothness. Under such smoothness assumptions, we formulate a pair of linear programs whose optimal values upper and lower bound the contributions of the no-overlap region to the off-policy value. We show that these linear programs have a concise closed form solution that can be computed efficiently and that their solutions converge, under the Lipschitz assumption, to the sharp partial identification bounds on the off-policy value. Furthermore, we show that the rate of convergence is minimax optimal, up to log factors. We deploy our methods on two semi-synthetic examples, and obtain informative and valid bounds that are tighter than those possible without smoothness assumptions.
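One way to picture the pair of linear programs is sketched below. The notation is chosen here for illustration and is not taken from the paper's own equations: $t_i$ is a candidate value of the conditional mean at sample point $X_i$, $\hat{\mu}$ is a fitted outcome model, $d$ is a metric on the covariate space, $\mathcal{O}$ and $\mathcal{N}$ are hypothetical index sets for the overlap and no-overlap points, and $w_j \ge 0$ are hypothetical weights with which the no-overlap points enter the policy value.

```latex
\begin{aligned}
\text{(lower bound)}\qquad \min_{t}\quad & \sum_{j \in \mathcal{N}} w_j\, t_j \\
\text{subject to}\quad & t_i = \hat{\mu}(X_i), && i \in \mathcal{O}, \\
& |t_i - t_j| \le L\, d(X_i, X_j), && \text{for all pairs } i, j.
\end{aligned}
```

Under these assumptions the upper bound would come from the same program with $\min$ replaced by $\max$. Each absolute-value constraint is a pair of linear inequalities, so both problems remain linear programs.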

Paper Structure (35 sections, 9 theorems, 77 equations, 6 figures, 3 tables)

Key Result

Theorem 1

Suppose that $\hat{\psi}_1$ is a consistent estimator of $\psi_1(P_0)$, and that $\hat{\psi}_2^-$ and $\hat{\psi}_2^+$ are consistent estimators of $\psi_2^-$ and $\psi_2^+$, respectively. The

Figures (6)

  • Figure 1: A visualization of our approach on a toy problem with three data points in a one-dimensional covariate space. The data are collected using a behavior policy $\pi_b$. The function $\hat{\mu}(x)$ (dashed black) is estimated based on the points observed in the overlap region where $\pi_b>0$ and treatment probability is positive, so we are unsure whether it is accurate in the no-overlap region where $\pi_b=0$ and treatment probability is zero (red). Rather than directly using the predictions of $\hat{\mu}$ in the no-overlap region, we assume the true outcome is $L$-Lipschitz and use this assumption to provide upper and lower bounds.
  • Figure 2: An example providing intuition for why, at the optimal solution of \ref{eq:estim_lp}, constraints between pairs of points that both lie in the no-overlap region are not active in the LP. The $i=1$ point (black) is in the overlap region, and the $i=2,3$ points (red) are in the no-overlap region. We must have $t_1=\hat{\mu}(X_1)$. The smoothness constraints between points 1 and 2 (and between points 1 and 3) imply bounds on $t_2$ (and $t_3$), with lower bounds shown along the blue arrows. Also shown is the lower bound implied on $t_3$ by $t_2$ when $t_2$ is set to $\hat{\mu}(X_1)-Ld(X_1, X_2)$ (the choice that minimizes the objective). Since $d(X_1, X_2)+d(X_2, X_3)>d(X_1, X_3)$, by the triangle inequality, the bound implied on $t_3$ by $t_2$'s bound from $t_1$ is always looser than the bound implied directly by $t_1$. Visually, the bound in dark blue is always tighter than the bound in light blue.
  • Figure 3: Partial identification bounds under the assumption that $P_0\in \mathcal{P}_L^{\mathrm{Lip}}\cap \mathcal{P}_{0,1}^{\mathrm{bdd}}$ (black), Manski bounds under the assumption that $P_0\in \mathcal{P}_{0,1}^{\mathrm{bdd}}$ (grey), and pure imputation estimates (blue) of the value of the policy $\pi^{(T)}$ estimated using data from the behavior policy $\pi^{(0.5)}$ for a range of $T$ and $L$. Also shown are the point estimate and confidence intervals estimated using infeasible sample data from the uniform behavior policy. We see that the pure imputation estimate overestimates the value of $\pi^{(T)}$, but that our partial identification bounds correct for this. Crucially, the width of our partial identification interval increases as the model estimate and infeasible sample estimate diverge, meaning that we are correctly adjusting for the model's ability to extrapolate into the no-overlap region. For reference, there are overlap violations for 16.8%, 10.8%, 7.0%, and 4.3% of the points when $T=0.25$, $T=0.3$, $T=0.35$, and $T=0.4$ respectively.
  • Figure 4: Visualization of results from Table \ref{tab:yeast} for the values of $L$ for which the optimization problem is consistently feasible and thus the smoothness assumption is plausible. We see that as $n\to \infty$, the coverage (as defined in Theorem \ref{thm:interval_consistency} with $\epsilon = 0.01$) approaches 100%.
  • Figure 5: The same experiment as in Figure \ref{fig:yahoo_ope_manski} over a wider range of values of $T$ and $L$. With this range of values, we see convergence of our bounds to the Manski bounds as $L$ grows large, and also that when $T=0.5$ and there are no longer any overlap violations, our intervals have width zero as expected.
  • ...and 1 more figure
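
The intuition in the Figure 2 caption, that only constraints anchored at overlap points matter, suggests pointwise bounds that can be computed without a solver. The following is a minimal sketch under assumptions of ours, not the authors' implementation: the function name `lipschitz_bounds`, the Euclidean choice for the metric $d$, and the optional clipping to a known outcome range `y_range` (in the spirit of the boundedness assumption in Figure 3) are all illustrative.

```python
import numpy as np


def lipschitz_bounds(x_overlap, mu_overlap, x_no_overlap, lip, y_range=None):
    """Pointwise bounds at no-overlap points implied by L-Lipschitz smoothness.

    x_overlap:    covariates of overlap points, shape (n_o,) or (n_o, dim)
    mu_overlap:   fitted outcome-model values at the overlap points, shape (n_o,)
    x_no_overlap: covariates of no-overlap points, shape (n_n,) or (n_n, dim)
    lip:          assumed Lipschitz constant L
    y_range:      optional (low, high) range known to contain the outcome
    """
    x_o = np.asarray(x_overlap, dtype=float).reshape(len(x_overlap), -1)
    x_n = np.asarray(x_no_overlap, dtype=float).reshape(len(x_no_overlap), -1)
    mu = np.asarray(mu_overlap, dtype=float)

    # Pairwise distances d(X_j, X_i), shape (n_n, n_o).
    # Euclidean distance is an assumption; any metric could be swapped in.
    dist = np.linalg.norm(x_n[:, None, :] - x_o[None, :, :], axis=-1)

    # Each overlap point i implies mu_i - L*d <= t_j <= mu_i + L*d.
    # The tightest bounds take the max (lower) and min (upper) over i.
    lower = np.max(mu[None, :] - lip * dist, axis=1)
    upper = np.min(mu[None, :] + lip * dist, axis=1)

    if y_range is not None:
        # Intersect with the range bounds when the outcome is known to be bounded.
        lower = np.maximum(lower, y_range[0])
        upper = np.minimum(upper, y_range[1])
    return lower, upper


# Toy one-dimensional data: two overlap points and two no-overlap points.
x_overlap = [0.0, 1.0]
mu_overlap = [0.5, 0.7]
x_no_overlap = [2.0, 3.0]
lower, upper = lipschitz_bounds(
    x_overlap, mu_overlap, x_no_overlap, lip=0.1, y_range=(0.0, 1.0)
)
print(lower, upper)
```

On this toy input the nearer overlap point at $x=1$ determines the lower bounds (about 0.6 and 0.5), while the upper bounds (about 0.7 and 0.8) come from the point at $x=0$, whose smaller fitted value outweighs its larger distance. If the fitted values at the overlap points are themselves not $L$-Lipschitz, a lower bound can exceed the corresponding upper bound, which signals that the chosen $L$ is too small for the data.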

Theorems & Definitions (19)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Proof of Lemma \ref{thm:interval_consistency}
  • Theorem 5
  • Proof of Theorem \ref{thm:mu_hat_lip_gen}(a)
  • Proof of Theorem \ref{thm:mu_hat_lip_gen}(b)
  • Lemma 1
  • Proof of Lemma \ref{lem:sup_diff}
  • ...and 9 more