Wasserstein Distributionally Robust Policy Evaluation and Learning for Contextual Bandits

Yi Shen, Pan Xu, Michael M. Zavlanos

TL;DR

This work tackles distribution shifts in off-policy evaluation and learning for contextual bandits by replacing KL-based uncertainty with Wasserstein distributionally robust optimization (DRO). It develops a dual formulation for policy evaluation, introduces a regularized Wasserstein DRO to mitigate inner-optimization cost, and presents a biased stochastic-gradient method whose complexity is independent of the distribution support. The authors provide finite-sample convergence guarantees for both evaluation and learning, and demonstrate practical robustness on the International Stroke Trial dataset, including improved policies under shifts that KL-based approaches struggle to handle. The results establish Wasserstein DRO as a geometry-aware, scalable framework for robust offline policy evaluation and learning in high-stakes settings.

Abstract

Off-policy evaluation and learning are concerned with assessing a given policy and learning an optimal policy from offline data without direct interaction with the environment. Often, the environment in which the data are collected differs from the environment in which the learned policy is applied. To account for the effect of different environments during learning and execution, distributionally robust optimization (DRO) methods have been developed that compute worst-case bounds on the policy values assuming that the distribution of the new environment lies within an uncertainty set. Typically, this uncertainty set is defined based on the KL divergence around the empirical distribution computed from the logging dataset. However, the KL uncertainty set fails to encompass distributions with varying support and lacks awareness of the geometry of the distribution support. As a result, KL approaches fall short in addressing practical environment mismatches and lead to over-fitting to worst-case scenarios. To overcome these limitations, we propose a novel DRO approach that employs the Wasserstein distance instead. While Wasserstein DRO is generally more computationally expensive than KL DRO, we present a regularized method and a practical (biased) stochastic gradient descent method to optimize the policy efficiently. We also provide a theoretical analysis of the finite sample complexity and iteration complexity for our proposed method. We further validate our approach using a public dataset that was recorded in a randomized stroke trial.

Paper Structure

This paper contains 24 sections, 13 theorems, 45 equations, 5 figures, 4 tables, 1 algorithm.

Key Result

Lemma 2.5

Consider the dual problem in equation eq:dual_form. If $0\leq f(x)\leq f_{\max}$ for all $x\in\mathcal{X}$, then the optimal solution to equation eq:dual_form satisfies $\lambda^*\in[0, f_{\max}/\epsilon]$ and the optimal value of equation eq:dual_form attained at $\lambda^*$ satisfies $D^* \leq f_{

Figures (5)

  • Figure 1: Two artificial datasets that represent patients' contextual distributions, where the $x$-axis is the context support and the $y$-axis is the probability. In (a), the distributions $P$ and $Q$ represent the patients' ages and differ only at the ages 55 and 60. The KL divergence is $\text{KL}(Q||P)=+\infty$ since the two distributions have different supports, where the KL divergence between two discrete distributions $P$ and $Q$ is defined as $\text{KL}(Q||P)=\sum_{x\in\mathcal{X}} Q(x) \log(Q(x)/P(x)).$ In (b), the distributions $Q_1$ and $Q_2$ represent the patients' risk index, where a higher index is worse. They are equally distant from $P$ under the KL divergence, i.e., $\text{KL}(Q_1||P)=\text{KL}(Q_2||P)=0.21$. However, the distribution $Q_1$ represents a less challenging environment than the nominal distribution $P$, as the probabilities of encountering patients with higher risks in $Q_1$ are smaller than in both $P$ and $Q_2$. $Q_2$ represents an environment similar to $P$ and is indeed closer to $P$ than $Q_1$ under the Wasserstein distance, i.e., $W(P,Q_1)=2.07$ and $W(P,Q_2)=1.42$. See equation \ref{eq:wass_distance} for the Wasserstein distance definition.
  • Figure 2: Distributions
  • Figure 3: Policy evaluation convergence curves of the dual variables. The solid line and shades are averages and standard deviations over 20 runs.
  • Figure 4: Policy learning convergence results of the regularized Wasserstein DRO (BSGD) ($\epsilon_{\text{W}}=0.03$). The solid lines and shades are averages and standard deviations over 20 runs.
  • Figure 5: Policy learning convergence results of the Factor KL DRO (gradient descent) $(\epsilon_{\text{KL}} = 0.1)$. There is no variance since the gradient descent method is deterministic.
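
The contrast described in the Figure 1 caption (an infinite KL divergence under mismatched supports versus a finite, geometry-aware Wasserstein distance) can be reproduced on a toy example. The sketch below is illustrative only: the age grid and the probabilities are hypothetical and are not the data behind Figure 1, and the Wasserstein distance is assumed to be the order-1 distance with absolute-difference cost on a one-dimensional support, which may differ from the definition the paper uses.

```python
import math


def kl_divergence(q, p):
    """KL(Q||P) = sum_x Q(x) log(Q(x)/P(x)) for discrete distributions on a shared grid."""
    total = 0.0
    for qx, px in zip(q, p):
        if qx == 0.0:
            continue  # convention: 0 * log(0 / p) = 0
        if px == 0.0:
            return math.inf  # Q places mass where P has none
        total += qx * math.log(qx / px)
    return total


def wasserstein_1d(support, p, q):
    """Order-1 Wasserstein distance on a sorted 1-D grid (assumed cost |x - x'|).

    In one dimension this equals the integral of |CDF_P - CDF_Q|.
    """
    dist = 0.0
    cdf_p = 0.0
    cdf_q = 0.0
    for i in range(len(support) - 1):
        cdf_p += p[i]
        cdf_q += q[i]
        dist += abs(cdf_p - cdf_q) * (support[i + 1] - support[i])
    return dist


# Hypothetical age distributions that differ only at ages 55 and 60.
ages = [50, 55, 60, 65]
P = [0.25, 0.50, 0.00, 0.25]
Q = [0.25, 0.00, 0.50, 0.25]

kl_qp = kl_divergence(Q, P)          # infinite: supports differ
w_pq = wasserstein_1d(ages, P, Q)    # finite: 0.5 mass moved by 5 years

print("KL(Q||P) =", kl_qp)
print("W(P, Q)  =", w_pq)
```

In this toy case a KL ball of any finite radius around $P$ excludes $Q$, while a Wasserstein ball of radius 2.5 or more contains it, which is the kind of support shift the caption highlights.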

Theorems & Definitions (17)

  • Definition 2.3
  • Remark 2.4
  • Lemma 2.5
  • Proposition 3.1
  • Theorem 3.2
  • Remark 4.1
  • Theorem 4.2
  • Theorem 5.1
  • Theorem 5.2
  • Lemma A.1
  • ...and 7 more