Group-Sensitive Offline Contextual Bandits

Yihong Guo, Junjie Luo, Guodong Gao, Ritu Agarwal, Anqi Liu

TL;DR

This work tackles fairness in offline contextual bandits by introducing a group-sensitive constraint $F(\pi)\le \epsilon$ to curb disparities between two demographic groups. It proposes the Off-Policy Group-Constrained Policy Gradient (GC-PG), which uses a Lagrangian framework and a doubly robust reward estimator to maximize reward while controlling unfair disparity. Theoretical results establish a convergence rate of $O(1/T)$ to a stationary point and bound the estimator error, and experiments on synthetic and real data show reduced group disparity with competitive overall performance, often outperforming a baseline fairness method. The approach is scalable, supports multiple groups with a practical surrogate, and has potential for broader fairness notions in offline policy optimization.
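The constrained problem and its Lagrangian relaxation can be sketched as follows. This is a generic formulation under assumed notation, not necessarily the paper's exact objective: $V(\pi_\theta)$ (the overall expected reward of the policy) and $\lambda \ge 0$ (the Lagrange multiplier) are symbols introduced here for illustration, while $F(\pi)$ and $\epsilon$ are as above.

```latex
\max_{\theta} \; V(\pi_\theta)
\quad \text{s.t.} \quad F(\pi_\theta) \le \epsilon
\qquad \Longrightarrow \qquad
\min_{\lambda \ge 0} \, \max_{\theta} \;
V(\pi_\theta) - \lambda \bigl( F(\pi_\theta) - \epsilon \bigr)
```

In a scheme of this kind, $\theta$ is updated by gradient ascent on the relaxed objective and $\lambda$ grows whenever the estimated disparity exceeds $\epsilon$.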

Abstract

Offline contextual bandits allow one to learn policies from historical/offline data without requiring online interaction. However, offline policy optimization that maximizes overall expected rewards can unintentionally amplify the reward disparities across groups. As a result, some groups might benefit more than others from the learned policy, raising concerns about fairness, especially when the resources are limited. In this paper, we study a group-sensitive fairness constraint in offline contextual bandits, reducing group-wise reward disparities that may arise during policy learning. We tackle the following common-parity requirements: the reward disparity is constrained within some user-defined threshold or the reward disparity should be minimized during policy optimization. We propose a constrained offline policy optimization framework by introducing group-wise reward disparity constraints into an off-policy gradient-based optimization procedure. To improve the estimation of the group-wise reward disparity during training, we employ a doubly robust estimator and further provide a convergence guarantee for policy optimization. Empirical results in synthetic and real-world datasets demonstrate that our method effectively reduces reward disparities while maintaining competitive overall performance.
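To make the estimation step concrete, the following is a minimal sketch of a doubly robust estimate of group-wise reward and the resulting disparity. It is an illustration under stated assumptions, not the authors' implementation: the function name `dr_group_value`, the log layout `(x, a, r, g)`, the toy policies, and the use of the absolute difference of two group values as the disparity are all hypothetical choices made here.

```python
from typing import Callable, Hashable, Sequence, Tuple

# Hypothetical type aliases for this sketch only.
Policy = Callable[[int, Hashable], float]        # pi(a | x)
RewardModel = Callable[[Hashable, int], float]   # r_hat(x, a)
LogRow = Tuple[Hashable, int, float, Hashable]   # (context, action, reward, group)


def dr_group_value(
    logs: Sequence[LogRow],
    group: Hashable,
    pi_theta: Policy,
    pi_beta: Policy,
    r_hat: RewardModel,
    n_actions: int,
) -> float:
    """Doubly robust estimate of pi_theta's expected reward within one group."""
    terms = []
    for x, a, r, g in logs:
        if g != group:
            continue
        # Direct-method term: modeled reward averaged over the target policy.
        dm = sum(pi_theta(b, x) * r_hat(x, b) for b in range(n_actions))
        # Importance-weighted correction on the logged action.
        w = pi_theta(a, x) / pi_beta(a, x)
        terms.append(dm + w * (r - r_hat(x, a)))
    if not terms:
        raise ValueError(f"no logged samples for group {group!r}")
    return sum(terms) / len(terms)


# Toy logged data (assumed): two actions, uniform logging policy.
logs = [(0, 1, 1.0, 0), (0, 0, 0.0, 0), (1, 1, 0.0, 1), (1, 0, 1.0, 1)]
pi_theta = lambda a, x: 1.0 if a == 1 else 0.0   # target policy always plays action 1
pi_beta = lambda a, x: 0.5                       # known logging policy
r_hat = lambda x, a: 0.0                         # deliberately poor reward model

v0 = dr_group_value(logs, 0, pi_theta, pi_beta, r_hat, n_actions=2)
v1 = dr_group_value(logs, 1, pi_theta, pi_beta, r_hat, n_actions=2)
disparity = abs(v0 - v1)  # assumed disparity measure: absolute reward gap
```

Because the logging policy is known here, the importance-weighted correction keeps the estimate unbiased even though the reward model is uninformative; a better reward model would mainly reduce variance.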

Paper Structure

This paper contains 16 sections, 7 theorems, 37 equations, 4 figures, 7 tables, 3 algorithms.

Key Result

Lemma 4.3

(Variance of the Doubly Robust Estimator, Theorem 2 in Dudík et al., 2011) Let $\Delta(a, x) = \hat{r}(x, a) - r(x, a)$, let $\xi = \frac{(r(x, a) - \hat{r}(x, a))\pi_\theta(a|x)}{\pi_\beta(a|x)}$, and assume the logging policy is known. Then the variance of the doubly robust estimator is:

Figures (4)

  • Figure 1: The reward of two groups in a real-world task of sending prescription pickup reminder messages, where the goal is to improve the prescription pickup rate. Policy optimization without group-sensitive fairness increases the reward disparity.
  • Figure 2: Results on prescription pickup reminder messages. GC-PG improves pickup rates and reduces reward disparity, achieving performance comparable to the unconstrained policy with lower disparity.
  • Figure 3: Scatter plot of reward $(R_1, R_2)$ on two groups using three different logging policies with different $\epsilon$. Our method reduces the reward disparity, but there is a trade-off between fairness and per-group reward. We mark the best policy, i.e., the one that is not Pareto dominated by others and is most fair, with a star.
  • Figure 4: Scatter plot of reward $(R_1, R_2)$ on two groups using three different logging policies with different $\epsilon$ on the Drug dataset. First row: education as the sensitive feature; second row: gender as the sensitive feature. Our method reduces the reward disparity, but there is a trade-off between fairness and per-group reward. We mark the best policy, i.e., the one that is not Pareto dominated by others and is most fair, with a star.

Theorems & Definitions (6)

  • Lemma 4.3
  • Theorem 4.4
  • Theorem 4.5
  • Theorem 4.6
  • Definition B.1: $\varepsilon$-Fair Pareto Optimality
  • Definition B.2: Global Pareto Optimality