Intersectional Fairness in Reinforcement Learning with Large State and Constraint Spaces

Eric Eaton, Marcel Hussing, Michael Kearns, Aaron Roth, Sikata Bela Sengupta, Jessica Sorrell

TL;DR

This work addresses intersectional fairness in reinforcement learning with large state and constraint spaces by formulating a multi-objective, state-based group reward model and a minimax objective over groups. It develops oracle-efficient reductions that transform constrained multi-objective RL into standard RL plus a group-constraint optimization, enabling scalability to exponentially many groups. The paper introduces three algorithmic regimes: (i) tabular MDPs with a linear-optimization oracle over $\mathcal{G}$, (ii) large MDPs with separator-set structure using contextual FTPL, and (iii) general group structures via FairFictRL and MORL-BRNR, with proofs of sublinear regret and convergence guarantees in the structured cases. Experiments on a Barabási–Albert graph MDP demonstrate that the proposed methods achieve low constraint violations while maintaining competitive global reward, illustrating practical trade-offs between fairness and efficiency. Overall, the work advances oracle-efficient techniques for ensuring intersectional fairness in RL, with potential impact on real-world decision-making systems where subgroup welfare must be protected across complex, overlapping demographics.

Abstract

In traditional reinforcement learning (RL), the learner aims to solve a single-objective optimization problem: find the policy that maximizes expected reward. However, in many real-world settings, it is important to optimize over multiple objectives simultaneously. For example, when we are interested in fairness, states might have feature annotations corresponding to multiple (intersecting) demographic groups to whom reward accrues, and our goal might be to maximize the reward of the group receiving the minimal reward. In this work, we consider a multi-objective optimization problem in which each objective is defined by a state-based reweighting of a single scalar reward function. This generalizes the problem of maximizing the reward of the minimum reward group. We provide oracle-efficient algorithms to solve these multi-objective RL problems even when the number of objectives is exponentially large, both for tabular MDPs and for large MDPs when the group functions have additional structure. Finally, we experimentally validate our theoretical results and demonstrate applications on a preferential attachment graph MDP.
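To make the max-min objective concrete, here is a minimal sketch of how a state-based reweighting turns one scalar reward into one objective per group. It assumes a per-state reward, a fixed state-occupancy distribution for each policy, and hypothetical names (`group_rewards`, `min_group_reward`, the groups `"A"` and `"B"`); none of these come from the paper.

```python
from typing import Dict, List


def group_rewards(
    occupancy: List[float],
    reward: List[float],
    group_weights: Dict[str, List[float]],
) -> Dict[str, float]:
    """Per-group objective: sum over states of occupancy * group weight * reward."""
    return {
        g: sum(d * w * r for d, w, r in zip(occupancy, weights, reward))
        for g, weights in group_weights.items()
    }


def min_group_reward(
    occupancy: List[float],
    reward: List[float],
    group_weights: Dict[str, List[float]],
) -> float:
    """Reward of the worst-off group (the quantity a max-min learner maximizes)."""
    return min(group_rewards(occupancy, reward, group_weights).values())


# Toy instance (assumed): 3 states, 2 overlapping groups; state 1 is in both.
reward = [1.0, 0.5, 0.0]
group_weights = {"A": [1.0, 1.0, 0.0], "B": [0.0, 1.0, 1.0]}

greedy = [1.0, 0.0, 0.0]  # all visitation mass on the highest-reward state
shared = [0.0, 1.0, 0.0]  # all visitation mass on the state both groups share

print(min_group_reward(greedy, reward, group_weights))  # group B receives nothing
print(min_group_reward(shared, reward, group_weights))  # both groups receive 0.5
```

In this toy instance the reward-maximizing occupancy leaves one group with zero reward, while the occupancy concentrated on the shared state halves total reward but raises the minimum group reward.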

Paper Structure

This paper contains 23 sections, 10 theorems, 46 equations, 3 figures, 6 algorithms.

Key Result

Theorem 3.1

(Informal) [No-Regret Player Guarantee] For a sequence of best-response policies $(D_1,\ldots,D_T)$ and Lagrangian weights $(\lambda_1,\ldots,\lambda_T)$ maintained by the learner and regulator in MORL-BRNR (Algorithm MORL-BRNR), where $\mathrm{reg}(T,C,\gamma)$ is sublinear in $T$.
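The dynamic behind a guarantee of this kind pits a best-responding learner against a no-regret regulator. The sketch below illustrates that dynamic only. It assumes a small finite set of candidate policies with known per-group rewards, and it uses Hedge (multiplicative weights) as a stand-in no-regret algorithm. It is not MORL-BRNR, and the function name, payoff table, and step size are all hypothetical.

```python
import math
from typing import List


def best_response_vs_hedge(
    payoffs: List[List[float]], rounds: int = 2000, eta: float = 0.05
) -> List[float]:
    """payoffs[p][g] is the reward of group g under candidate policy p.

    The regulator runs Hedge over groups, shifting weight toward groups with
    low cumulative reward. The learner best-responds to the current weights.
    Returns the empirical mixture over policies played.
    """
    n_policies, n_groups = len(payoffs), len(payoffs[0])
    cumulative = [0.0] * n_groups
    counts = [0] * n_policies
    for _ in range(rounds):
        # Hedge weights: proportional to exp(-eta * cumulative group reward).
        # Subtracting the minimum keeps every exponent non-positive.
        low = min(cumulative)
        weights = [math.exp(-eta * (c - low)) for c in cumulative]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Learner's best response: maximize the weighted sum of group rewards.
        best = max(
            range(n_policies),
            key=lambda p: sum(w * payoffs[p][g] for g, w in enumerate(weights)),
        )
        counts[best] += 1
        for g in range(n_groups):
            cumulative[g] += payoffs[best][g]
    return [c / rounds for c in counts]


# Assumed toy payoffs: two policies each favor one group, one is a compromise.
payoffs = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.4]]
mixture = best_response_vs_hedge(payoffs)
group_avg = [
    sum(mixture[p] * payoffs[p][g] for p in range(len(payoffs)))
    for g in range(len(payoffs[0]))
]
print(mixture, group_avg)
```

Because the learner's per-round weighted reward is at least the max-min value and Hedge's regret is sublinear, the worst-off group's average reward under the empirical mixture approaches the max-min value (0.5 in this toy instance, above the 0.4 that the deterministic compromise policy achieves).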

Figures (3)

  • Figure 1: Depiction of the Barabási-Albert graph we use as an MDP. Nodes correspond to states, and actions are deterministic moves to other nodes. Every state also has a self-loop action to remain in place. Groups are assigned by the number of outgoing edges: all nodes with $1$ or $2$ outgoing edges are in Group $0$, nodes with $3$ outgoing edges are in Group $1$, and all others are in Group $2$. Rewards $r(s, a)$ are $0.1$ for all $s$ in Group $0$, $0.2$ for all $s$ in Group $1$, and $0.3$ for all $s$ in Group $2$. The start state is a random node in the graph in every episode. Figure (a) shows the occupancy distribution of the (non-fair) optimal policy: its goal is to quickly reach one of the nodes with $5$ edges (Group $2$) and stay there indefinitely to accumulate reward. Figure (b) shows the distribution after running our FairFictRL algorithm, which is spread almost evenly across all nodes. The only outlier is node $8$, as it is the only node that belongs to Group $1$ and thus requires a large visitation number to satisfy our constraints.
  • Figure 2: Average reward to groups during a run of FairFictRL on the MDP depicted in Figure 1 for $\frac{\alpha}{H}=0.04$. The optimal (non-fair) behavior in the MDP is to move to any Group $2$ node and stay there indefinitely, which achieves at most $0.3$ average reward. As we run FairFictRL, the learned mixture policy quickly ensures that all groups obtain at least $\frac{\alpha}{H}$ average reward.
  • Figure 3: Total average reward compared to per-group average reward for varying $\alpha$. As the fairness threshold $\alpha$ increases, total reward decreases while group rewards even out, until eventually all groups obtain the same average reward.
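The group and reward assignment described in the Figure 1 caption can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's experimental code: the five-node `toy_graph` is hypothetical (it is not the graph in Figure 1), self-loops are not counted as outgoing edges, the reward levels $0.1$, $0.2$, $0.3$ are taken to map to Groups $0$, $1$, $2$ in order, and a group's average reward is computed as occupancy mass on its states times their reward.

```python
from typing import Dict, List

ALPHA_OVER_H = 0.04  # per-step threshold used in Figure 2
REWARD_BY_GROUP = {0: 0.1, 1: 0.2, 2: 0.3}


def group_of(out_degree: int) -> int:
    """Degree-based group rule from the Figure 1 caption."""
    if out_degree in (1, 2):
        return 0
    if out_degree == 3:
        return 1
    return 2


# Hypothetical undirected toy graph as adjacency lists (self-loops omitted).
toy_graph: Dict[int, List[int]] = {
    0: [1, 2, 3, 4],
    1: [0, 2, 3],
    2: [0, 1],
    3: [0, 1],
    4: [0],
}
groups = {s: group_of(len(nbrs)) for s, nbrs in toy_graph.items()}


def per_group_average_reward(occupancy: Dict[int, float]) -> Dict[int, float]:
    """Average reward accruing to each group under a state-occupancy distribution."""
    totals = {g: 0.0 for g in REWARD_BY_GROUP}
    for state, mass in occupancy.items():
        g = groups[state]
        totals[g] += mass * REWARD_BY_GROUP[g]
    return totals


def violated_groups(
    occupancy: Dict[int, float], threshold: float = ALPHA_OVER_H, tol: float = 1e-12
) -> List[int]:
    """Groups whose average reward falls below the threshold."""
    avg = per_group_average_reward(occupancy)
    return sorted(g for g, v in avg.items() if v < threshold - tol)


stay_at_hub = {0: 1.0}                    # park on the Group 2 node forever
uniform = {s: 0.2 for s in toy_graph}     # spread visitation evenly

print(violated_groups(stay_at_hub))
print(violated_groups(uniform))
```

In this toy graph, parking on the highest-reward node starves Groups $0$ and $1$, while the uniform occupancy meets the threshold for every group. Group $1$, with a single member node, sits right at the threshold, which mirrors the caption's remark that a lone Group $1$ node needs heavy visitation.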

Theorems & Definitions (27)

  • Definition 2.1: Regulator's Regret
  • Definition 2.2: $\nu$-approximate minimax equilibrium
  • Definition 2.3: Regulator's Best Response Function
  • Definition 2.4: Lin-OPT Oracle
  • Definition 2.5: OPT Oracle
  • Definition 2.6: Learner's Best Response Oracle
  • Theorem 3.1
  • Theorem 3.2: Approximate Min-Max
  • proof
  • Lemma 3.3: FTPL Regret
  • ...and 17 more