Fairness in Reinforcement Learning with Bisimulation Metrics

Sahand Rezaei-Shoshtari, Hanna Yurchyk, Scott Fujimoto, Doina Precup, David Meger

TL;DR

The paper tackles long-term group fairness in reinforcement learning by linking demographic parity to bisimulation metrics. It introduces Bisimulator, which optimizes the reward function and observation dynamics without explicit constraints, guided by a group-conditioned bisimulation metric, and leaves the underlying RL solver unchanged. The authors formalize group-conditioned $\pi$-bisimulation, derive value-bound relations, and propose a practical algorithm that minimizes a joint bisimulation-based loss over quantile-matched state-group pairs. The method is demonstrated on lending and college admissions benchmarks with PPO and DQN. The approach offers a scalable, solver-agnostic way to reduce disparities over time while maintaining competitive performance in dynamic, sequential decision problems.
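As a rough illustration of what "quantile-matched" pairs could look like, the sketch below pairs values from two groups that sit at the same quantile level of their own group's distribution. This is one reading of the phrase, not the paper's procedure: the function name `quantile_matched_pairs`, the parameter `num_quantiles`, and the use of scalar scores as stand-ins for states are all assumptions made for this example.

```python
import numpy as np

def quantile_matched_pairs(scores_a, scores_b, num_quantiles=5):
    """Pair values from two groups that share the same quantile level.

    Hypothetical helper: scalar scores stand in for states, and the
    matching rule is an assumption, not taken from the paper.
    """
    levels = np.linspace(0.0, 1.0, num_quantiles)
    q_a = np.quantile(np.asarray(scores_a, dtype=float), levels)
    q_b = np.quantile(np.asarray(scores_b, dtype=float), levels)
    # Each tuple is (quantile level, value in group A, value in group B).
    return list(zip(levels.tolist(), q_a.tolist(), q_b.tolist()))

# Toy credit scores for two groups (made-up numbers).
group_1 = [1, 2, 3, 4, 5]
group_2 = [3, 4, 5, 6, 7]
pairs = quantile_matched_pairs(group_1, group_2, num_quantiles=3)
for level, a, b in pairs:
    print(f"quantile {level:.2f}: group 1 -> {a}, group 2 -> {b}")
```

Under this reading, a bisimulation-based loss would then be evaluated on each matched pair, so that individuals at the same relative standing within their group are compared with one another.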

Abstract

Ensuring long-term fairness is crucial when developing automated decision-making systems, especially in dynamic and sequential environments. By maximizing their reward without consideration of fairness, AI agents can introduce disparities in their treatment of groups or individuals. In this paper, we establish the connection between bisimulation metrics and group fairness in reinforcement learning. We propose a novel approach that leverages bisimulation metrics to learn reward functions and observation dynamics, ensuring that learners treat groups fairly while reflecting the original problem. We demonstrate the effectiveness of our method in addressing disparities in sequential decision-making problems through empirical evaluation on a standard fairness benchmark consisting of lending and college admission scenarios.

Paper Structure

This paper contains 38 sections, 6 theorems, 22 equations, 11 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

The operator $\mathcal{F}^\pi_{\text{group}}$, defined in the paper's equation labeled eq:group_bisim, has a least fixed point $d^\pi_{\text{group} \sim}$, and $d^\pi_{\text{group} \sim}$ is a group-conditioned $\pi$-bisimulation metric.
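To give a feel for what a least fixed point of such an operator means computationally, here is a minimal sketch that iterates the ordinary (not group-conditioned) $\pi$-bisimulation update from the zero metric. It assumes a finite state space and deterministic transitions under the policy, in which case the Kantorovich term between two next-state point masses reduces to the current distance between the successor states. The paper's group-conditioned operator is not reproduced here; the function name, arguments, and toy chain are illustrative assumptions.

```python
import numpy as np

def pi_bisim_fixed_point(rewards, next_state, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Iterate d <- |r_i - r_j| + gamma * d[next(i), next(j)] starting from d = 0.

    Assumes deterministic policy-induced transitions (an assumption of this
    sketch); for gamma < 1 the update is a contraction, so iteration converges.
    """
    rewards = np.asarray(rewards, dtype=float)
    nxt = np.asarray(next_state, dtype=int)
    n = len(rewards)
    reward_gap = np.abs(rewards[:, None] - rewards[None, :])
    d = np.zeros((n, n))
    for _ in range(max_iters):
        # Entry [i, j] of the indexed array is d[nxt[i], nxt[j]].
        d_new = reward_gap + gamma * d[nxt[:, None], nxt[None, :]]
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d

# Toy chain: three absorbing states; states 0 and 1 earn reward 1, state 2 earns 0.
d = pi_bisim_fixed_point(rewards=[1.0, 1.0, 0.0], next_state=[0, 1, 2], gamma=0.9)
print(np.round(d, 4))
```

In this toy chain, states 0 and 1 behave identically and end up at distance 0, while state 2 is separated from them by $1 / (1 - \gamma) = 10$, the discounted sum of a reward gap of 1 at every step.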

Figures (11)

  • Figure 1: Lending results. The first row (a-d) shows the lending scenario where the repayment probability is only a function of the credit score, while the second row (e-h) presents the case where the repayment probability is a function of the credit score and a latent conscientiousness parameter. (a, e) Average return. (b, f) Recall for group 1. (c, g) Recall for group 2. (d, h) Credit gap measured as the Kantorovich distance between the credit score distributions at the end of evaluation episodes. The shaded regions show 95% confidence intervals and plots are smoothed for visual clarity.
  • Figure 2: Credit gaps of Bisimulator and PPO. Solid lines show the gap between the actual credit scores that govern the MDP dynamics, and the dashed line shows the gap between the modified credit scores that are observed by the agent.
  • Figure 3: College admission results. The shaded regions show 95% confidence intervals and plots are smoothed for visual clarity.
  • Figure 4: Initial credit score distribution for each group.
  • Figure 5: Lending results. Cumulative loans given to each group over the course of evaluation episodes. The first row (a, b) shows the lending scenario where the repayment probability is only a function of the credit score, while the second row (c, d) presents the case where the repayment probability is a function of the credit score and a latent conscientiousness parameter. Results are obtained over 10 seeds and 5 evaluation episodes per seed. Confidence intervals are not shown for visual clarity.
  • ...and 6 more figures
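
The credit gap in Figure 1 is described as a Kantorovich distance between credit score distributions. As a minimal sketch of how such a gap can be computed from samples, the function below uses the fact that for two one-dimensional empirical distributions with the same number of samples, the Kantorovich (Wasserstein-1) distance equals the mean absolute difference between the sorted samples. The function name and the toy scores are assumptions for illustration, and the paper's exact evaluation code may differ.

```python
import numpy as np

def kantorovich_1d(samples_a, samples_b):
    """Kantorovich (Wasserstein-1) distance between two equal-size 1-D samples."""
    a = np.sort(np.asarray(samples_a, dtype=float))
    b = np.sort(np.asarray(samples_b, dtype=float))
    if a.shape != b.shape:
        # Unequal sample sizes need a CDF-based computation instead.
        raise ValueError("this sketch assumes equal sample sizes")
    # Optimal 1-D transport matches the i-th smallest value in each sample.
    return float(np.mean(np.abs(a - b)))

# Made-up final credit scores for two groups.
credit_gap = kantorovich_1d([3, 1, 2], [4, 6, 5])
print(credit_gap)
```

For samples of different sizes, `scipy.stats.wasserstein_distance` computes the same quantity from the empirical CDFs.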

Theorems & Definitions (15)

  • Definition 1: Bisimulation
  • Definition 2: $\pi$-Bisimulation
  • Definition 3: Group
  • Definition 4: Group-conditioned MDP
  • Definition 5: Demographic parity fairness in RL (Satija et al., 2023)
  • Definition 6: Group-conditioned $\pi$-Bisimulation
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Theorem 3
  • ...and 5 more