Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate

Yuancheng Xu, Chenghao Deng, Yanchao Sun, Ruijie Zheng, Xiyao Wang, Jieyu Zhao, Furong Huang

TL;DR

This work introduces Equal Long-term Benefit Rate (ELBERT), a ratio-after-aggregation fairness notion for sequential decision-making modeled via a Supply-Demand MDP (SD-MDP), where a group's long-term well-being is $\frac{\eta_g^S(\pi)}{\eta_g^D(\pi)}$ and bias is $b(\pi)=\max_g\frac{\eta_g^S(\pi)}{\eta_g^D(\pi)}-\min_g\frac{\eta_g^S(\pi)}{\eta_g^D(\pi)}$. To optimize under fairness, the paper derives a fairness-aware policy gradient that reduces to standard policy gradients, enabling ELBERT-PO with PPO updates; for multi-group settings, it introduces a soft-bias surrogate $b^{\text{soft}}(\pi)$ with temperature $\beta$ and proves $b(\pi) \le b^{\text{soft}}(\pi) \le b(\pi)+\frac{2\log M}{\beta}$. The method is validated in lending, infectious disease control, and attention allocation, showing substantial bias reductions with high utility, and the work discusses extensions to demand-regularized objectives and broader implications for ethical AI in sequential tasks.
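The bias and soft-bias quantities above can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it assumes the soft bias is the log-sum-exp smoothing of the max and of the negated min with temperature $\beta$ (the summary states only the bound, not the formula), and the supply and demand numbers are made up for illustration.

```python
import math

def benefit_rates(supply, demand):
    """Long-term benefit rate per group: cumulative supply / cumulative demand."""
    return [s / d for s, d in zip(supply, demand)]

def bias(rates):
    """b = max_g rate_g - min_g rate_g."""
    return max(rates) - min(rates)

def _logsumexp(xs):
    # Numerically stable log(sum(exp(x))) using the usual max shift.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def soft_bias(rates, beta):
    """Assumed surrogate: smooth max plus smooth (-min), each via log-sum-exp."""
    soft_max = _logsumexp([beta * z for z in rates]) / beta
    soft_neg_min = _logsumexp([-beta * z for z in rates]) / beta
    return soft_max + soft_neg_min

# Hypothetical cumulative supply and demand for M = 3 groups.
supply = [30.0, 12.0, 45.0]
demand = [60.0, 40.0, 50.0]
rates = benefit_rates(supply, demand)
b = bias(rates)

for beta in (1.0, 10.0, 100.0):
    upper = b + 2 * math.log(len(rates)) / beta
    print(f"beta={beta:6.1f}  b={b:.4f}  b_soft={soft_bias(rates, beta):.4f}  upper={upper:.4f}")
```

Under this assumed form, each log-sum-exp term overshoots its hard counterpart by at most $\frac{\log M}{\beta}$, which is where the $\frac{2\log M}{\beta}$ gap in the stated bound would come from; a larger $\beta$ tightens the surrogate.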

Abstract

Decisions made by machine learning models can have lasting impacts, making long-term fairness a critical consideration. It has been observed that ignoring long-term effects and directly applying fairness criteria designed for static settings can actually worsen bias over time. To address bias in sequential decision-making, we introduce a long-term fairness concept named Equal Long-term Benefit Rate (ELBERT). This concept is integrated into a Markov Decision Process (MDP) to account for the future effects of actions on long-term fairness, thus providing a unified framework for fair sequential decision-making problems. ELBERT addresses the temporal discrimination issues found in previous long-term fairness notions. Additionally, we demonstrate that the policy gradient of Long-term Benefit Rate can be analytically simplified to standard policy gradients. This simplification makes conventional policy optimization methods viable for reducing bias, leading to our bias mitigation approach ELBERT-PO. Extensive experiments across diverse sequential decision-making environments consistently show that ELBERT-PO significantly reduces bias while maintaining high utility. Code is available at https://github.com/umd-huang-lab/ELBERT.

Paper Structure (45 sections, 4 theorems, 35 equations, 9 figures, 1 table)

This paper contains 45 sections, 4 theorems, 35 equations, 9 figures, 1 table.

Key Result

Proposition 3.1

The gradient of the objective function can be calculated as where $\frac{\partial h}{\partial z_g}$ is the partial derivative of $h$ w.r.t. its $g$-th coordinate, evaluated at $\left(\frac{\eta^{S}_1(\pi)}{\eta^{D}_1(\pi)},\frac{\eta^{S}_2(\pi)}{\eta^{D}_2(\pi)}\right)$.
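The chain-rule structure behind this statement can be sketched as follows. This is a hedged reconstruction of the mechanism only: it assumes the bias enters the objective through a differentiable function $h$ of the two groups' benefit rates, and it does not reproduce the exact objective or its weighting, which are not given in this summary.

```latex
\nabla_\pi\, h\!\left(\frac{\eta^{S}_1(\pi)}{\eta^{D}_1(\pi)},\frac{\eta^{S}_2(\pi)}{\eta^{D}_2(\pi)}\right)
  = \sum_{g\in\{1,2\}} \frac{\partial h}{\partial z_g}\,
    \nabla_\pi\!\left(\frac{\eta^{S}_g(\pi)}{\eta^{D}_g(\pi)}\right),
\qquad
\nabla_\pi\!\left(\frac{\eta^{S}_g(\pi)}{\eta^{D}_g(\pi)}\right)
  = \frac{\nabla_\pi \eta^{S}_g(\pi)}{\eta^{D}_g(\pi)}
    - \frac{\eta^{S}_g(\pi)}{\eta^{D}_g(\pi)^{2}}\,\nabla_\pi \eta^{D}_g(\pi).
```

The second identity is the quotient rule. Since $\eta^{S}_g(\pi)$ and $\eta^{D}_g(\pi)$ are cumulative quantities, each of $\nabla_\pi \eta^{S}_g(\pi)$ and $\nabla_\pi \eta^{D}_g(\pi)$ has the form of a standard policy gradient, which is consistent with the abstract's claim that conventional policy optimization methods become applicable.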

Figures (9)

  • Figure 1: (A) A loan application example. At time step $t$, the bank approves $0$ loans out of $1$ blue applicant and $0$ loans out of $100$ red applicants. At time $t+1$, the bank approves $100$ loans out of $100$ blue applicants and $1$ loan out of $1$ red applicant. (B) Based on a ratio-after-aggregation notion, our Long-term Benefit Rate calculates the bias as $|\textcolor{blue}{\frac{100}{101}}-\textcolor{red}{\frac{1}{101}}|$, indicating that the bank's decisions are biased.
  • Figure 2: Supply Demand MDP (SD-MDP). In addition to the standard MDP (in black), SD-MDP returns group demand and group supply as fairness signals (in green).
  • Figure 3: (A) The former trajectory from Figure 1, where the only approval for the red group is assigned to the $1$ applicant at $t+1$. (B) A new trajectory, where the only difference is that the approval for the red group is assigned to one among $100$ applicants at $t$.
  • Figure 4: Reward and bias of ELBERT-PO (ours) and three other baselines (A-PPO, G-PPO, R-PPO) in three environments (lending, infectious disease control, attention allocation). Each column shows the results in one environment. The third row shows the average reward versus the average bias, where ELBERT-PO consistently appears at the upper-left corner.
  • Figure 5: Learning curve of ELBERT-PO on the attention allocation environment with different $\alpha$.
  • ...and 4 more figures
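
The arithmetic behind the Figure 1 and Figure 3 captions can be checked with a short sketch. It assumes that a ratio-before-aggregation notion averages the per-step approval rates, whereas the ratio-after-aggregation notion divides total supply (approvals) by total demand (applicants); the dictionary layout and function names are hypothetical, and no discounting is applied.

```python
from fractions import Fraction

# Per-step (approvals, applicants) pairs for each group, for steps t and t+1.
fig1_trajectory = {"blue": [(0, 1), (100, 100)], "red": [(0, 100), (1, 1)]}
# Figure 3 (B): the single red approval goes to one of 100 applicants at t instead.
fig3b_trajectory = {"blue": [(0, 1), (100, 100)], "red": [(1, 100), (0, 1)]}

def rate_before_aggregation(steps):
    """Assumed form: average of the per-step supply/demand ratios."""
    return sum(Fraction(s, d) for s, d in steps) / len(steps)

def rate_after_aggregation(steps):
    """Total supply divided by total demand over the trajectory."""
    total_supply = sum(s for s, _ in steps)
    total_demand = sum(d for _, d in steps)
    return Fraction(total_supply, total_demand)

def bias(trajectory, rate_fn):
    """Gap between the best-off and worst-off group under the given rate."""
    rates = [rate_fn(steps) for steps in trajectory.values()]
    return max(rates) - min(rates)

for name, traj in (("Figure 1", fig1_trajectory), ("Figure 3 (B)", fig3b_trajectory)):
    print(name,
          "before:", bias(traj, rate_before_aggregation),
          "after:", bias(traj, rate_after_aggregation))
```

In both trajectories the red group receives one approval out of 101 applications, so the ratio-after-aggregation bias is $\frac{99}{101}$ in each case. Under the assumed averaging rule, the ratio-before-aggregation bias depends on which time step the approval falls in, which illustrates the temporal discrimination issue the abstract refers to.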

Theorems & Definitions (8)

  • Definition 2.1: Supply-Demand MDP (SD-MDP)
  • Definition 2.2: Cumulative Supply and Demand
  • Definition 2.3: Long-term Benefit Rate
  • Proposition 3.1
  • Proposition 3.2
  • Proposition 3.3: Approximation property of the soft bias
  • Proposition 2.1
  • proof