On Mitigating Affinity Bias through Bandits with Evolving Biased Feedback

Matthew Faw, Constantine Caramanis, Jessica Hoffmann

TL;DR

The paper studies affinity bias in sequential decision making by introducing affinity bandits, a non-stationary bandit variant where observed feedback is biased by the fraction of times arms are chosen. It formalizes a bias model with $W_i(t)=f\left(\frac{T^0_i+T_i(t-1)}{t_0^{\text{bias}}+t-1}\right)$ and shows that standard algorithms fail due to the evolving bias, even when real rewards remain fixed. The authors prove a new instance-dependent lower bound that scales with the number of arms $K$ and design a near-optimal elimination-style algorithm that achieves sublinear regret despite unknown bias, using a novel divergence-based analysis extended to time-varying, policy-dependent feedback. They also provide an asymptotic lower bound showing that the bias can inflate regret by a multiplicative factor tied to $f$ and $K$, establishing near-tightness of their upper bounds. Overall, the work offers both fundamental limits and practical strategies for mitigating affinity bias in hiring-like sequential decision problems with evolving biased feedback.
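
To make the bias model concrete, here is a minimal sketch of how $W_i(t)$ could be computed from pull counts. It is illustrative only: the function names `bias_weights` and `expected_biased_feedback` are hypothetical, the default $f(x)=x$ is an assumption (the paper treats $f$ more generally), and the expected biased feedback is taken to be $\mu_i W_i(t)$, in line with the multiplicative bias and the "debiased" feedback $Y_t W_{A_t}(t)^{-1}$ mentioned in the figure captions.

```python
from typing import Callable, List, Sequence


def bias_weights(
    pulls: Sequence[int],
    initial_counts: Sequence[float],
    t0_bias: float,
    f: Callable[[float], float] = lambda x: x,  # assumption: identity bias function
) -> List[float]:
    """Compute W_i(t) = f((T_i^0 + T_i(t-1)) / (t_0^bias + t - 1)) for every arm.

    pulls[i]          -- T_i(t-1), number of times arm i was pulled before round t
    initial_counts[i] -- T_i^0, the initial (pre-existing) count for arm i
    t0_bias           -- t_0^bias, the initial normalizer
    """
    # sum(pulls) equals t - 1, the number of rounds played so far.
    denom = t0_bias + sum(pulls)
    if denom <= 0:
        raise ValueError("need at least one initial count or one pull")
    return [f((c0 + c) / denom) for c0, c in zip(initial_counts, pulls)]


def expected_biased_feedback(
    means: Sequence[float], weights: Sequence[float]
) -> List[float]:
    """Expected observed feedback under a multiplicative bias: mu_i * W_i(t)."""
    return [mu * w for mu, w in zip(means, weights)]


# Two arms with real means 0.9 (optimal) and 0.6, no initial counts.
means = [0.9, 0.6]
after_one_pull_each = expected_biased_feedback(means, bias_weights([1, 1], [0, 0], 0))
after_extra_pull_arm2 = expected_biased_feedback(means, bias_weights([1, 2], [0, 0], 0))
```

Under these assumptions, one extra pull of the suboptimal arm is enough to make its expected biased feedback (roughly 0.40) exceed that of the optimal arm (roughly 0.30), even though the real means never change.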

Abstract

Unconscious bias has been shown to influence how we assess our peers, with consequences for hiring, promotions and admissions. In this work, we focus on affinity bias, the component of unconscious bias which leads us to prefer people who are similar to us, despite no deliberate intention of favoritism. In a world where the people hired today become part of the hiring committee of tomorrow, we are particularly interested in understanding (and mitigating) how affinity bias affects this feedback loop. This problem has two distinctive features: 1) we only observe the biased value of a candidate, but we want to optimize with respect to their real value; and 2) the bias towards a candidate with a specific set of traits depends on the fraction of people in the hiring committee with the same set of traits. We introduce a new bandits variant that exhibits those two features, which we call affinity bandits. Unsurprisingly, classical algorithms such as UCB often fail to identify the best arm in this setting. We prove a new instance-dependent regret lower bound, which is larger than that in the standard bandit setting by a multiplicative function of $K$. Since we treat rewards that are time-varying and dependent on the policy's past actions, deriving this lower bound requires developing proof techniques beyond the standard bandit techniques. Finally, we design an elimination-style algorithm which nearly matches this regret, despite never observing the real rewards.

Paper Structure

This paper contains 26 sections, 30 theorems, 202 equations, 10 figures, 3 algorithms.

Key Result

Theorem 4.0

Suppose that \ref{alg:eliminationUnknownBias} is run for $n$ time-steps in an environment $\boldsymbol{\nu}$ with bias model satisfying \ref{assump:multBias} with Lipschitz constant $L$ and $\mu_i \in [0,1]$ for all $i\in [K]$, using the sampling schedule $m_{r} = 2^{2r + 6}\log(\frac{12}{\pi^2} K^2 r^2 n)$. Fu
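
As a rough illustration of how quickly the sampling schedule grows, the sketch below evaluates $m_r$ exactly as written in the statement above. The function name `sampling_schedule` is hypothetical, and rounding up to an integer number of samples is an assumption, not something the excerpt specifies.

```python
import math


def sampling_schedule(r: int, K: int, n: int) -> int:
    """Evaluate m_r = 2^(2r + 6) * log((12 / pi^2) * K^2 * r^2 * n).

    r -- round index of the elimination procedure (r >= 1)
    K -- number of arms
    n -- time horizon
    The ceiling is an assumption made here to obtain an integer sample count.
    """
    if r < 1 or K < 1 or n < 1:
        raise ValueError("r, K and n must all be at least 1")
    log_term = math.log((12 / math.pi**2) * K**2 * r**2 * n)
    return math.ceil(2 ** (2 * r + 6) * log_term)


# Schedule for the first few rounds with K = 5 arms and horizon n = 10_000.
schedule = [sampling_schedule(r, K=5, n=10_000) for r in range(1, 5)]
```

The factor $2^{2r+6}$ quadruples from one round to the next while the logarithmic term grows slowly, so each round asks for a little more than four times as many samples as the previous one.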

Figures (10)

  • Figure 1: Representation of our setting when each arm has been picked exactly once. The expected biased feedback is the expected real reward divided by $K$. The ordering of the observed rewards is identical to that of the real rewards, but the suboptimality gaps are divided by $K$.
  • Figure 2: Starting from the setting in \ref{fig:biasModel1}, we pick arm 2. The biased feedback for arm 2 now appears better than that for arm 1, the real best arm. Moreover, while the fraction for arm 2 increases, the fraction for every other arm decreases.
  • Figure 3: Empirical probability of the suboptimal arm being pulled for more than half of the time horizon, as a function of the initial bias. We show results for UCB, EXP3 and EXP3-IX. For high initial weight on the suboptimal arm, all three algorithms are more likely to pull it more than the optimal arm. Moreover, even for high weight on the optimal arm, the probability that the suboptimal arm is pulled more than the optimal arm can be bounded away from 0.
  • Figure 4: The number of times the suboptimal arm is pulled as a function of time for UCB-V in two environments, normalized by $\sqrt{t}$. In the first, UCB-V receives the true rewards $X_{A_t,t}$ as samples. In the second environment, UCB-V receives "debiased" feedback $Y_tW_{A_t}(t)^{-1}$ as samples. While we cannot conclude from this graph whether the regret of UCB-V grows as $\sqrt{t}$, it is unlikely that it grows as $\log(t)$.
  • Figure 5: A depiction of the environment construction for \ref{thm:ucbLinearRegretAppendix} at $t=1$. The left side of the figure shows the means in the original environments $\boldsymbol{\nu},\boldsymbol{\nu}^{\mathrm{bias}}$. The right side shows the "frozen" environment $\widetilde{\boldsymbol{\nu}}^{\mathrm{st}}$ used for our proof. Notice that the biased optimal arm is arm $2$, not the true optimal arm $1$. Further, $\widetilde{\Delta}^{\mathrm{st}} < \Delta^{\mathrm{bias}}(1)$.
  • ...and 5 more figures
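
The effect described in the captions of Figures 1 and 2 can be checked with a short worked computation. This is a sketch under explicit assumptions that the captions do not state: $f(x)=x$, no initial counts ($T^0_i = 0$ and $t_0^{\text{bias}} = 0$), and expected biased feedback equal to $\mu_i W_i(t)$.

```latex
% After each of the K arms has been pulled exactly once (t - 1 = K):
W_i(t) = \frac{1}{K}, \qquad
\mathbb{E}[Y_i] = \frac{\mu_i}{K}, \qquad
\frac{\mu_1}{K} - \frac{\mu_i}{K} = \frac{\mu_1 - \mu_i}{K}.

% After one additional pull of arm 2 (t - 1 = K + 1):
W_2(t) = \frac{2}{K+1}, \qquad
W_j(t) = \frac{1}{K+1} \quad (j \neq 2).

% Arm 2 then looks better than the real best arm 1 exactly when
\frac{2\mu_2}{K+1} > \frac{\mu_1}{K+1}
\quad\Longleftrightarrow\quad
\mu_2 > \frac{\mu_1}{2}.
```

So, under these assumptions, a single extra pull of arm 2 reverses the observed ordering whenever $\mu_2$ exceeds half of $\mu_1$, while the fraction for every other arm drops from $1/K$ to $1/(K+1)$.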

Theorems & Definitions (61)

  • Theorem 4.0: Regret guarantee for \ref{alg:eliminationUnknownBias}; Simplified version of \ref{thm:eliminationUnknownBiasAppendix} and \ref{cor:regretConstSuboptimalityGaps}
  • Definition 5.0: Consistent policy
  • Theorem 5.0: Informal statement of \ref{thm:lowerBoundRegretAppendix}
  • Remark 5.0: Comparison to standard bandit regret lower bound
  • Corollary 5.0: Comparison of \ref{thm:regretConstSuboptimalityGaps} and \ref{thm:lowerBoundRegret}
  • Lemma 5.0: A divergence decomposition for biased environments; Informal statement of \ref{lem:divergenceDecompAppendix}
  • Claim 5.0: Consequence of the Divergence Decomposition; Simplified version of \ref{cor:divDecompConsequenceAppendix}
  • Lemma 5.0: Size of the small bias set
  • Lemma 5.0: A small bias set which is stable over time; Informal statement of \ref{lem:stabilityAppendix}
  • Lemma 5.0: A derandomization of \ref{lem:stability}; Informal statement of \ref{lem:mainLowerBoundIngredientsAppendix}
  • ...and 51 more