On Mitigating Affinity Bias through Bandits with Evolving Biased Feedback
Matthew Faw, Constantine Caramanis, Jessica Hoffmann
TL;DR
The paper studies affinity bias in sequential decision making by introducing affinity bandits, a non-stationary bandit variant where the observed feedback for an arm is biased according to the fraction of past rounds in which that arm was chosen. It formalizes a bias model with $W_i(t)=f\left(\frac{T^0_i+T_i(t-1)}{t_0^{\text{bias}}+t-1}\right)$ and shows that standard algorithms such as UCB can fail due to the evolving bias, even when real rewards remain fixed. The authors prove a new instance-dependent lower bound that scales with the number of arms $K$ and design a near-optimal elimination-style algorithm that achieves sublinear regret despite unknown bias, using a novel divergence-based analysis extended to time-varying, policy-dependent feedback. They also provide an asymptotic lower bound showing that the bias can inflate regret by a multiplicative factor tied to $f$ and $K$, establishing near-tightness of their upper bounds. Overall, the work offers both fundamental limits and practical strategies for mitigating affinity bias in hiring-like sequential decision problems with evolving biased feedback.
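To make the bias model above concrete, the following is a minimal Python sketch that evaluates $W_i(t)$ for every arm. It is an illustration under stated assumptions, not the authors' implementation: the function name `bias_weights`, its parameter names, and the identity placeholder used for $f$ are all hypothetical. The summary does not specify $f$ or how $W_i(t)$ enters the observed feedback, so the sketch stops at computing the weights.

```python
from typing import Callable, List, Sequence


def bias_weights(
    prior_counts: Sequence[float],  # T_i^0 for each arm i (assumed given)
    pull_counts: Sequence[int],     # T_i(t-1): pulls of arm i before round t
    t0_bias: float,                 # t_0^bias, passed explicitly (no assumption on its value)
    t: int,                         # current round, with t >= 1
    f: Callable[[float], float],    # bias function f; its form is not given in the summary
) -> List[float]:
    """Return W_i(t) = f((T_i^0 + T_i(t-1)) / (t_0^bias + t - 1)) for each arm."""
    if len(prior_counts) != len(pull_counts):
        raise ValueError("prior_counts and pull_counts must have one entry per arm")
    denom = t0_bias + t - 1
    if denom <= 0:
        raise ValueError("t_0^bias + t - 1 must be positive")
    return [f((p + n) / denom) for p, n in zip(prior_counts, pull_counts)]


if __name__ == "__main__":
    # Hypothetical two-arm example; the identity f is only a placeholder.
    weights = bias_weights(
        prior_counts=[3, 1],
        pull_counts=[2, 0],
        t0_bias=4,
        t=3,
        f=lambda x: x,
    )
    print(weights)
```

Because the argument of $f$ depends on how often each arm has been pulled so far, the weights change with the policy's own choices, which is the source of the non-stationarity described above.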
Abstract
Unconscious bias has been shown to influence how we assess our peers, with consequences for hiring, promotions and admissions. In this work, we focus on affinity bias, the component of unconscious bias which leads us to prefer people who are similar to us, despite no deliberate intention of favoritism. In a world where the people hired today become part of the hiring committee of tomorrow, we are particularly interested in understanding (and mitigating) how affinity bias affects this feedback loop. This problem has two distinctive features: 1) we only observe the biased value of a candidate, but we want to optimize with respect to their real value; and 2) the bias towards a candidate with a specific set of traits depends on the fraction of people in the hiring committee with the same set of traits. We introduce a new bandit variant that exhibits these two features, which we call affinity bandits. Unsurprisingly, classical algorithms such as UCB often fail to identify the best arm in this setting. We prove a new instance-dependent regret lower bound, which is larger than that in the standard bandit setting by a multiplicative function of $K$. Since we treat rewards that are time-varying and dependent on the policy's past actions, deriving this lower bound requires developing proof techniques beyond the standard bandit techniques. Finally, we design an elimination-style algorithm which nearly matches this regret, despite never observing the real rewards.
