ABROCA Distributions For Algorithmic Bias Assessment: Considerations Around Interpretation
Conrad Borchers, Ryan S. Baker
TL;DR
The paper investigates ABROCA, a metric that integrates the absolute difference between two groups' ROC curves across all thresholds, and probes its distribution under varying conditions via simulations. By mapping $AUC$ differences to a common effect size and simulating univariate normal predictors with logistic regression, the authors show that ABROCA can be inflated by chance when subgroup AUCs are similar and sample sizes are small, and that ABROCA converges to $|AUC_1 - AUC_2|$ as differences grow or samples become larger. They further demonstrate that imbalanced outcomes and small minority group sizes exacerbate skew and uncertainty, underscoring the need for careful interpretation or statistical calibration when using ABROCA to assess bias in learning analytics. The work contributes open-source code for simulation-based evaluation and highlights practical implications for applying ABROCA in education contexts, including threshold and significance considerations for reliable bias assessment.
Abstract
Algorithmic bias continues to be a key concern of learning analytics. We study the statistical properties of the Absolute Between-ROC Area (ABROCA) metric. This fairness measure quantifies group-level differences in classifier performance through the absolute difference in ROC curves. ABROCA is particularly useful for detecting nuanced performance differences even when overall Area Under the ROC Curve (AUC) values are similar. We sample ABROCA under various conditions, including varying AUC differences and class distributions. We find that ABROCA distributions exhibit high skewness dependent on sample sizes, AUC differences, and class imbalance. When assessing whether a classifier is biased, this skewness inflates ABROCA values by chance, even when data is drawn (by simulation) from populations with equivalent ROC curves. These findings suggest that ABROCA requires careful interpretation given its distributional properties, especially when used to assess the degree of bias and when classes are imbalanced.
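The chance inflation described above can be illustrated with a small simulation. The following is a minimal sketch, not the authors' released code. It takes ABROCA to be $\int_0^1 |ROC_1(t) - ROC_2(t)|\,dt$, approximated on a fixed grid of false positive rates, and draws both groups from the same population so that the true ROC curves are identical. All function names, the effect size of 1.0, the 50% base rate, and the group sizes are assumptions chosen for illustration. The logistic regression step is skipped on the assumption that the fitted slope is positive, in which case the predicted probability is a monotone function of the single predictor and the ROC curve is unchanged.

```python
import bisect
import random


def roc_points(scores, labels):
    """Empirical ROC curve as (fpr, tpr) lists; tied scores share one point."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes are required to build a ROC curve")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item in a run of tied scores.
        if i == len(pairs) - 1 or pairs[i + 1][0] != score:
            fpr.append(fp / neg)
            tpr.append(tp / pos)
    return fpr, tpr


def _interp(x, xs, ys):
    """Linear interpolation on a non-decreasing xs (vertical steps allowed)."""
    j = bisect.bisect_right(xs, x) - 1
    if j >= len(xs) - 1:
        return ys[-1]
    x0, x1 = xs[j], xs[j + 1]  # x1 > x0 because j is the rightmost xs[j] <= x
    return ys[j] + (ys[j + 1] - ys[j]) * (x - x0) / (x1 - x0)


def abroca(scores_a, labels_a, scores_b, labels_b, grid_size=1001):
    """Trapezoidal approximation of the area between two ROC curves."""
    fa, ta = roc_points(scores_a, labels_a)
    fb, tb = roc_points(scores_b, labels_b)
    step = 1.0 / (grid_size - 1)
    gaps = [
        abs(_interp(i * step, fa, ta) - _interp(i * step, fb, tb))
        for i in range(grid_size)
    ]
    return step * sum((gaps[i] + gaps[i + 1]) / 2 for i in range(grid_size - 1))


def simulate_group(n, effect, base_rate, rng):
    """Binary outcomes plus one normal predictor shifted by `effect` for positives."""
    while True:
        labels = [1 if rng.random() < base_rate else 0 for _ in range(n)]
        if 0 < sum(labels) < n:  # redraw if one class is missing
            break
    scores = [rng.gauss(effect * y, 1.0) for y in labels]
    return scores, labels


def null_abroca_draws(n_per_group, effect=1.0, base_rate=0.5, reps=100, seed=0):
    """Sample ABROCA when both groups share the same population ROC curve."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        sa, la = simulate_group(n_per_group, effect, base_rate, rng)
        sb, lb = simulate_group(n_per_group, effect, base_rate, rng)
        draws.append(abroca(sa, la, sb, lb))
    return draws


if __name__ == "__main__":
    for n in (50, 500):
        draws = null_abroca_draws(n)
        print(n, sum(draws) / len(draws), max(draws))
```

Because ABROCA is bounded below by zero, every draw under this null setup is non-negative, so the sampled values sit above the true value of zero and shrink toward it only as the groups grow. Lowering `base_rate` or `n_per_group` in this sketch is one way to explore the imbalance effects the abstract describes.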
