ABROCA Distributions For Algorithmic Bias Assessment: Considerations Around Interpretation

Conrad Borchers, Ryan S. Baker

TL;DR

The paper investigates ABROCA, a metric that captures threshold-specific differences in ROC performance between groups, and uses simulations to probe its distribution under varying conditions. By mapping $AUC$ differences to a common effect size and simulating univariate normal predictors with logistic regression, the authors show that ABROCA can be inflated by chance when subgroup AUCs are similar and sample sizes are small, and that ABROCA converges to $|AUC_1 - AUC_2|$ as differences grow or samples become larger. They further demonstrate that imbalance in the outcome classes and in group sizes exacerbates skew and uncertainty, underscoring the need for careful interpretation or statistical calibration when using ABROCA to assess bias in learning analytics. The work contributes open-source code for simulation-based evaluation and highlights practical implications for applying ABROCA in education contexts, including considerations around thresholds and statistical significance for reliable bias assessment.

Abstract

Algorithmic bias continues to be a key concern of learning analytics. We study the statistical properties of the Absolute Between-ROC Area (ABROCA) metric. This fairness measure quantifies group-level differences in classifier performance through the absolute difference in ROC curves. ABROCA is particularly useful for detecting nuanced performance differences even when overall Area Under the ROC Curve (AUC) values are similar. We sample ABROCA under various conditions, including varying AUC differences and class distributions. We find that ABROCA distributions exhibit high skewness dependent on sample sizes, AUC differences, and class imbalance. When assessing whether a classifier is biased, this skewness inflates ABROCA values by chance, even when data is drawn (by simulation) from populations with equivalent ROC curves. These findings suggest that ABROCA requires careful interpretation given its distributional properties, especially when used to assess the degree of bias and when classes are imbalanced.
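
To illustrate the inflation described above, recall that ABROCA is commonly defined as the area between two groups' ROC curves, $\int_0^1 |ROC_1(t) - ROC_2(t)|\,dt$, which is never smaller than $|AUC_1 - AUC_2|$ and equals it when the curves do not cross. The following is a minimal sketch, not the authors' released code: instead of fitting a logistic regression, it assumes a binormal score model (negatives drawn from $N(0,1)$, positives from $N(d,1)$ with $d = \sqrt{2}\,\Phi^{-1}(AUC)$), and all function names and parameters are hypothetical. Both groups are drawn from the same population, so any nonzero ABROCA is due to sampling noise.

```python
import numpy as np
from statistics import NormalDist


def simulate_group(n, auc, prevalence, rng):
    """Draw labels and scores for one group (ASSUMED binormal model, not the paper's setup)."""
    d = np.sqrt(2) * NormalDist().inv_cdf(auc)  # class separation giving the target AUC
    y = (rng.random(n) < prevalence).astype(int)
    score = rng.normal(loc=d * y, scale=1.0)
    return y, score


def roc_on_grid(y_true, y_score, grid):
    """Empirical ROC (upper step function) evaluated at the FPR values in `grid`."""
    neg_desc = np.sort(y_score[y_true == 0])[::-1]
    pos = np.sort(y_score[y_true == 1])
    # k = number of negatives allowed above the threshold at each FPR value
    k = np.floor(grid * len(neg_desc) + 1e-9).astype(int)
    k = np.clip(k, 0, len(neg_desc))
    # Threshold sits at the (k+1)-th largest negative; -inf once all negatives are admitted
    thr = np.concatenate([neg_desc, [-np.inf]])[k]
    # TPR = fraction of positives scoring strictly above the threshold
    return 1.0 - np.searchsorted(pos, thr, side="right") / len(pos)


def abroca(y_a, s_a, y_b, s_b, n_grid=1001):
    """Trapezoidal integral of |ROC_a - ROC_b| over a common FPR grid."""
    grid = np.linspace(0.0, 1.0, n_grid)
    gap = np.abs(roc_on_grid(y_a, s_a, grid) - roc_on_grid(y_b, s_b, grid))
    return float(np.sum((gap[1:] + gap[:-1]) / 2 * np.diff(grid)))


def null_abroca_samples(n_per_group, auc=0.8, prevalence=0.5, reps=200, seed=0):
    """Sample ABROCA repeatedly when both groups share the same population ROC curve."""
    rng = np.random.default_rng(seed)
    out = np.empty(reps)
    for i in range(reps):
        group_a = simulate_group(n_per_group, auc, prevalence, rng)
        group_b = simulate_group(n_per_group, auc, prevalence, rng)
        out[i] = abroca(*group_a, *group_b)
    return out


small = null_abroca_samples(n_per_group=200)
large = null_abroca_samples(n_per_group=5000)
print("median ABROCA, n=200 per group: ", np.median(small))
print("median ABROCA, n=5000 per group:", np.median(large))
```

Under these assumptions, the sampled values stay strictly positive even though the population ROC curves are identical, and they shrink as the per-group sample size grows. Changing `prevalence` (for example to 0.9) gives a rough way to explore the class-imbalance effect the abstract mentions.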

Paper Structure

This paper contains 18 sections, 4 figures, 1 algorithm.

Figures (4)

  • Figure 1: ABROCA distributions under no AUC difference for different test set sample sizes, including median point estimates and 95% confidence intervals based on repeated simulations (left) and in relationship to $AUC_1 - AUC_2$, for small (500-3,500; blue), medium (3,500-6,000; orange), and large (6,000-9,500; green) test set sample sizes (right).
  • Figure 2: Distribution of the ABROCA statistic under different true AUC differences for different test set sample sizes, including median point estimates and 95% confidence intervals based on repeated simulations.
  • Figure 3: Sampled ABROCA values across three sample sizes and two population AUC differences for demonstration.
  • Figure 4: Histograms of sampled ABROCA values under balanced (50%) and imbalanced (90%) outcome and minority group classes as well as equal (0.8) and lower (0.6 vs. 0.8) AUC values in the minority group for 5,000 observations.