FairlyUncertain: A Comprehensive Benchmark of Uncertainty in Algorithmic Fairness

Lucas Rosenblatt; R. Teal Witter

TL;DR

This work introduces FairlyUncertain, an axiomatic benchmark for evaluating uncertainty estimates in fairness, which posits that fair predictive uncertainty estimates should be consistent across learning pipelines and calibrated to observed randomness.

Abstract

Fair predictive algorithms hinge on both equality and trust, yet inherent uncertainty in real-world data challenges our ability to make consistent, fair, and calibrated decisions. While fairly managing predictive error has been extensively explored, some recent work has begun to address the challenge of fairly accounting for irreducible prediction uncertainty. However, a clear taxonomy and well-specified objectives for integrating uncertainty into fairness remain undefined. We address this gap by introducing FairlyUncertain, an axiomatic benchmark for evaluating uncertainty estimates in fairness. Our benchmark posits that fair predictive uncertainty estimates should be consistent across learning pipelines and calibrated to observed randomness. Through extensive experiments on ten popular fairness datasets, our evaluation reveals: (1) A theoretically justified and simple method for estimating uncertainty in binary settings is more consistent and calibrated than prior work; (2) Abstaining from binary predictions, even with improved uncertainty estimates, reduces error but does not alleviate outcome imbalances between demographic groups; (3) Incorporating consistent and calibrated uncertainty estimates in regression tasks improves fairness without any explicit fairness interventions. Additionally, our benchmark package is open-source and designed to be extensible, so that it can grow with the field. By providing a standardized framework for assessing the interplay between uncertainty and fairness, FairlyUncertain paves the way for more equitable and trustworthy machine learning practices.

Paper Structure (20 sections, 9 equations, 20 figures, 12 tables)

Introduction
Preliminaries
Axioms
FairlyUncertain: A Benchmark for Uncertainty in Fairness
Consistency and Calibration
Models and Classification Tasks
Regression Tasks
Abstaining on Classification Tasks
Uncertainty Aware Fair Regression
Related Work
Conclusion
Datasets Included in FairlyUncertain
Additional Results on Consistency and Calibration
Why to Use Binomial NLL for Binary Classification Uncertainty
Additional experiments
...and 5 more sections

Figures (20)

Figure 1: Two distributions over observable outcomes. For example, Distribution A can represent the test scores of a student in a stable home whereas Distribution B can represent the test scores of a student in an unstable home. While both have the same mean and $80\%$ confidence interval, the distributions are substantially different as captured by the standard deviation.
Figure 2: This boxplot shows the standard deviation of each individual's uncertainty estimates across different max_depth hyperparameter settings. For example, if an individual has the same uncertainty estimate for each hyperparameter setting, then their standard deviation is 0 (perfect consistency), whereas if the estimates vary wildly, the standard deviation is high (not consistent). The Binomial NLL and Ensemble methods are the most consistent.
Figure 3: For five groups assembled by predicted uncertainty, we plot the average predicted uncertainty against the empirical standard deviation of the outcomes. An algorithm is perfectly calibrated if predicted uncertainty equals the empirical standard deviation i.e., the points lie on the dashed identity line. Note that uncertainty estimates do not always represent variance, so we expect a positive but not necessarily linear correlation. Additionally, note that this calibration graph also reflects consistency; a less consistent method will have a more arbitrary grouping leading to a flatter observed slope.
Figure 4: Abstaining has no reliable effect on Statistical Parity (comparable to the Random baseline).
Figure 5: For an abstention rate $r$, FairlyUncertain abstains on the $r$ fraction of observations with the highest uncertainty. For heteroscedastic uncertainty methods, predictions become more accurate as the model abstains more, while the error rate for the random baseline remains steady.
...and 15 more figures
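
The quantities described in the captions of Figures 2, 3, and 5 can be sketched in a few lines of NumPy. This is a minimal illustration based only on the caption text, not the FairlyUncertain package's own code: the function names (consistency_spread, calibration_points, abstain_mask) are hypothetical, the groups in the calibration sketch are assumed to be equal-sized, and the "empirical standard deviation of the outcomes" is taken literally as the standard deviation of the observed outcomes within each group.

```python
import numpy as np


def consistency_spread(uncertainty_by_setting):
    """Figure 2 reading: per-individual standard deviation of uncertainty
    estimates across hyperparameter settings (0 means perfectly consistent).

    uncertainty_by_setting has shape (n_settings, n_individuals).
    """
    u = np.asarray(uncertainty_by_setting, dtype=float)
    return u.std(axis=0)


def calibration_points(predicted_uncertainty, outcomes, n_groups=5):
    """Figure 3 reading: sort individuals by predicted uncertainty, split them
    into n_groups groups (assumed equal-sized here), and return the average
    predicted uncertainty and the empirical std of outcomes for each group.
    """
    u = np.asarray(predicted_uncertainty, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    order = np.argsort(u, kind="stable")
    groups = np.array_split(order, n_groups)
    mean_u = np.array([u[g].mean() for g in groups])
    emp_std = np.array([y[g].std() for g in groups])
    return mean_u, emp_std


def abstain_mask(predicted_uncertainty, rate):
    """Figure 5 reading: mark the `rate` fraction of observations with the
    highest predicted uncertainty as abstentions (True = abstain).
    """
    u = np.asarray(predicted_uncertainty, dtype=float)
    n_abstain = int(np.floor(rate * len(u)))
    mask = np.zeros(len(u), dtype=bool)
    if n_abstain > 0:
        mask[np.argsort(u, kind="stable")[-n_abstain:]] = True
    return mask
```

Under these assumptions, plotting the two arrays returned by calibration_points against each other gives the kind of points shown in Figure 3, and evaluating error only where abstain_mask is False gives the error at abstention rate r as in Figure 5.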

Theorems & Definitions (10)

Definition 2.1: Learning Pipeline
Definition 2.2: Similar Learning Pipelines
Definition 6.1: Statistical Parity
Definition 6.2: Uncertainty-Aware Statistical Parity (UA-SP)
Definition G.1: Statistical Parity
Definition G.2: Equalized Odds
Definition G.3: Equal Opportunity
Definition G.4: Disparate Impact
Definition G.5: Predictive Parity
Definition G.6: False Positive Rate Equality
