OxonFair: A Flexible Toolkit for Algorithmic Fairness

Eoin Delaney; Zihao Fu; Sandra Wachter; Brent Mittelstadt; Chris Russell

TL;DR

OxonFair is a new open source toolkit for enforcing fairness in binary classification. It supports NLP and Computer Vision classification as well as standard tabular problems, and it can optimize any measure based on True Positives, False Positives, False Negatives, and True Negatives.

Abstract

We present OxonFair, a new open source toolkit for enforcing fairness in binary classification. Compared to existing toolkits: (i) We support NLP and Computer Vision classification as well as standard tabular problems. (ii) We support enforcing fairness on validation data, making us robust to a wide range of overfitting challenges. (iii) Our approach can optimize any measure based on True Positives, False Positives, False Negatives, and True Negatives. This makes it easily extensible and much more expressive than existing toolkits. It supports all 9 and all 10 of the decision-based group metrics listed in two popular review articles, respectively. (iv) We jointly optimize a performance objective alongside fairness constraints. This minimizes degradation while enforcing fairness, and even improves the performance of inadequately tuned unfair baselines. OxonFair is compatible with standard ML toolkits, including sklearn, Autogluon, and PyTorch, and is available at https://github.com/oxfordinternetinstitute/oxonfair.
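
As a rough illustration of points (ii) to (iv), the sketch below computes fairness and performance measures purely from True Positives, False Positives, False Negatives, and True Negatives, then runs a naive grid search over per-group thresholds on held-out validation scores, maximizing accuracy subject to a bound on the demographic parity gap. It is not OxonFair's API or its accelerated solver: every function name, the synthetic data, and the 0.02 bound are assumptions made for illustration, and the search is exponential in the number of groups.

```python
import itertools
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels and binary predictions."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    return tp, fp, fn, tn

# Each measure is a plain function of the four counts.
def positive_rate(tp, fp, fn, tn):
    return (tp + fp) / max(tp + fp + fn + tn, 1)

def recall(tp, fp, fn, tn):
    return tp / max(tp + fn, 1)

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / max(tp + fp + fn + tn, 1)

def group_gap(measure, y_true, y_pred, groups):
    """Largest difference of a per-group measure across groups."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    values = [
        measure(*confusion_counts(y_true[groups == g], y_pred[groups == g]))
        for g in np.unique(groups)
    ]
    return max(values) - min(values)

def fit_group_thresholds(scores, y_true, groups, max_gap, grid=None):
    """Naive search: maximize accuracy s.t. demographic parity gap <= max_gap."""
    scores, y_true, groups = map(np.asarray, (scores, y_true, groups))
    grid = np.linspace(0.0, 1.0, 21) if grid is None else grid
    group_ids = np.unique(groups)
    best = None
    for thresholds in itertools.product(grid, repeat=len(group_ids)):
        y_pred = np.zeros(len(scores), dtype=bool)
        for g, t in zip(group_ids, thresholds):
            mask = groups == g
            y_pred[mask] = scores[mask] >= t
        # positive_rate gap across groups is the demographic parity gap
        gap = group_gap(positive_rate, y_true, y_pred, groups)
        if gap <= max_gap:
            acc = accuracy(*confusion_counts(y_true, y_pred))
            if best is None or acc > best[0]:
                best = (acc, dict(zip(group_ids.tolist(), thresholds)), gap)
    return best

# Synthetic validation data (illustrative only): scores are shifted by group.
rng = np.random.default_rng(0)
n = 400
groups = rng.integers(0, 2, size=n)
y_true = rng.integers(0, 2, size=n)
scores = np.clip(0.5 * y_true + 0.15 * groups + rng.normal(0.2, 0.2, size=n), 0, 1)

best_acc, best_thresholds, best_gap = fit_group_thresholds(
    scores, y_true, groups, max_gap=0.02
)
print(best_acc, best_thresholds, best_gap)
```

Swapping `positive_rate` for `recall` inside `fit_group_thresholds` would constrain the recall gap across groups instead, which is the sense in which any measure built from the four counts can serve as either the objective or the constraint.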

Paper Structure (45 sections, 7 equations, 16 figures, 23 tables)

Introduction
Related Work
Fairness Toolkits
Specialist solvers
Toolkit interface
Inference
Efficient grid sampling
Inferred characteristics
Fairness for Deep Networks
Toolkit expressiveness
Experimental Analysis
Computer Vision and CelebA
Implementation Details
Results
NLP and Toxic Content
...and 30 more sections

Figures (16)

Figure 1: Left: The need for an objective when enforcing fairness. We evaluate a range of methods with respect to balanced accuracy and demographic parity (OxonFair generates a frontier of solutions). Only OxonFair and RejectOptimization optimize balanced accuracy. As we improve the balanced accuracy of fair methods by adjusting classification thresholds (gray lines), fairness deteriorates. To avoid this, we jointly optimize a fairness measure and an objective. For more examples, see \ref{figure:overfit}. Right Top: Using validation data in fairness. We compare against Fairlearn using standard algorithms with default parameters. These methods perfectly overfit and show no unfairness with respect to equal opportunity on the training set, but substantial unfairness on the test set. OxonFair enforces fairness on held-out validation data and is less prone to overfitting. Right Bottom: A comparison of toolkits. AIF360 offers a large range of tabular methods, most of which do not allow fairness metric selection; Fairlearn offers fewer but more customizable tabular methods. OxonFair offers one method that can be applied to text, image, and tabular data, while supporting more notions of fairness and objectives.
Figure 2: Left: Summary of the fast path algorithm for inferred attributes (\ref{inferred}). Groups are noisily estimated using a classifier. Within each estimated group, we cumulatively sum positive and negative samples that truly belong to each group. For each pair of thresholds, we select relevant sums from the inferred group and combine them. See \ref{sec:fast}. Center: Combining two heads (original classifier and group predictor) to create a fair classifier. See \ref{sec:enforcedeep}. Right: The output of a second head predicting the protected attribute in CelebA. The pronounced bimodal distribution makes the weighted sum of the two heads a close replacement for per-group thresholds.
Figure 3: Left: Results on COMPAS without using group annotations at test time. Right: Runtime comparison for Fairlearn Reductions and OxonFair on Adult using a MacBook M2. To alter the groups, we iteratively merge the smallest racial group with 'Other', reducing the search space. For both methods, we enforced demographic parity over a training set consisting of 70% of the data. Despite the exponential complexity of our approach, we remain significantly faster until we reach 5 groups. The 0.6+ indicates the seconds to train XGBoost. OxonFair(S) indicates the runtime of the naive slow pathway described in \ref{sec:slow} rather than our accelerated approach.
Figure 4: Left: The Pareto frontier of min. group recall vs. accuracy on Blond Hair demonstrates OxonFair's superior performance. Right: Comparing accuracy of fairness methods on 26 CelebA attributes while varying global decision thresholds to increase the minimum group recall level to $\delta$.
Figure 5: The Pareto frontier on test data when enforcing two fairness measures (DEO and Min Group Min Label Acc; see \ref{sec:minimax}) for the Earrings attribute. Inspecting the Pareto frontier shows a wide range of solutions, including some that improve fairness while retaining similar accuracy.
...and 11 more figures
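
The Figure 2 caption notes that a weighted sum of the two heads can closely replace per-group thresholds. One way to see why, sketched below under simplifying assumptions (two groups, thresholds `t0` and `t1`, and hypothetical function names that are not taken from the toolkit): when the group head outputs exactly 0 or 1, thresholding `f - (t1 - t0) * g` at `t0` gives the same decisions as applying `t0` to group 0 and `t1` to group 1. When the group head is only approximately binary (strongly bimodal), the two rules agree on most samples.

```python
import numpy as np

def per_group_threshold_decision(f, g, t0, t1):
    """Apply threshold t0 where g == 0 and t1 where g == 1 (g is a hard 0/1 label)."""
    return f >= np.where(g == 1, t1, t0)

def two_head_decision(f, g_hat, t0, t1):
    """Single decision rule on a weighted sum of both heads."""
    return f - (t1 - t0) * g_hat >= t0

# Small hand-checkable example with hard group labels.
f_small = np.array([0.1, 0.3, 0.6, 0.9, 0.2, 0.4])
g_small = np.array([0, 0, 0, 1, 1, 1])
hard_a = per_group_threshold_decision(f_small, g_small, t0=0.5, t1=0.25)
hard_b = two_head_decision(f_small, g_small, t0=0.5, t1=0.25)

# Synthetic check with a noisy, strongly bimodal group head (illustrative only).
rng = np.random.default_rng(0)
n = 1000
f = rng.random(n)                      # classifier head output in [0, 1)
g = rng.integers(0, 2, size=n)         # true group membership
g_hat = np.clip(g + rng.normal(0.0, 0.05, size=n), 0.0, 1.0)  # soft group head

exact = per_group_threshold_decision(f, g, t0=0.5, t1=0.25)
approx = two_head_decision(f, g_hat, t0=0.5, t1=0.25)
agreement = float(np.mean(exact == approx))
print(agreement)
```

The practical appeal of the weighted-sum form is that the combined rule needs no group annotation at inference time, only the two head outputs.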
