Testing the Fairness-Accuracy Improvability of Algorithms

Eric Auerbach; Annie Liang; Kyohei Okumura; Max Tabord-Meehan

TL;DR

The paper asks whether an algorithm's fairness can be improved without sacrificing accuracy, and introduces an econometric framework for testing $(\Delta_r,\Delta_b,\Delta_f)$-improvability. It defines a flexible objective space, informed by the legal setting, using group-specific accuracy utilities $U_A^g(a)$ and a two-sided fairness measure $|U_F^r(a)-U_F^b(a)|$, then proposes a data-splitting, bootstrap-based procedure to test whether a status-quo algorithm is improvable within a chosen algorithm class $\mathcal{A}$. The authors prove that the test is asymptotically valid and, under an improvement-convergence condition, consistent; they also show that repeated sample-splitting is more robust to manipulation than a single split. An empirical application to the healthcare algorithm studied by Obermeyer et al. (2019) shows that substantial fairness improvements are possible without reducing predictive accuracy, illustrating the approach’s practical relevance for Title VI regulation of federally funded programs. Overall, the framework gives regulators and practitioners a transparent, flexible tool to substantiate or refute the necessity defense by testing for simultaneous improvements in fairness and accuracy.
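To make the improvability notion concrete, the sketch below checks whether a candidate algorithm improves on a status quo, given estimated utilities. It is an illustration only, not the paper's test: it compares point estimates and ignores sampling uncertainty. The `Utilities` container and `is_improvement` function are hypothetical names, and the use of relative thresholds (accuracy must exceed $(1+\Delta_g)$ times the status-quo value, the fairness gap must fall below $(1-\Delta_f)$ times the status-quo gap, with positive status-quo utilities) is an assumption that this summary does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Utilities:
    """Estimated utilities of one algorithm (hypothetical container)."""
    acc_r: float   # U_A^r(a): accuracy utility for group r
    acc_b: float   # U_A^b(a): accuracy utility for group b
    fair_r: float  # U_F^r(a): fairness utility for group r
    fair_b: float  # U_F^b(a): fairness utility for group b

    def unfairness(self) -> float:
        # Two-sided fairness measure |U_F^r(a) - U_F^b(a)|
        return abs(self.fair_r - self.fair_b)


def is_improvement(a1: Utilities, a0: Utilities,
                   d_r: float, d_b: float, d_f: float) -> bool:
    """Check a (d_r, d_b, d_f)-improvement of a1 over a0 on point estimates.

    ASSUMPTION: thresholds are relative and status-quo utilities are positive.
    """
    more_accurate_r = a1.acc_r > (1 + d_r) * a0.acc_r
    more_accurate_b = a1.acc_b > (1 + d_b) * a0.acc_b
    more_fair = a1.unfairness() < (1 - d_f) * a0.unfairness()
    return more_accurate_r and more_accurate_b and more_fair


# Made-up numbers, purely for illustration.
status_quo = Utilities(acc_r=1.0, acc_b=0.8, fair_r=1.0, fair_b=0.8)
candidate = Utilities(acc_r=1.1, acc_b=0.9, fair_r=1.1, fair_b=1.05)

halves_gap = is_improvement(candidate, status_quo, 0.0, 0.0, 0.5)
cuts_gap_by_80_percent = is_improvement(candidate, status_quo, 0.0, 0.0, 0.8)
```

With these made-up numbers the candidate is at least as accurate for both groups and shrinks the fairness gap from about 0.2 to about 0.05, so it passes with `d_f = 0.5` but not with the stricter `d_f = 0.8`.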

Abstract

Many organizations use algorithms that have a disparate impact, i.e., the benefits or harms of the algorithm fall disproportionately on certain social groups. Addressing an algorithm's disparate impact can be challenging, however, because it is often unclear whether it is possible to reduce this impact without sacrificing other objectives of the organization, such as accuracy or profit. Establishing the improvability of algorithms with respect to multiple criteria is of both conceptual and practical interest: in many settings, disparate impact that would otherwise be prohibited under US federal law is permissible if it is necessary to achieve a legitimate business interest. The question is how a policy-maker can formally substantiate, or refute, this "necessity" defense. In this paper, we provide an econometric framework for testing the hypothesis that it is possible to improve on the fairness of an algorithm without compromising on other pre-specified objectives. Our proposed test is simple to implement and can be applied under any exogenous constraint on the algorithm space. We establish the large-sample validity and consistency of our test, and microfound the test's robustness to manipulation based on a game between a policymaker and the analyst. Finally, we apply our approach to evaluate a healthcare algorithm originally considered by Obermeyer et al. (2019), and quantify the extent to which the algorithm's disparate impact can be reduced without compromising the accuracy of its predictions.
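The procedure repeats the sample split several times (the figure captions report $K = 7$ iterations), so the per-split results have to be combined into one decision. This summary does not state the aggregation rule. One standard option, assumed here purely for illustration, is to reject when the median of the per-split $p$-values falls below $\alpha/2$; doubling the median of valid $p$-values is known to yield a valid $p$-value under arbitrary dependence. The function name and the example $p$-values are hypothetical.

```python
import statistics


def aggregate_split_pvalues(p_values, alpha=0.05):
    """Combine per-split p-values by their median.

    ASSUMPTION: median aggregation with an alpha/2 cutoff; the source
    summary does not specify the rule the paper uses.
    Returns (median p-value, reject flag).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p < 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in [0, 1]")
    median_p = statistics.median(p_values)
    return median_p, median_p < alpha / 2


# Made-up p-values from K = 7 hypothetical splits.
split_pvalues = [0.01, 0.02, 0.30, 0.015, 0.04, 0.005, 0.02]
median_p, reject = aggregate_split_pvalues(split_pvalues)
```

Because the decision depends on the median rather than on any single split, one unusually favorable or unfavorable split (such as the 0.30 above) cannot determine the outcome on its own.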

Paper Structure (41 sections, 14 theorems, 140 equations, 7 figures, 6 tables)

Introduction
Legal framework
Related Literature
Model
Setup
Accuracy/Fairness Improvability
Proposed Approach
Description of Procedure
Testing whether $\hat{a}^{\rho}_{1n}$ constitutes a $(\Delta_r,\Delta_{b},\Delta_f)$-improvement on $a_0$.
Main Results
Microfoundation for Repeated Sample-Splitting
Model
Results
Empirical Application
Data and Classification Problem
...and 26 more sections

Key Result

Theorem 4.1

Suppose $P$ and $\mathcal{A}$ satisfy Assumptions assp:n, assp:Utilities and assp:NonDegenerate, and suppose the null hypothesis given in eq:H0_full holds. Then

Figures (7)

Figure 1: A summary of our procedure for testing $(\Delta_r,\Delta_b,\Delta_f)$-improvability.
Figure 2: We report the average value of $U^g$ (i.e., number of active chronic diseases conditional on automatic enrollment) for each group $g$ across the $K=7$ iterations of our procedure. The algorithm is more accurate for group $g$ when $U^g$ is larger; so, moving towards the upper-right quadrant corresponds to improvements in accuracy. The algorithm is more fair when its $(U^b, U^w)$ pair is closer to the 45-degree line (corresponding to more balanced accuracy across the patient groups). This figure suggests that the candidate algorithms based on linear regression, LASSO, and random forest improve on both fairness and accuracy over the status quo algorithm used by the hospital.
Figure 3: We present $p$-values for testing $(\delta_a, \delta_a, \delta_f)$-improvability for $(\delta_a, \delta_f) \in [-1,1] \times [0,1]$. The selection rule is based on random forests, with our procedure setting $K = 7$. Both accuracy and fairness utilities measure the expected health needs of those selected into the program. Larger values of $\delta_a$ and $\delta_f$ indicate a stricter test regarding the dimensions of accuracy and fairness, respectively. The horizontal line at $\delta_a = 0$ crosses the $p=0.025$ contour at $\delta_f = 0.64$.
Figure 4: The analyst's solution satisfies $m_2^* \leq \overline{m}$ for some finite $\overline{m}$.
Figure 5: Rejection probabilities for various levels of candidate fairness, for sample sizes $\ell_n \in \{100, 200, 400\}$.
...and 2 more figures

Theorems & Definitions (37)

Definition 2.1
Remark 1: Alternative definitions
Remark 2: Two-sided fairness measure
Remark 3: Implications for social welfare
Example 1: Classification Rate
Example 2: Calibration
Example 3: False Positive Rate
Example 4: Profit
Definition 2.2: Accuracy-Fairness Improvement
Definition 2.3: FA-dominance
...and 27 more
