Observing Context Improves Disparity Estimation when Race is Unobserved

Kweku Kwegyir-Aggrey; Naveen Durvasula; Jennifer Wang; Suresh Venkatasubramanian

TL;DR

The paper tackles the challenge of estimating racial disparities when individual race data are unavailable, highlighting biases in standard proxy methods like BISG. It introduces two contextual proxy approaches, cBISG and MICSG, and a Bayes estimator for disparity that achieves unbiased estimates under a mean-consistency condition. Through large-scale experiments on HMDA mortgage data and North Carolina voter data, the authors show that contextual proxies yield more accurate race predictions and disparity estimates, with reduced mean-consistency violations for minority groups. The work provides a practical pathway to more reliable disparity estimation in settings where direct race data are difficult to obtain or legally constrained, by leveraging contextual information and calibration-based guarantees.

Abstract

In many domains, it is difficult to obtain the race data that is required to estimate racial disparity. To address this problem, practitioners have adopted the use of proxy methods which predict race using non-protected covariates. However, these proxies often yield biased estimates, especially for minority groups, limiting their real-world utility. In this paper, we introduce two new contextual proxy models that advance existing methods by incorporating contextual features in order to improve race estimates. We show that these algorithms demonstrate significant performance improvements in estimating disparities on real-world home loan and voter data. We establish that achieving unbiased disparity estimates with contextual proxies relies on mean-consistency, a calibration-like condition.

Paper Structure (15 sections, 4 theorems, 26 equations, 6 figures)

Contributions.
Problems with Undercounting.
Algorithm: Contextual Bayesian Surname Geocoding.
Is the Hyperparameter Necessary?
Algorithm: Machine Learning Improved Contextual Surname Geocoding.
Dataset.
Methods.
Results.
Dataset.
Methods.
Results.
Proofs
Proof of Theorem \ref{thm: unbiased}.
Proof of Theorem \ref{thm:mean consistent_to_bias}.
Proof of Theorem \ref{thm:bias_to_consistent}.

Key Result

Theorem 1

Figures (6)

Figure 1: A toy example illustrating the effect of the $\eta$ hyperparameter on the posterior distribution (over distributions) for three races. Figure \ref{fig:subfiga} depicts the prior $\text{Dir}(5,3,2)$, which assigns higher mass to distributions that place a larger probability on observing White individuals than on other groups. Now, suppose that in some supplemental data we observe 2 White, 3 Black, and 1 Hispanic individual (the corresponding empirical distribution is indicated with the black dot). Figure \ref{fig:subfigb} depicts the resulting posterior when $\eta = 0$, which effectively assumes a uniform prior (rather than our census prior) and thereby assigns a higher likelihood to distributions that are majority Black, as a consequence of the supplemental data. Figure \ref{fig:subfigc} depicts the conjugate posterior for $\eta = 0.25$, which balances the census and supplemental race distributions.
Figure 2: An overview of the cBISG algorithm for computing a contextual proxy.
Figure 3: An overview of the MICSG algorithm for computing a contextual proxy.
Figure 4: The loan approval rate per racial group is given by the x-axis. A dot closer to the HMDA dot implies more accurate disparity estimation. MICSG variants outperform BISG across all groups.
Figure 5: We show the mean consistency violations for BISG and MICSG. The x-axis denotes the true proportion of individuals per racial group, in a given geography, who received loans. The size of each dot denotes the size of the corresponding bin on the x-axis. The y-axis denotes the mean consistency violation of the corresponding proxy model. Being close to the horizontal $y=0$ line indicates good performance. MICSG leads to smaller violations compared to BISG across all groups.
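
The role of $\eta$ in the Figure 1 toy example can be sketched numerically. This summary does not state exactly how $\eta$ enters the update, so the snippet below assumes a tempered (power) prior: raising the $\text{Dir}(\alpha)$ density to the power $\eta$ gives $\text{Dir}(1 + \eta(\alpha - 1))$, which reduces to a uniform prior at $\eta = 0$, in line with the caption. The function names are hypothetical and the paper's actual parameterization may differ.

```python
def tempered_dirichlet_posterior(prior_alpha, counts, eta):
    """Posterior Dirichlet parameters under an eta-tempered prior.

    ASSUMPTION: the prior density is raised to the power eta, so the
    effective prior is Dir(1 + eta * (alpha - 1)). This is an illustrative
    choice, not necessarily the parameterization used in the paper.
    """
    return [1.0 + eta * (a - 1.0) + n for a, n in zip(prior_alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of a Dirichlet distribution: alpha_i / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Toy numbers from the Figure 1 caption; order is (White, Black, Hispanic).
census_prior = [5, 3, 2]
supplemental_counts = [2, 3, 1]

for eta in (0.0, 0.25, 1.0):
    posterior = tempered_dirichlet_posterior(census_prior, supplemental_counts, eta)
    mean = [round(m, 3) for m in dirichlet_mean(posterior)]
    print(f"eta={eta}: posterior={posterior}, mean={mean}")
```

Under this assumed form, $\eta = 0$ gives a posterior whose largest parameter belongs to the Black group (driven only by the supplemental counts), $\eta = 1$ gives one led by the White group (the census prior at full weight), and $\eta = 0.25$ falls in between.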

Theorems & Definitions (9)

Definition 1: Positive Rate
Theorem 1: Theorem 3.1 of chen_fairness_2019
Definition 2: $\epsilon$-Mean Consistency
Theorem 2
Theorem 3
Theorem 4
proof
proof
proof
