Addressing Discretization-Induced Bias in Demographic Prediction

Evan Dong, Aaron Schein, Yixin Wang, Nikhil Garg

TL;DR

This paper reveals a fundamental bias that arises when continuous demographic predictions are discretized via argmax, showing major undercounts of minority groups in voter files and downstream distortions in analyses. It demonstrates that standard discretization rules alone are insufficient to ensure fair and accurate outcomes, even for calibrated models, and introduces a joint optimization framework (with a scalable data-driven thresholding heuristic) to align discrete decisions with a target distribution while preserving individual accuracy. Theoretical results quantify how discretization bias relates to predictive uncertainty and prove the necessity of joint decision-making for Pareto-optimal trade-offs between accuracy and fidelity. Empirically, the approach eliminates discretization bias across states, counties, and replication datasets, offering a practical path for vendors and researchers to produce less biased demographic labels and to rely on continuous scores when feasible. The work has broad implications for auditing, outreach, and fairness in demographic imputation across domains.

Abstract

Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions -- e.g., based on name and geography -- and then to $\textit{discretize}$ the predictions by selecting the most likely class (argmax). We study how this practice produces $\textit{discretization bias}$. In particular, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial undercount of African-American voters, e.g., by 28.2 percentage points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce a $\textit{joint optimization}$ approach -- and a tractable $\textit{data-driven thresholding}$ heuristic -- that can eliminate this bias, with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias, show that calibrated continuous models are insufficient to eliminate it, and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
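To make the mechanism in the abstract concrete, the following is a minimal sketch of how argmax labeling can undercount a minority class relative to the shares implied by the continuous scores. The score matrix, the two-class setup, and the function names are hypothetical and chosen for illustration; they are not taken from the paper or from any voter file.

```python
import numpy as np

def argmax_shares(q):
    """Share of each class among argmax labels for an (N, K) score matrix."""
    labels = np.argmax(q, axis=1)
    return np.bincount(labels, minlength=q.shape[1]) / q.shape[0]

def aggregate_posterior(q):
    """Expected class shares implied by the continuous scores."""
    return q.mean(axis=0)

# Hypothetical toy scores; columns are (majority, minority).
# 60 rows are confidently majority, 10 are confidently minority,
# and 30 are uncertain but lean majority.
q = np.vstack([
    np.tile([0.9, 0.1], (60, 1)),
    np.tile([0.1, 0.9], (10, 1)),
    np.tile([0.6, 0.4], (30, 1)),
])

agg = aggregate_posterior(q)   # class shares implied by the scores
hard = argmax_shares(q)        # class shares after argmax labeling
bias = hard - agg              # discretization bias per class
print("aggregate posterior:", agg)
print("argmax shares:      ", hard)
print("discretization bias:", bias)
```

In this toy input the scores imply a 27% minority share, but argmax labels only 10% of rows as minority, because every uncertain row is resolved in favor of the majority class.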

Paper Structure (34 sections, 4 theorems, 28 equations, 16 figures, 5 tables)

Key Result

Theorem 1

Argmax bias depends on predictive uncertainty, i.e., how much information features $x$ provide about the true class label. Consider a calibrated classifier $q$ and the argmax decision rule $D_{\textrm{argmax}}$. Let $N>K$ and consider a reference distribution of either the aggregate posterior or the p
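As a rough illustration of the dependence on predictive uncertainty, here is a sketch under a simplified two-class signal model. The model is an assumption made for this example and is not the setting of the theorem: with probability `t` the features reveal the true class, and otherwise they carry no information, so the score equals the prior. Scores built this way are calibrated by construction.

```python
def argmax_minority_bias(prior_minority, t):
    """
    Hypothetical two-class signal model (an assumption, not the paper's setup).
    With probability t the features reveal the true class (one-hot score);
    otherwise the score equals the prior.
    Returns the argmax minority share minus the aggregate posterior minority share.
    """
    assert 0.0 < prior_minority < 0.5 and 0.0 <= t <= 1.0
    # Aggregate posterior: informed and uninformed rows both average to the prior.
    aggregate = t * prior_minority + (1 - t) * prior_minority
    # Argmax: informed minority rows are labeled minority,
    # while every uninformed row is assigned to the majority class.
    argmax_share = t * prior_minority
    return argmax_share - aggregate

# Bias shrinks toward zero as the features become more informative.
for t in (0.0, 0.5, 0.9, 1.0):
    print(f"t={t:.1f}  bias={argmax_minority_bias(0.3, t):+.3f}")
```

Under this assumed model the bias equals the negative of the prior minority share times `1 - t`, so it vanishes only when the features fully determine the label.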

Figures (16)

  • Figure 1: Comparison of different discretization methods. Each subfigure shows a 3-dimensional probability simplex, where individual points are colored according to the label assigned by the corresponding method. For example, in (a), all the blue points are assigned the Caucasian label, which has the highest class probability according to the continuous model $q$ for that data point. We use a sample of points from the voter file, and Hispanic, Asian, and Native American probabilities are aggregated into Other. Our posterior matching approach matches the class distribution while maintaining individual-level accuracy.
  • Figure 2: Bias (undercounting of voters of color) in the voter file. (a) In North Carolina, where ground-truth self-reported data is available, the difference in counts for each group between each discretization method and the ground truth. The argmax method substantially undercounts African Americans in particular, and threshold methods magnify this bias further. The bar marked "Aggregate Posterior" corresponds to the bias of both the Thompson sampling and aggregate methods; because these methods directly reflect proportions from the continuous model, their undercounting is due to the model's miscalibration. The additional bias of argmax is thus the bias caused by discretizing the model scores. (b) In all states plus Washington, DC, the discretization bias (the difference between the fraction of voters of color in the discrete labels and in the aggregate posterior). Points below the horizontal line indicate underrepresentation of voters of color relative to the aggregate posterior; only in DC, Hawaii, and New Mexico does discretization increase the count of voters of color. In DC and Hawaii, Caucasian is not the most common class: African American (DC) and Asian (HI) are, and these classes are over-represented by argmax. New Mexico (NM) is the one exception where the argmax decision rule under-represents the most common class compared to the aggregate posterior. Some closely clustered states are left unlabeled for visual clarity, and the overall effect in the full voter file is marked in orange. (c) The performance of different decision rules according to our two metrics. Our optimization-based rules Pareto-dominate the sampling approach, and the data-driven threshold closely approximates the full curve of integer programs.
  • Figure 3: Per-county Caucasian bias in North Carolina, under different discretization methods -- the underrepresentation of voters of color (i.e., bias overrepresenting the white population) when compared to the aggregate posterior. The most commonly used methods of thresholding and argmax further cause geographic skews -- in many parts of the state, very few rows are classified as non-Caucasian, and the bias is largest in counties with an already skewed population. In contrast, Thompson sampling and County-conditional aggregate posterior matching have no bias when compared to the aggregate posterior.
  • Figure 4: In simulation, discretized performance and argmax bias of the Bayes optimal classifier as a function of model accuracy, labeled by generating parameter $p$ and $\text{MAE}$. The results reflect Theorem 1: as classifier accuracy increases ($\text{MAE}$ decreases), argmax bias decreases, and both accuracy and distribution fidelity of optimal rules increase. Simulation details are described in the paper's simulation setup section.
  • Figure 5: In simulation, the performance of various commonly used decision rules. The results illustrate the paper's Pareto-optimality theorem: the argmax rule maximizes accuracy, but is the only Pareto optimal independent rule. Simulation details and decision rules are described in the paper's simulation setup section.
  • ...and 11 more figures
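
The captions above contrast argmax with Thompson sampling and with rules that match the aggregate posterior. The sketch below compares the three on a binary toy input. The matching rule here is a simplified stand-in: it labels the top-scoring rows as minority until the expected minority count is reached, which amounts to a single data-dependent threshold. It is not the authors' integer program or their data-driven thresholding heuristic, and all names and scores are hypothetical.

```python
import numpy as np

def match_aggregate_posterior(q_minority):
    """
    Binary stand-in for posterior matching (hypothetical simplification).
    Labels the k highest-scoring rows as minority, where k is the rounded
    expected minority count implied by the scores.
    """
    q_minority = np.asarray(q_minority, dtype=float)
    k = int(round(q_minority.sum()))
    labels = np.zeros(q_minority.shape[0], dtype=int)
    if k > 0:
        top = np.argsort(-q_minority, kind="stable")[:k]
        labels[top] = 1
    return labels

def thompson_sample(q_minority, rng):
    """Draw each label independently from its own predicted probability."""
    q_minority = np.asarray(q_minority, dtype=float)
    return (rng.random(q_minority.shape[0]) < q_minority).astype(int)

def expected_accuracy(labels, q_minority):
    """Expected share of correct labels, assuming the scores are calibrated."""
    q_minority = np.asarray(q_minority, dtype=float)
    return float(np.mean(np.where(labels == 1, q_minority, 1 - q_minority)))

def thompson_expected_accuracy(q_minority):
    """Expected accuracy of independent sampling, averaged over the draws."""
    q_minority = np.asarray(q_minority, dtype=float)
    return float(np.mean(q_minority**2 + (1 - q_minority) ** 2))

# Hypothetical minority-class scores for 100 rows.
q_min = np.array([0.1] * 60 + [0.9] * 10 + [0.4] * 30)

argmax_labels = (q_min > 0.5).astype(int)
matched = match_aggregate_posterior(q_min)
sampled = thompson_sample(q_min, np.random.default_rng(0))

print("minority share: argmax", argmax_labels.mean(), "matched", matched.mean())
print("expected accuracy: argmax", expected_accuracy(argmax_labels, q_min))
print("expected accuracy: matched", expected_accuracy(matched, q_min))
print("expected accuracy: Thompson (in expectation)", thompson_expected_accuracy(q_min))
```

On this toy input the matched labels recover the minority share implied by the scores, at some cost in expected accuracy relative to argmax but less than independent sampling pays on average. The size of the accuracy gap here is a property of the toy scores, not an estimate of what the paper reports.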

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • Theorem 2
  • proof
  • Theorem 2
  • proof