Addressing Discretization-Induced Bias in Demographic Prediction
Evan Dong, Aaron Schein, Yixin Wang, Nikhil Garg
TL;DR
This paper identifies a systematic bias that arises when continuous demographic predictions are discretized via argmax, showing major undercounts of minority groups in voter files and downstream distortions in analyses. It demonstrates that standard discretization rules alone are insufficient to ensure fair and accurate outcomes, even for calibrated models, and introduces a joint optimization framework (with a scalable data-driven thresholding heuristic) to align discrete decisions with a target distribution while preserving individual accuracy. Theoretical results quantify how discretization bias relates to predictive uncertainty and prove that joint decision-making is necessary for Pareto-optimal trade-offs between individual accuracy and fidelity to the target distribution. Empirically, the approach eliminates discretization bias across states, counties, and replication datasets, offering a practical path for vendors and researchers to produce less biased demographic labels and to rely on continuous scores when feasible. The work has implications for auditing, outreach, and fairness wherever demographic labels are imputed.
Abstract
Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions -- e.g., based on name and geography -- and then to $\textit{discretize}$ the predictions by selecting the most likely class (argmax). We study how this practice produces $\textit{discretization bias}$. In particular, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial undercount of African-American voters, e.g., by 28.2 percentage points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce a $\textit{joint optimization}$ approach -- and a tractable $\textit{data-driven thresholding}$ heuristic -- that can eliminate this bias, with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias, showing that calibrated continuous models are insufficient to eliminate it and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
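To make the mechanism concrete, the following is a minimal sketch of how argmax labeling can undercount a group even when the underlying scores are taken to be calibrated. Everything in it is an assumption for illustration: the ten scores are invented, the setting is reduced to two classes, and `share_matching_labels` is a hypothetical top-k rule that matches the labeled share to the mean score. It is not the paper's joint optimization or its data-driven thresholding heuristic, whose details are not given in this summary.

```python
# Hypothetical scores P(minority) for 10 individuals (invented values,
# assumed calibrated for the purpose of this illustration).
scores = [0.9, 0.6, 0.45, 0.45, 0.4, 0.4, 0.3, 0.2, 0.2, 0.1]


def expected_share(scores):
    """Share of the group implied by the continuous scores (their mean)."""
    return sum(scores) / len(scores)


def argmax_labels(scores):
    """Binary argmax: label 1 only when the group is the most likely class."""
    return [1 if p > 0.5 else 0 for p in scores]


def share_matching_labels(scores, target_share):
    """Hypothetical rule: label the k highest scores as 1, where
    k = round(target_share * n), so the labeled share matches the target."""
    n = len(scores)
    k = round(target_share * n)
    # Indices sorted by descending score; ties keep their original order.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    labels = [0] * n
    for i in order[:k]:
        labels[i] = 1
    return labels


target = expected_share(scores)                  # about 0.4
argmax = argmax_labels(scores)                   # 2 of 10 labeled
matched = share_matching_labels(scores, target)  # 4 of 10 labeled

print(f"expected share: {target:.2f}")
print(f"argmax share:   {sum(argmax) / len(argmax):.2f}")
print(f"matched share:  {sum(matched) / len(matched):.2f}")
```

In this toy example the scores imply that about 40% of individuals belong to the group, but argmax labels only 20%, because most of the probability mass sits on individuals whose score is below 0.5. Lowering the effective threshold until the labeled share matches the implied share removes the aggregate gap; how much individual-level accuracy that costs depends on the data.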
