Fairness Issues and Mitigations in (Differentially Private) Socio-Demographic Data Processes

Joonhyuk Ko, Juba Ziani, Saswat Das, Matt Williams, Ferdinando Fioretto

TL;DR

The paper tackles fairness issues in socio-demographic data collection by showing that standard proportional sampling can yield uneven group-level errors that bias downstream decisions. It develops an optimization framework that exploits a two-phase survey design to minimize costs while enforcing per-group error bounds, supported by tractable error quantification via Chebyshev bounds and empirical variance proxies. It also analyzes how differential privacy affects sampling fairness, revealing that DP noise can positively bias minority counts and, counterintuitively, reduce disparities in some settings; a heuristic for near-equal subgroup sampling further supports fair allocations. Extensive experiments on ACS data (notably Connecticut) demonstrate that the optimized two-phase approach improves fairness across groups while controlling costs, and reveal nuanced DP effects on sampling that inform privacy-utility tradeoffs in real-world census processes. Overall, the work provides a principled, practical framework for fair and cost-efficient survey design under privacy constraints, with broad implications for policy-relevant data collection.
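The Chebyshev-based error quantification mentioned above can be illustrated with a small sketch. It is not the paper's optimization model: it assumes i.i.d. sampling within each group and a known per-group variance, and the function names (`chebyshev_sample_size`, `proportional_allocation`, `underserved_groups`) and all numbers are hypothetical. By Chebyshev's inequality, the mean of n draws with variance σ² deviates from the true mean by at least `tol` with probability at most σ²/(n·tol²), so keeping that probability below `delta` requires n ≥ σ²/(tol²·delta).

```python
import math


def chebyshev_sample_size(variance: float, tol: float, delta: float) -> int:
    """Smallest n such that variance / (n * tol**2) <= delta.

    Assumes i.i.d. draws, so the sample mean has variance `variance / n`.
    """
    if variance < 0 or tol <= 0 or not (0 < delta <= 1):
        raise ValueError("need variance >= 0, tol > 0 and 0 < delta <= 1")
    return max(1, math.ceil(variance / (tol ** 2 * delta)))


def proportional_allocation(group_sizes: dict[str, int], rate: float) -> dict[str, int]:
    """Give every group the same sampling rate (rounded down)."""
    return {g: math.floor(rate * n) for g, n in group_sizes.items()}


def underserved_groups(
    group_sizes: dict[str, int],
    group_variances: dict[str, float],
    rate: float,
    tol: float,
    delta: float,
) -> dict[str, tuple[int, int]]:
    """Map each group whose proportional share falls short of its
    Chebyshev requirement to (allocated, required)."""
    allocated = proportional_allocation(group_sizes, rate)
    shortfalls = {}
    for group, n_alloc in allocated.items():
        n_required = chebyshev_sample_size(group_variances[group], tol, delta)
        if n_alloc < n_required:
            shortfalls[group] = (n_alloc, n_required)
    return shortfalls
```

With two hypothetical groups of 100,000 and 2,000 people, equal variance, and a 1% sampling rate, the larger group receives 1,000 surveys and the smaller one only 20, so the same error tolerance that the larger group meets comfortably can be missed by the smaller group.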

Abstract

Statistical agencies rely on sampling techniques to collect socio-demographic data crucial for policy-making and resource allocation. This paper shows that surveys of important societal relevance introduce sampling errors that unevenly impact group-level estimates, thereby compromising fairness in downstream decisions. To address these issues, this paper introduces an optimization approach modeled on real-world survey design processes, ensuring sampling costs are optimized while maintaining error margins within prescribed tolerances. Additionally, privacy-preserving methods used to determine sampling rates can further impact these fairness issues. This paper explores the impact of differential privacy on the statistics informing the sampling process, revealing a surprising effect: not only is the expected negative effect from the addition of noise for differential privacy negligible, but also this privacy noise can in fact reduce unfairness as it positively biases smaller counts. These findings are validated over an extensive analysis using datasets commonly applied in census statistics.
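One way to see why privacy noise can positively bias smaller counts is sketched below. The abstract does not state which mechanism or post-processing the paper uses, so this is an assumption: Laplace noise with sensitivity 1 (scale 1/ε) is added to a count, and negative results are clamped to zero. Under that assumption the expected clamped count exceeds the true count c by (b/2)·exp(−c/b), where b = 1/ε, which is largest for small c and vanishes for large c. The function names are hypothetical.

```python
import math
import random


def clamped_laplace_bias(true_count: float, epsilon: float) -> float:
    """Closed-form bias E[max(0, c + Lap(b))] - c for c >= 0, with b = 1/epsilon."""
    scale = 1.0 / epsilon  # assumes the count query has sensitivity 1
    return 0.5 * scale * math.exp(-true_count / scale)


def noisy_count(true_count: float, epsilon: float, rng: random.Random) -> float:
    """Add Laplace noise, then clamp at zero (assumed post-processing)."""
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and scale `scale`.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return max(0.0, true_count + noise)


def simulated_bias(
    true_count: float, epsilon: float, trials: int = 200_000, seed: int = 0
) -> float:
    """Monte Carlo estimate of the same bias, for comparison with the formula."""
    rng = random.Random(seed)
    total = sum(noisy_count(true_count, epsilon, rng) for _ in range(trials))
    return total / trials - true_count
```

Comparing `clamped_laplace_bias(5, 0.5)` with `clamped_laplace_bias(500, 0.5)` shows the effect: the small count is inflated noticeably, while the large count is essentially unbiased. If sampling rates are derived from such noisy counts, small groups would tend to be allocated slightly more surveys than their true share.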

Paper Structure (33 sections, 21 equations, 16 figures, 6 tables)


Figures (16)

  • Figure 1: 1. Population statistics from previous years are often used to inform the survey design process; Differential privacy can be used at this stage to protect sensitive information (e.g., population counts). 2. The survey process includes selecting the amount of the population to sub-sample as well as collecting information from individuals in multiple phases (e.g., phone calls and in-person interviews). 3. The collected data is used for important tasks, such as the allocation of funds or the release of migration patterns. The paper studies the fairness impacts of this pipeline (steps 1 and 2) on multiple population segments.
  • Figure 2: Disparate errors when allocating a proportional number of surveys to each racial group in Nebraska using 2022 ACS data. 2021 ACS data is used to compute the proportional allocation, which subsamples 1% of the total population.
  • Figure 3: Estimating the variance of mean income in Connecticut, using race as the subgroup, under different privacy budgets $\varepsilon$. Points show actual estimator measurements; curves show the fitted proxy function. Results are averaged over 200 trials and 200 data points.
  • Figure 4: Relative group errors from estimating mean income in Connecticut.
  • Figure 5: Number of surveys allocated for each subgroup in the experiments reported in Figure \ref{fig:relative_error_no_privacy}.
  • ...and 11 more figures

Theorems & Definitions (2)

  • proof : Proof of Theorem \ref{thm:bias}
  • proof : Proof of Corollary \ref{cor:bias_aggr}