Benchmarking Private Population Data Release Mechanisms: Synthetic Data vs. TopDown

Aadyaa Maddi, Swadhin Routray, Alexander Goldberg, Giulia Fanti

TL;DR

The paper addresses how to privately release hierarchical, census-like data by comparing TopDown-style private statistic release with private synthetic data generation on in-distribution and out-of-distribution queries. It provides an empirical head-to-head evaluation on real ACS datasets, finding that TopDown is substantially more accurate for known queries (e.g., at least 20× lower error on counting queries at the same privacy budget), while synthetic-data methods remain usable for queries not known in advance, which TopDown cannot answer. The findings yield practical guidelines: use TopDown when the queries are known, and MST-based synthetic data when queries cannot be anticipated. The authors also identify directions for improving DP synthetic data on hierarchical data and for exploring dynamic query settings. The work clarifies privacy-utility trade-offs in hierarchical data release and informs method selection under realistic query regimes and privacy budgets.

Abstract

Differential privacy (DP) is increasingly used to protect the release of hierarchical, tabular population data, such as census data. A common approach for implementing DP in this setting is to release noisy responses to a predefined set of queries. For example, this is the approach of the TopDown algorithm used by the US Census Bureau. Such methods have an important shortcoming: they cannot answer queries for which they were not optimized. An appealing alternative is to generate DP synthetic data, which is drawn from some generating distribution. Like the TopDown method, synthetic data can also be optimized to answer specific queries, while also allowing the data user to later submit arbitrary queries over the synthetic population data. To our knowledge, there has not been a head-to-head empirical comparison of these approaches. This study conducts such a comparison between the TopDown algorithm and private synthetic data generation to determine how accuracy is affected by query complexity, in-distribution vs. out-of-distribution queries, and privacy guarantees. Our results show that for in-distribution queries, the TopDown algorithm achieves significantly better privacy-fidelity tradeoffs than any of the synthetic data methods we evaluated; for instance, in our experiments, TopDown achieved at least $20\times$ lower error on counting queries than the leading synthetic data method at the same privacy budget. Our findings suggest guidelines for practitioners and the synthetic data research community.

Paper Structure (21 sections, 1 equation, 5 figures)

Figures (5)

  • Figure 1: Accuracy of algorithms on in-distribution and out-of-distribution three-way marginal queries on ACS NY 2019 (left) and ACS Public Coverage 2021 (right). Accuracy describes the fraction of queries whose answers match ground truth. Both HPD methods are trained on queries used in the evaluation, whereas MST does not use prior knowledge of these queries. The size of the markers increases with $\epsilon \in \{0.125, 0.25, 0.5, 1, 2, 3\}$. All metrics are averaged over 20 runs.
  • Figure 2: Average absolute error of algorithms on in-distribution three-way marginal queries on ACS NY 2019 (left) and ACS Public Coverage 2021 (right). The absolute error between the answers returned by each algorithm and the answers on the ground-truth dataset is averaged over 20 runs and over all in-distribution queries. Out-of-distribution queries are excluded because the TopDown algorithm cannot answer them. Both HPD methods are trained on queries used in the evaluation, whereas MST does not use prior knowledge of these queries.
  • Figure 3: Absolute error CDFs for TopDown, HPD-Fixed, HPD-Gen, and MST for in-distribution three-way marginal queries ($k = 3$) on ACS NY 2019. As $\epsilon$ increases (reduced privacy, higher accuracy), the error distributions of TopDown skew towards 0.
  • Figure 4: Difference between absolute errors for TopDown and MST for each query at $\epsilon = 1.0$ on ACS NY 2019. Positive values indicate TopDown incurs higher absolute error, whereas negative values indicate the private synthetic dataset incurs higher error. As the complexity of the query increases, the error of the synthetic dataset approaches that of the TopDown algorithm.
  • Figure 5: Performance metrics (Tao et al., 2021) for synthetic data algorithms trained on three-way marginal queries at $\epsilon \in \{0.125, 0.25, 0.5, 1, 2, 3\}$ on ACS Public Coverage 2021.
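
The two metrics named in the captions of Figures 1 and 2 (the fraction of queries whose answers match ground truth, and the average absolute error) can be sketched on a toy three-way marginal. This is a minimal illustration, not the paper's evaluation code: the noisy release uses a plain Laplace mechanism as a stand-in for a DP release (it is neither TopDown nor MST), the toy attributes `age`, `sex`, and `cov` are made up, and every function name is hypothetical. The noise scale assumes add/remove-one-record neighboring datasets, under which a marginal histogram has L1 sensitivity 1.

```python
import random
from collections import Counter
from itertools import product


def marginal(records, attrs, domains):
    """Count the records falling in every cell of the marginal over `attrs`."""
    counts = Counter(tuple(r[a] for a in attrs) for r in records)
    cells = product(*(domains[a] for a in attrs))
    return {cell: counts.get(cell, 0) for cell in cells}


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_marginal(true_counts, epsilon, rng):
    """Stand-in DP release (assumption): Laplace noise per cell,
    then rounding and clipping to non-negative integers."""
    scale = 1.0 / epsilon  # L1 sensitivity 1 under add/remove-one neighbors
    return {
        cell: max(0, round(count + laplace_noise(scale, rng)))
        for cell, count in true_counts.items()
    }


def accuracy(true_counts, released):
    """Fraction of query answers that exactly match ground truth."""
    matches = sum(released[cell] == count for cell, count in true_counts.items())
    return matches / len(true_counts)


def mean_absolute_error(true_counts, released):
    """Average absolute difference between released and true answers."""
    total = sum(abs(released[cell] - count) for cell, count in true_counts.items())
    return total / len(true_counts)


# Toy population with three binary attributes (hypothetical data).
rng = random.Random(0)
domains = {"age": [0, 1], "sex": [0, 1], "cov": [0, 1]}
records = [{a: rng.choice(vals) for a, vals in domains.items()} for _ in range(500)]

true_counts = marginal(records, ["age", "sex", "cov"], domains)

# Same privacy budgets as in the figure captions.
for epsilon in [0.125, 0.25, 0.5, 1, 2, 3]:
    released = noisy_marginal(true_counts, epsilon, rng)
    print(epsilon, accuracy(true_counts, released),
          mean_absolute_error(true_counts, released))
```

To mirror the captions, these per-run values would be averaged over repeated runs (20 in the paper) rather than read from a single draw.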