Slowly Scaling Per-Record Differential Privacy

Brian Finley, Anthony M Caruso, Justin C Doty, Ashwin Machanavajjhala, Mikaela R Meyer, David Pujol, William Sexton, Zachary Terner

TL;DR

This work introduces slowly scaling per-record zero-concentrated differential privacy (PRzCDP) mechanisms to protect statistics derived from data with heavy tails. It presents two mechanism families—transformation mechanisms (concave mappings with Gaussian noise) and additive mechanisms (fat-tailed noise)—that ensure privacy loss scales sublinearly with a record's influence, mitigating extreme losses from outliers. The paper provides formal PRzCDP guarantees, unbiased estimators for transformed queries, and detailed empirical evaluation on CBP-like and cattle datasets, showing improved privacy for large-influence records while maintaining utility. These mechanisms enable more nuanced privacy-utility tradeoffs for large establishments and other high-impact records in economic data releases.

Abstract

We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released. Formal privacy mechanisms generally add randomness, or "noise," to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data. We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.
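The contrast between rapidly and slowly degrading guarantees can be illustrated with a small sketch. It is not the paper's mechanism: it assumes a non-negative sum query, and the function names `gaussian_loss` and `log_transform_loss_bound` are hypothetical. The first function uses the standard zCDP loss of additive Gaussian noise, $\Delta^2 / (2\sigma^2)$, which grows quadratically in a record's influence $\Delta$. The second assumes noise is added to $\log(1 + \text{sum})$ instead; since $\log(1 + q + \Delta) - \log(1 + q) \le \log(1 + \Delta)$ for $q \ge 0$, the loss is bounded by $\log(1 + \Delta)^2 / (2\sigma^2)$.

```python
import math

def gaussian_loss(influence, sigma):
    """zCDP loss of additive Gaussian noise for a record that shifts
    the released statistic by `influence` (quadratic growth)."""
    return influence ** 2 / (2.0 * sigma ** 2)

def log_transform_loss_bound(influence, sigma):
    """Upper bound on the zCDP loss when Gaussian noise is added to
    log(1 + sum) of a non-negative sum (assumed illustrative transform).
    Uses log(1 + q + d) - log(1 + q) <= log(1 + d) for q >= 0."""
    return math.log1p(influence) ** 2 / (2.0 * sigma ** 2)

# Influence values spanning several orders of magnitude, as with
# small and very large establishments.
for d in [1, 10, 1_000, 1_000_000]:
    print(d, gaussian_loss(d, 1.0), log_transform_loss_bound(d, 1.0))
```

The two `sigma` values live on different scales (raw units versus log units), so the sketch compares only how the loss grows with influence, not utility. It also omits the debiasing step needed to obtain an unbiased estimate from a transformed query.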

Paper Structure

This paper contains 36 sections, 34 theorems, 81 equations, 12 figures, 8 tables, 2 algorithms.

Key Result

Theorem 2.4

Let $M_1, M_2$ be randomized mechanisms which satisfy $\rho_1$-zCDP and $\rho_2$-zCDP respectively. Then the mechanism $M'(D) = \left(M_1(D), M_2(D)\right)$ satisfies $(\rho_1 + \rho_2)$-zCDP.
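As a minimal sketch of how this composition result is used for budgeting, the example below releases two queries with independent Gaussian noise and adds up their zCDP parameters. It relies on the standard fact that Gaussian noise with $\sigma^2 = \Delta^2 / (2\rho)$ on a query of $\ell_2$-sensitivity $\Delta$ satisfies $\rho$-zCDP. The helper names (`sigma_for_rho`, `gaussian_mechanism`, `release_pair`) and the query values are hypothetical and chosen only for illustration.

```python
import math
import random

def sigma_for_rho(sensitivity, rho):
    """Noise scale at which the Gaussian mechanism satisfies rho-zCDP."""
    return sensitivity / math.sqrt(2.0 * rho)

def gaussian_mechanism(value, sensitivity, rho, rng):
    """Release value plus N(0, sigma^2) noise calibrated to rho-zCDP."""
    return value + rng.gauss(0.0, sigma_for_rho(sensitivity, rho))

def release_pair(q1, q2, sensitivity, rho1, rho2, rng):
    """Release both queries with independent noise; by sequential
    composition the pair satisfies (rho1 + rho2)-zCDP."""
    outputs = (
        gaussian_mechanism(q1, sensitivity, rho1, rng),
        gaussian_mechanism(q2, sensitivity, rho2, rng),
    )
    return outputs, rho1 + rho2

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
noisy, total_rho = release_pair(120.0, 45.0, 1.0, 0.5, 0.25, rng)
print(noisy, total_rho)
```

The seeded `random.Random` is for repeatability only; a real release would need a cryptographically secure noise source.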

Figures (12)

  • Figure 1: Probability densities for two exponential polylogarithmic distributions and a Normal distribution. Distributions in plot (a) have unit variance. Additive mechanisms using distributions in plot (b) have PRzCDP privacy loss of $1$ when $\Delta(r) = 1$ (that is, mechanism $i$ satisfies $P_i$-PRzCDP such that $P_i(1)=1$).
  • Figure 2: This figure shows violin plots for the three variables we consider from the sampled simulated CBP data. The image on the left shows the raw data and the image on the right shows the data grouped and summed by NAICS3$\times$COUNTY. There are 1,095 groups in total. The $y$-axis is on the log scale since all three variables contain large outliers. The black horizontal lines in each of the violins represent the 25%, 50%, and 75% quantiles, respectively.
  • Figure 3: This histogram displays the distribution of the number of cattle in the USDA Cattle Inventory Survey dataset. The image on the left shows the raw data and the image on the right shows the cattle data grouped and summed by state. There are 43 groups in total. The $x$-axis is on the log scale since the dataset contains many large outliers.
  • Figure 4: This figure shows empirical CDFs of the privacy loss for each mechanism for the sampled simulated data. For a given value on the $x$-axis, the $y$-axis shows the proportion of records which have at most that level of privacy loss. These CDFs were built by computing the privacy loss for each record and then computing, over a range of privacy loss thresholds, the proportion of privacy losses which were less than or equal to each threshold value. For each attribute, the mechanisms' parameters were calculated such that their standard deviations are equal to $\sqrt{0.5} \cdot \text{median(attribute value)}$ on a query equal to median(attribute value).
  • Figure 5: This figure shows an empirical CDF of the privacy loss for each mechanism for the cattle data. For every value on the $x$-axis, the $y$-axis shows the proportion of records which have at most that privacy loss. These CDFs were built by computing the privacy loss for each record and then computing, over a range of privacy loss thresholds, the proportion of privacy losses which were less than or equal to each threshold value. The mechanisms' parameters were calculated such that their standard deviations are equal to $\sqrt{0.5} \cdot \text{median(cattle value)}$ on a query equal to median(cattle value).
  • ...and 7 more figures
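
The empirical CDF construction described in the captions of Figures 4 and 5 can be sketched as follows. The function name `empirical_cdf` and the sample losses are hypothetical; the per-record privacy losses themselves would come from a mechanism's policy function, which is not reproduced here.

```python
def empirical_cdf(losses, thresholds):
    """For each threshold, return the proportion of per-record
    privacy losses that are less than or equal to it."""
    n = len(losses)
    return [sum(1 for x in losses if x <= t) / n for t in thresholds]

# Made-up per-record losses, for illustration only.
sample_losses = [0.1, 0.5, 2.0, 8.0]
cdf = empirical_cdf(sample_losses, [0.5, 1.0, 10.0])
print(cdf)
```

Plotting `cdf` against the thresholds gives one curve per mechanism, with the thresholds on the $x$-axis and the proportions on the $y$-axis.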

Theorems & Definitions (76)

  • Definition 2.1: Rényi Divergence (Rényi, 1961)
  • Definition 2.2: Neighboring Databases
  • Definition 2.3: Zero-Concentrated Differential Privacy
  • Theorem 2.4: zCDP Sequential Composition (Bun and Steinke, 2016)
  • Theorem 2.5: zCDP Parallel Composition (Bun and Steinke, 2016)
  • Theorem 2.6: zCDP Post-processing (Bun and Steinke, 2016)
  • Definition 2.7: $\ell_2$-Sensitivity
  • Definition 2.8: Gaussian Mechanism (Bun and Steinke, 2016)
  • Theorem 2.9: zCDP Group Privacy (Bun and Steinke, 2016)
  • Definition 2.10: Record-dependent policy function
  • ...and 66 more