Slowly Scaling Per-Record Differential Privacy

Brian Finley; Anthony M Caruso; Justin C Doty; Ashwin Machanavajjhala; Mikaela R Meyer; David Pujol; William Sexton; Zachary Terner

TL;DR

This work introduces slowly scaling per-record zero-concentrated differential privacy (PRzCDP) mechanisms to protect statistics derived from heavy-tailed data. It presents two mechanism families: transformation mechanisms (concave mappings with Gaussian noise) and additive mechanisms (fat-tailed noise). Both ensure that privacy loss scales sublinearly with a record's influence, mitigating extreme losses from outliers. The paper provides formal PRzCDP guarantees, unbiased estimators for transformed queries, and a detailed empirical evaluation on CBP-like and cattle datasets, showing improved privacy for large-influence records while maintaining utility. These mechanisms enable more nuanced privacy-utility tradeoffs for large establishments and other high-impact records in economic data releases.

Abstract

We develop formal privacy mechanisms for releasing statistics from data with many outlying values, such as income data. These mechanisms ensure that a per-record differential privacy guarantee degrades slowly in the protected records' influence on the statistics being released. Formal privacy mechanisms generally add randomness, or "noise," to published statistics. If a noisy statistic's distribution changes little with the addition or deletion of a single record in the underlying dataset, an attacker looking at this statistic will find it plausible that any particular record was present or absent, preserving the records' privacy. More influential records -- those whose addition or deletion would change the statistics' distribution more -- typically suffer greater privacy loss. The per-record differential privacy framework quantifies these record-specific privacy guarantees, but existing mechanisms let these guarantees degrade rapidly (linearly or quadratically) with influence. While this may be acceptable in cases with some moderately influential records, it results in unacceptably high privacy losses when records' influence varies widely, as is common in economic data. We develop mechanisms with privacy guarantees that instead degrade as slowly as logarithmically with influence. These mechanisms allow for the accurate, unbiased release of statistics, while providing meaningful protection for highly influential records. As an example, we consider the private release of sums of unbounded establishment data such as payroll, where our mechanisms extend meaningful privacy protection even to very large establishments. We evaluate these mechanisms empirically and demonstrate their utility.
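To make the contrast between rapid and slow scaling concrete, the sketch below compares two per-record loss curves. The first is the standard zCDP loss of additive Gaussian noise, $\Delta(r)^2 / (2\sigma^2)$, which is quadratic in the influence $\Delta(r)$. The second is an illustrative bound for a log transformation. It assumes the query is a nonnegative sum $x$ and that noise $N(0, \sigma^2)$ is added to $\ln(x + a)$ for a hypothetical offset $a > 0$, so adding a record moves the transformed query by at most $\ln(1 + \Delta(r)/a)$. The function names, the offset, and the exact form of the transformed query are assumptions made for illustration and may differ from the paper's mechanisms.

```python
import math


def gaussian_loss(influence, sigma):
    """zCDP loss of additive Gaussian noise for a record that shifts the query by `influence`."""
    return influence ** 2 / (2 * sigma ** 2)


def log_transform_loss(influence, sigma, offset):
    """Bound on the loss when N(0, sigma^2) is added to ln(x + offset), with x >= 0.

    Assumption: the shift in ln(x + offset) is largest at x = 0,
    where it equals ln(1 + influence / offset).
    """
    shift = math.log1p(influence / offset)
    return shift ** 2 / (2 * sigma ** 2)


# Hypothetical parameters; sigma lives on different scales in the two mechanisms,
# so this compares growth rates, not utility-matched mechanisms.
for influence in (1, 10, 1_000, 1_000_000):
    print(
        influence,
        gaussian_loss(influence, sigma=1.0),
        log_transform_loss(influence, sigma=1.0, offset=1.0),
    )
```

Under these assumptions the Gaussian loss grows by twelve orders of magnitude across the range of influences, while the log-transformed bound stays below 100.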

Paper Structure (36 sections, 34 theorems, 81 equations, 12 figures, 8 tables, 2 algorithms)

Introduction
Preliminaries
Data Model
Zero-Concentrated Differential Privacy
Per-Record Zero-Concentrated Differential Privacy
Properties of Per-Record Differential Privacy
Unit Splitting
Slowly Scaling Mechanisms
Transformation Mechanisms
Kth Root Transformation Mechanism $f(x) = \sqrt[k]{x}$
Log Transformation Mechanism $f(x) = \ln(x)$
Unbiased Estimators ($g$)
Additive Mechanisms
Generalized Gaussian Mechanism $\left(f\left(|z|\right) = -\left(\frac{|z|}{\sigma}\right)^p\right)$
Exponential Polylogarithmic Distribution $\left(f\left(|z|\right) = -d\ln\left(\frac{|z|}{\sigma} + a\right)^p\right)$
...and 21 more sections

Key Result

Theorem 2.4

Let $M_1, M_2$ be randomized mechanisms which satisfy $\rho_1$-zCDP and $\rho_2$-zCDP respectively. Then the mechanism $M'(D) = \left(M_1(D), M_2(D)\right)$ satisfies $(\rho_1 + \rho_2)$-zCDP.
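As a small illustration of the theorem, the sketch below releases two noisy statistics with the Gaussian mechanism and adds their zCDP parameters. It relies on the standard fact that Gaussian noise with standard deviation $\sigma$ on a query of $\ell_2$-sensitivity $\Delta$ satisfies $\Delta^2 / (2\sigma^2)$-zCDP. The helper name, the query values, and the parameters are hypothetical.

```python
import random


def gaussian_mechanism(true_value, sensitivity, sigma, rng):
    """Return a noisy release and the zCDP parameter rho = sensitivity^2 / (2 sigma^2)."""
    rho = sensitivity ** 2 / (2 * sigma ** 2)
    noisy_value = true_value + rng.gauss(0.0, sigma)
    return noisy_value, rho


rng = random.Random(0)  # fixed seed only so the sketch is reproducible

# Two hypothetical queries on the same dataset, each with sensitivity 1.
release_1, rho_1 = gaussian_mechanism(120.0, sensitivity=1.0, sigma=2.0, rng=rng)
release_2, rho_2 = gaussian_mechanism(45.0, sensitivity=1.0, sigma=1.0, rng=rng)

# Publishing both releases costs the sum of the individual parameters.
total_rho = rho_1 + rho_2
print(total_rho)
```

Here $\rho_1 = 0.125$ and $\rho_2 = 0.5$, so the joint release satisfies $0.625$-zCDP.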

Figures (12)

Figure 1: Probability densities for two exponential polylogarithmic distributions and a Normal distribution. Distributions in plot (a) have unit variance. Additive mechanisms using distributions in plot (b) have PRzCDP privacy loss of $1$ when $\Delta(r) = 1$ (that is, mechanism $i$ satisfies $P_i$-PRzCDP such that $P_i(1)=1$).
Figure 2: This figure shows violin plots for the three variables we consider from the sampled simulated CBP data. The image on the left shows the raw data and the image on the right shows the data grouped and summed by NAICS3$\times$COUNTY. There are 1,095 groups in total. The $y$-axis is on the log scale since all three variables contain large outliers. The black horizontal lines in each of the violins represent the 25%, 50%, and 75% quantiles, respectively.
Figure 3: This histogram displays the distribution of the number of cattle in the USDA Cattle Inventory Survey dataset. The image on the left shows the raw data and the image on the right shows the cattle data grouped and summed by state. There are 43 groups in total. The $x$-axis is on the log scale since the dataset contains many large outliers.
Figure 4: This figure shows empirical CDFs of the privacy loss for each mechanism for the sampled simulated data. For a given value on the $x$-axis, the $y$-axis shows the proportion of records which have at most that level of privacy loss. These CDFs were built by getting the privacy loss for each record and then computing, over a range of privacy loss thresholds, the proportion of privacy losses which were less than or equal to each threshold value. For each attribute, the mechanisms' parameters were calculated such that their standard deviations are equal to $\sqrt{0.5}*\text{median(attribute value)}$ on a query equal to median(attribute value).
Figure 5: This figure shows an empirical CDF of the privacy loss for each mechanism for the cattle data. For every value on the $x$-axis, the $y$-axis shows the proportion of records which have at most that privacy loss. These CDFs were built by getting the privacy loss for each record and then computing, over a range of privacy loss thresholds, the proportion of privacy losses which were less than or equal to each threshold value. The mechanisms' parameters were calculated such that their standard deviations are equal to $\sqrt{0.5}*\text{median(cattle value)}$ on a query equal to median(cattle value).
...and 7 more figures
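The empirical CDF construction described in the captions of Figures 4 and 5 can be sketched as follows: for each threshold, count the share of records whose privacy loss is at most that threshold. The loss values and thresholds below are made up for illustration and are not taken from the paper's data.

```python
from bisect import bisect_right


def empirical_cdf(losses, thresholds):
    """For each threshold, return the proportion of losses that are <= that threshold."""
    ordered = sorted(losses)
    n = len(ordered)
    # bisect_right gives the number of sorted values less than or equal to t.
    return [bisect_right(ordered, t) / n for t in thresholds]


# Hypothetical per-record privacy losses and thresholds.
losses = [0.1, 0.4, 0.4, 2.0, 9.5]
thresholds = [0.0, 0.4, 1.0, 10.0]

cdf = empirical_cdf(losses, thresholds)
print(list(zip(thresholds, cdf)))
```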

Theorems & Definitions (76)

Definition 2.1: Rényi Divergence (Rényi, 1961)
Definition 2.2: Neighboring Databases
Definition 2.3: Zero-Concentrated Differential Privacy
Theorem 2.4: zCDP Sequential Composition (Bun and Steinke, 2016)
Theorem 2.5: zCDP Parallel Composition (Bun and Steinke, 2016)
Theorem 2.6: zCDP Post-processing (Bun and Steinke, 2016)
Definition 2.7: $\ell_2$-Sensitivity
Definition 2.8: Gaussian Mechanism (Bun and Steinke, 2016)
Theorem 2.9: zCDP Group Privacy (Bun and Steinke, 2016)
Definition 2.10: Record-dependent policy function
...and 66 more
