Privately Answering Queries on Skewed Data via Per Record Differential Privacy

Jeremy Seeman, William Sexton, David Pujol, Ashwin Machanavajjhala

TL;DR

This work introduces per-record zero-concentrated DP (PRzCDP), a privacy framework in which a record’s privacy loss is a function of that record’s confidential value. A public policy function P maps hypothetical records to their maximum allowable privacy loss, while the losses actually incurred depend on the confidential data, which allows better utility for skewed or heavy-tailed statistics. The authors propose unit splitting, a preprocessing step that reduces a PRzCDP requirement to running standard zCDP mechanisms on split data, giving a constructive and flexible way to publish private SQL-style aggregates with reduced sensitivity. Empirical results on simulated, CIS, and CBP datasets show substantial utility improvements over global zCDP on skewed data workloads, supporting PRzCDP as a practical approach for data products with influential outliers, such as county-level payrolls and establishment counts. The work also outlines future directions: stronger semantic guarantees and extensions to interactive query settings.

Abstract

We consider the problem of privately releasing statistics (like aggregate payrolls) where it is critical to preserve the contribution made by a small number of outlying large entities. We propose a privacy formalism, per-record zero-concentrated differential privacy (PRzCDP), where the privacy loss associated with each record is a public function of that record's value. Unlike other formalisms that provide different privacy losses to different records, PRzCDP's privacy loss depends explicitly on the confidential data. We define our formalism, derive its properties, and propose mechanisms satisfying PRzCDP that are uniquely suited to publishing skewed or heavy-tailed statistics, where a small number of records contribute substantially to query answers. This targeted relaxation helps overcome the difficulties of applying standard DP to these data products.
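
The unit-splitting idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the equal-piece splitting rule, and the threshold parameter are all hypothetical, and neighboring databases are assumed to differ by adding or removing one record. Under those assumptions, a record split into $k$ pieces is protected as a group of size $k$, so zCDP group privacy bounds its loss by $k^2\rho$, while a Gaussian-noised sum over the split data only needs sensitivity equal to the threshold.

```python
import math
import random


def split_record(value, threshold):
    """Split a nonnegative value into k equal pieces, each at most `threshold`.

    Hypothetical splitting rule, chosen for illustration only.
    """
    k = max(1, math.ceil(value / threshold))
    return [value / k] * k


def policy_loss(value, threshold, rho):
    """Per-record loss bound k^2 * rho implied by zCDP group privacy,
    where k is the number of pieces the record is split into."""
    k = len(split_record(value, threshold))
    return k * k * rho


def noisy_sum_on_split_data(values, threshold, rho, rng):
    """Gaussian mechanism on the split data (assumes add/remove neighbors).

    Every piece is at most `threshold`, so the sum over pieces has
    l2-sensitivity `threshold` with respect to a single piece.
    """
    pieces = [p for v in values for p in split_record(v, threshold)]
    # Guard against floating-point overshoot of the threshold.
    total = sum(min(p, threshold) for p in pieces)
    sigma = threshold / math.sqrt(2.0 * rho)
    return total + rng.gauss(0.0, sigma)


if __name__ == "__main__":
    rng = random.Random(0)
    payrolls = [3.0, 7.5, 25.0, 400.0]  # made-up skewed values
    print(noisy_sum_on_split_data(payrolls, threshold=10.0, rho=0.5, rng=rng))
    for v in payrolls:
        print(v, policy_loss(v, threshold=10.0, rho=0.5))
```

In this sketch small records keep the baseline loss $\rho$, while large records trade a higher loss bound for noise calibrated to the threshold rather than to the largest possible value.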

Paper Structure

This paper contains 26 sections, 14 theorems, 34 equations, 5 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Let $M_1, M_2$ be randomized mechanisms which satisfy $\rho_1$-zCDP and $\rho_2$-zCDP respectively. Then the mechanism $M'(D) = (M_1(D), M_2(D))$ satisfies $(\rho_1 + \rho_2)$-zCDP.
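
As a rough illustration of sequential composition, the sketch below releases a noisy count and a noisy clipped sum from the same data and reports the combined budget $\rho_1 + \rho_2$. It relies on the standard fact that adding $N(0, \sigma^2)$ noise with $\sigma^2 = \Delta^2 / (2\rho)$ to a query of $\ell_2$-sensitivity $\Delta$ satisfies $\rho$-zCDP. The function names, the clipping bound, and the add/remove notion of neighboring databases are assumptions made for this example.

```python
import math
import random


def gaussian_mechanism(true_value, l2_sensitivity, rho, rng):
    """Add N(0, sigma^2) noise with sigma^2 = Delta^2 / (2 * rho) (rho-zCDP)."""
    sigma = l2_sensitivity / math.sqrt(2.0 * rho)
    return true_value + rng.gauss(0.0, sigma)


def compose_sequential(rhos):
    """Total zCDP budget of mechanisms run on the same data: the sum of rhos."""
    return sum(rhos)


def release_count_and_sum(values, clip, rho_count, rho_sum, rng):
    """Release (noisy count, noisy clipped sum) and the composed budget.

    Assumes add/remove neighbors, so the count has sensitivity 1 and the
    sum of values clipped to [0, clip] has sensitivity `clip`.
    """
    clipped = [min(max(v, 0.0), clip) for v in values]
    noisy_count = gaussian_mechanism(len(clipped), 1.0, rho_count, rng)
    noisy_sum = gaussian_mechanism(sum(clipped), clip, rho_sum, rng)
    return (noisy_count, noisy_sum), compose_sequential([rho_count, rho_sum])


if __name__ == "__main__":
    rng = random.Random(0)
    outputs, total_rho = release_count_and_sum(
        [1.0, 2.0, 50.0], clip=10.0, rho_count=0.1, rho_sum=0.4, rng=rng
    )
    print(outputs, total_rho)
```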

Figures (5)

  • Figure 1: Theoretical MSE over the expected query value for global $\rho$-zCDP mechanisms with different global sensitivities $\Delta$, privacy loss budgets $\rho$, and different tail parameters $\alpha$ for $n = 1000$. Optimal $\Delta$ for minimizing MSE given $\rho$ and $\alpha$ shown in red. Blue dashed line at 1, for reference.
  • Figure 2: (Left) distribution of ARE over workload queries (y-axis) by proportion of records with policy loss greater than $\rho$ (x-axis). (Right) Empirical CDFs of policy loss, i.e. proportion of observed records (y-axis) with policy loss bounded by $P(r)$ (x-axis). Columnar subplots show different levels of minimum policy loss $\rho$. Red line represents 100% ARE and green line represents 10% ARE.
  • Figure 3: AREs for the CBP query workload using topcoding and zCDP (left) versus using unit splitting (right) for different NAICS levels (rows) and splitting schemes (columns). Red line represents 20% ARE and green line represents 5% ARE.
  • Figure 4: Theoretical minimum policy function CDFs to achieve different fitness-for-use goals on 95% of the COUNTY by NAICS code query workload. The green dashed line represents the total unsplit privacy loss budget of 1.
  • Figure 5: CBP policy losses grouped by establishment (left) and firm (right)

Theorems & Definitions (27)

  • Definition 1: Neighboring Databases
  • Definition 2: $\ell_2$-Sensitivity
  • Definition 3: Zero-Concentrated Differential Privacy
  • Theorem 1: zCDP Sequential Composition [bun_concentrated_2016]
  • Theorem 2: zCDP Parallel Composition [bun_concentrated_2016]
  • Theorem 3: zCDP Post-processing [bun_concentrated_2016]
  • Definition 4: Gaussian Mechanism [bun_concentrated_2016]
  • Theorem 4: zCDP Group Privacy
  • Definition 5: Record-dependent policy function
  • Definition 6: $P$-per-record zero-Concentrated DP ($P$-PRzCDP)
  • ...and 17 more