Privacy Amplification for Synthetic data using Range Restriction

Monika Hu, Matthew R. Williams, Terrance D. Savitsky

TL;DR

The paper proposes range-restricted privacy standards for synthetic data by conditioning the risk-weighted pseudo posterior mechanism on owner-defined sensitive ranges, enabling privacy amplification by restricting protection to a subspace of values. It formalizes two approaches: range-averaged privacy, which uses distributional information within the sensitive range, and range-truncated privacy, which relies on range endpoints. Both lead to per-record adjustments that tighten the Lipschitz sensitivity in the asymptotic differential privacy (aDP) regime. Through simulations and an accelerated life testing application, the authors show that these range-restricted schemes can achieve stronger privacy for the same budget and, in many settings, improve utility, with tunable trade-offs via the width of the sensitive range and tail assignments. The framework generalizes to a unifying γ-based formulation, connects to aDP and Pufferfish-style privacy notions, and offers practical flexibility for data disseminators to tailor protection to subsets of the data while preserving utility. Overall, the work demonstrates concrete pathways to amplify privacy by incorporating publicly known information about sensitive ranges into model-based synthetic data generation.

Abstract

We introduce a new class of range-restricted formal data privacy standards that condition on owner beliefs about sensitive data ranges. By incorporating this additional information, we can provide a stronger privacy guarantee (i.e., an amplification). The range-restricted formal privacy standards protect only a subset (or ball) of data values and exclude ranges (or balls) believed to be already publicly known. The privacy standards are designed for the risk-weighted pseudo posterior (model) mechanism (PPM) used to generate synthetic data under an asymptotic differential privacy (aDP) guarantee. The PPM downweights the likelihood contribution for each record proportionally to its disclosure risk. The PPM is adapted under inclusion of beliefs by adjusting the risk-weighted pseudo likelihood. We introduce two alternative adjustments. The first expresses data owner knowledge of the sensitive range as a probability, $\lambda$, that a datum value drawn from the underlying generating distribution lies outside the ball or subspace of values that are sensitive. The portion of each datum likelihood contribution deemed sensitive is then $(1-\lambda) \leq 1$ and is the only portion of the likelihood subject to risk down-weighting. The second adjustment encodes knowledge as the difference in probability masses $P(R) \leq 1$ between the edges of the sensitive range, $R$. We use the resulting conditional (pseudo) likelihood for a sensitive record, which boosts its worst-case tail values away from 0. We compare privacy and utility properties for the PPM under the aDP and range-restricted privacy standards.
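The second (range-truncated) adjustment can be illustrated with a minimal sketch. It is not the authors' implementation: the normal model, the parameter values, and all function names below are assumptions chosen for illustration, and the range $(0.4, 1.8)$ is borrowed from the figure captions only as an example. The sketch conditions each record's density on the sensitive range $R$ by dividing by $P(R)$, then applies a per-record risk weight $\alpha_i$ as an exponent on the likelihood (a multiplier on the log scale).

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of a Normal(mu, sigma) model (an assumed model, for illustration)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def normal_cdf(x, mu, sigma):
    """CDF of Normal(mu, sigma), computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def truncated_logpdf(x, mu, sigma, lower, upper):
    """Log density conditional on x lying in the sensitive range R = (lower, upper).

    P(R) is the difference in probability mass between the range endpoints.
    Because P(R) <= 1, the conditional log density is never below the
    unconditional one, which lifts small tail values away from 0.
    """
    if not (lower < x < upper):
        raise ValueError("x must lie inside the sensitive range")
    mass = normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)
    return normal_logpdf(x, mu, sigma) - math.log(mass)

def truncated_pseudo_loglik(xs, alphas, mu, sigma, lower, upper):
    """Risk-weighted pseudo log-likelihood: sum_i alpha_i * log p(x_i | theta, R).

    alphas are hypothetical per-record weights in [0, 1]; how they are
    derived from disclosure risk is not shown here.
    """
    return sum(a * truncated_logpdf(x, mu, sigma, lower, upper)
               for x, a in zip(xs, alphas))

if __name__ == "__main__":
    lower, upper = 0.4, 1.8          # example sensitive range
    mu, sigma = 1.0, 0.5             # assumed model parameters
    x_tail = 1.7                     # a record near the edge of R
    print("unconditional log density:", normal_logpdf(x_tail, mu, sigma))
    print("range-truncated log density:", truncated_logpdf(x_tail, mu, sigma, lower, upper))
    print("pseudo log-likelihood:",
          truncated_pseudo_loglik([0.6, 1.0, x_tail], [1.0, 1.0, 0.3], mu, sigma, lower, upper))
```

Narrowing the range shrinks $P(R)$ and so raises the conditional density of every record inside it, which is one way to see the abstract's remark that truncation moves worst-case tail values away from 0.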

Paper Structure (15 sections, 2 theorems, 27 equations, 20 figures, 1 table)

This paper contains 15 sections, 2 theorems, 27 equations, 20 figures, 1 table.

Key Result

Theorem 1

$\forall \mathbf{x} \in \mathcal{X}^{n}, \mathbf{x} \in \mathcal{X}^{n-1}:\delta(\mathbf{x}, \mathbf{x}) = 1, B \in \beta_{\Theta}$ (where $\beta_{\Theta}$ is the $\sigma$-algebra of measurable sets on $\Theta$) under $\bm{\alpha}(\cdot)$ with $\Delta_{\bm{\alpha},\bm{\lambda}, \mathbf{x}} > 0$, i.e. the pseudo posterior $\xi^{\bm{\lambda}^c \bm{\alpha}(\mathbf{x})}(\cdot \mid \mathbf{x})$ has lo

Figures (20)

  • Figure 1: Violin plots of by-record Lipschitz bounds for Unweighted, $(-\infty, \infty)$ (i.e., no bounds, the same as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over a single sample.
  • Figure 2: Violin plots of Lipschitz bounds for Unweighted, $(-\infty, \infty)$ (i.e., no bounds, the same as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over 100 repeated samples.
  • Figure 3: Violin plots of the average ECDF metric for Unweighted, $(-\infty, \infty)$ (i.e., no bounds, the same as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over 100 repeated samples.
  • Figure 4: Violin plots of Q90s for Unweighted, $(-\infty, \infty)$ (i.e., no bounds, the same as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over 100 repeated samples.
  • Figure 5: Violin plots of Lipschitz bounds for Unweighted, $(-\infty, \infty)$ (i.e., no bounds, the same as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over 100 repeated samples and four sample sizes $n \in \{200, 400, 1600, 6400\}$.
  • ...and 15 more figures

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Theorem 1
  • proof
  • Definition 3
  • Definition 4
  • Theorem 2
  • proof