Privacy Amplification for Synthetic data using Range Restriction

Monika Hu; Matthew R. Williams; Terrance D. Savitsky

TL;DR

The paper proposes range-restricted privacy standards for synthetic data by conditioning the risk-weighted pseudo posterior mechanism on owner-defined sensitive ranges, which amplifies privacy by restricting protection to a subspace of values. It formalizes two approaches: range-averaged privacy, which uses distributional information within the sensitive range, and range-truncated privacy, which relies on the range endpoints. Both lead to per-record adjustments that tighten the Lipschitz sensitivity in the asymptotic differential privacy (aDP) regime. Through simulations and an accelerated life testing application, the authors show that these range-restricted schemes can achieve stronger privacy for the same budget and, in many settings, improve utility, with trade-offs tunable through the width of the sensitive range and tail assignments. The framework generalizes to a unifying γ-based formulation, connects to aDP and Pufferfish-style privacy notions, and gives data disseminators practical flexibility to tailor protection to subsets of the data while preserving utility. Overall, the work demonstrates concrete ways to amplify privacy by incorporating publicly known information about sensitive ranges into model-based synthetic data generation.
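As a rough illustration of the Lipschitz sensitivity mentioned above, the sketch below computes a by-record bound as the largest absolute weighted log-likelihood over a set of parameter draws, then takes the maximum over records as the overall bound. The function name `by_record_lipschitz`, the toy normal model, the data values, the weights, and the parameter draws are all assumptions made for illustration; this is not the authors' code.

```python
import math
from statistics import NormalDist

def by_record_lipschitz(data, weights, theta_draws):
    """For each record, max over draws of |alpha_i * log p(x_i | theta)|.

    Hypothetical helper: assumes a normal likelihood with theta = (mu, sigma).
    """
    bounds = []
    for x, alpha in zip(data, weights):
        worst = max(
            abs(alpha * math.log(NormalDist(mu, sigma).pdf(x)))
            for mu, sigma in theta_draws
        )
        bounds.append(worst)
    return bounds

# Toy inputs (assumed): the third record sits far in the tail.
data = [0.1, 1.0, 4.0]
theta_draws = [(1.0, 0.5), (1.1, 0.6)]

unweighted = by_record_lipschitz(data, [1.0, 1.0, 1.0], theta_draws)
# Risk weights in [0, 1]; the tail record is down-weighted the most.
weighted = by_record_lipschitz(data, [0.8, 1.0, 0.1], theta_draws)

overall_unweighted = max(unweighted)
overall_weighted = max(weighted)
```

In this toy setting the tail record dominates the unweighted bound, so down-weighting it shrinks the overall bound; a smaller bound corresponds to a stronger guarantee.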

Abstract

We introduce a new class of range-restricted formal data privacy standards that condition on owner beliefs about sensitive data ranges. Incorporating this additional information yields a stronger privacy guarantee (e.g., an amplification). The range-restricted formal privacy standards protect only a subset (or ball) of data values and exclude ranges (or balls) believed to be already publicly known. The privacy standards are designed for the risk-weighted pseudo posterior (model) mechanism (PPM) used to generate synthetic data under an asymptotic differential privacy (aDP) guarantee. The PPM downweights the likelihood contribution of each record in proportion to its disclosure risk. We adapt the PPM to include these beliefs by adjusting the risk-weighted pseudo likelihood, and we introduce two alternative adjustments. The first expresses data owner knowledge of the sensitive range as a probability, $\lambda$, that a datum value drawn from the underlying generating distribution lies outside the ball or subspace of values that are sensitive. The portion of each datum likelihood contribution deemed sensitive is then $(1-\lambda) \leq 1$ and is the only portion of the likelihood subject to risk down-weighting. The second adjustment encodes knowledge as the difference in probability masses, $P(R) \leq 1$, between the edges of the sensitive range, $R$. We use the resulting conditional (pseudo) likelihood for a sensitive record, which boosts its worst-case tail values away from 0. We compare privacy and utility properties for the PPM under the aDP and range-restricted privacy standards.
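The two adjustments can be sketched for a single record as follows. This is one possible reading of the abstract, not the paper's stated formulas: the exponent $\lambda + (1-\lambda)\alpha$ for the range-averaged form, the choice to leave records outside $R$ at full weight in the range-truncated form, the normal model, and all names and numbers are assumptions for illustration.

```python
import math
from statistics import NormalDist

def log_lik(x, model):
    """Log density of x under the (assumed) normal model."""
    return math.log(model.pdf(x))

def range_averaged(x, alpha, lam, model):
    # Assumed form: only the sensitive share (1 - lam) of the log-likelihood
    # is down-weighted by the risk weight alpha; the share lam is left as is.
    return (lam + (1.0 - lam) * alpha) * log_lik(x, model)

def range_truncated(x, alpha, lower, upper, model):
    # P(R): difference in probability mass between the edges of the range.
    p_r = model.cdf(upper) - model.cdf(lower)
    if lower < x < upper:
        # Conditional likelihood of a sensitive record, then risk-weighted.
        return alpha * (log_lik(x, model) - math.log(p_r))
    # Assumption: records outside R are not down-weighted.
    return log_lik(x, model)

model = NormalDist(mu=1.0, sigma=0.5)   # toy generating model (assumed)
x, alpha = 1.7, 0.3                     # one record and its risk weight
lower, upper = 0.4, 1.8                 # owner-defined sensitive range R
lam = 1.0 - (model.cdf(upper) - model.cdf(lower))  # mass outside R

plain = alpha * log_lik(x, model)       # ordinary risk-weighted contribution
averaged = range_averaged(x, alpha, lam, model)
truncated = range_truncated(x, alpha, lower, upper, model)
```

Under these assumptions, setting `lam = 0` makes the range-averaged form reduce to the ordinary risk-weighted contribution, and dividing by $P(R) < 1$ raises the log-likelihood of a record inside $R$, which matters most for tail values whose density is near 0.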

Paper Structure (15 sections, 2 theorems, 27 equations, 20 figures, 1 table)

Introduction
Review of the Pseudo Posterior Mechanism from Savitsky, Williams, and Hu (2020)
Probability of Record Inclusion in Sensitive Range
Range-averaged Formal Privacy
Truncation of Record Inclusion in Sensitive Range
Range-truncated Formal Privacy
General Formulation for Range-restricted Privacy
Simulation studies
Privacy guarantee strengthens under restricted sensitive range
Restricting sensitive ranges achieves higher utility for the same privacy budget
Assigning wider sensitive range in distribution tail than mode
Application to An Accelerated Life Testing Dataset
Concluding Remarks
Additional utility results from the third simulation study
Additional plots from the application section

Key Result

Theorem 1

$\forall \mathbf{x} \in \mathcal{X}^{n}, \mathbf{y} \in \mathcal{X}^{n-1}: \delta(\mathbf{x}, \mathbf{y}) = 1, B \in \beta_{\Theta}$ (where $\beta_{\Theta}$ is the $\sigma$-algebra of measurable sets on $\Theta$) under $\bm{\alpha}(\cdot)$ with $\Delta_{\bm{\alpha},\bm{\lambda}, \mathbf{x}} > 0$, i.e., the pseudo posterior $\xi^{\bm{\lambda}^c \bm{\alpha}(\mathbf{x})}(\cdot \mid \mathbf{x})$ has lo

Figures (20)

Figure 1: Violin plots of by-record Lipschitz bounds of Unweighted, (-Inf, Inf) (i.e., no bounds as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over a single sample.
Figure 2: Violin plots of Lipschitz bounds of Unweighted, (-Inf, Inf) (i.e., no bounds as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over 100 repeated samples.
Figure 3: Violin plots of average metric of ECDF of Unweighted, (-Inf, Inf) (i.e., no bounds as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over 100 repeated samples.
Figure 4: Violin plots of Q90s of Unweighted, (-Inf, Inf) (i.e., no bounds as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over 100 repeated samples.
Figure 5: Violin plots of Lipschitz bounds of Unweighted, (-Inf, Inf) (i.e., no bounds as Weighted), $(0.4, 1.8)$ averaged, $(0.4, 1.8)$ truncated, $(0.6, 1.2)$ averaged, and $(0.6, 1.2)$ truncated, over 100 repeated samples and four different sample sizes $n = \{200, 400, 1600, 6400\}$.
...and 15 more figures

Theorems & Definitions (8)

Definition 1
Definition 2
Theorem 1
proof
Definition 3
Definition 4
Theorem 2
proof
