Improving Statistical Privacy by Subsampling

Dennis Breutigam, Rüdiger Reischuk

TL;DR

This work addresses how subsampling can amplify statistical privacy (SP) when adversaries have only distributional knowledge of database entries, rather than worst-case access as in differential privacy (DP). It develops a framework based on sampling templates, Markov kernels, and Sampling Privacy Curves (SPC), leveraging $\alpha$-divergence to bound privacy loss and connect SP to DP-style trade-offs. The authors derive explicit amplification bounds for three subsampling regimes (without replacement, Poisson, and with replacement) and introduce the notion of $F$-samplable distributions to handle dependencies. The results provide quantitative guidance on privacy-utility trade-offs under SP and establish a bridge to DP via trade-off functions, suggesting practical benefits of subsampling in privacy-preserving data analysis under less adversarial assumptions.

Abstract

Differential privacy (DP) considers a scenario where an adversary has almost complete information about the entries of a database. This worst-case assumption is likely to overestimate the privacy threat for an individual in real life. Statistical privacy (SP) denotes a setting where only the distribution of the database entries is known to an adversary, but not their exact values. In this case one has to analyze the interaction between noiseless privacy, which is based on the entropy of distributions, and privacy mechanisms that distort the answers of queries; this interaction can be quite complex. A privacy mechanism often used is to take samples of the data for answering a query. This paper proves precise bounds on how much different methods of sampling increase privacy in the statistical setting with respect to database size and sampling rate. These bounds allow us to deduce when and by how much sampling provides an improvement, and how far this depends on the privacy parameter ε. To perform these investigations we develop a framework to model sampling techniques. For the DP setting, tradeoff functions have been proposed as a finer measure of privacy than (ε,δ)-pairs. We apply these tools to statistical privacy with subsampling to get a comparable characterization.
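To make the three sampling methods named in the summary concrete, the following is a minimal Python sketch, not the paper's formal sampling templates. The function names, the choice of sample size `m = round(lam * n)`, and the toy database of independent Boolean entries with property probability 1/2 are illustrative assumptions.

```python
import random

def sample_without_replacement(db, m, rng):
    """Draw exactly m distinct positions uniformly at random."""
    return rng.sample(db, m)

def sample_poisson(db, lam, rng):
    """Include each entry independently with probability lam."""
    return [x for x in db if rng.random() < lam]

def sample_with_replacement(db, m, rng):
    """Draw m entries independently and uniformly; repeats are possible."""
    return [rng.choice(db) for _ in range(m)]

def property_query(sample):
    """Count how many sampled entries have the property."""
    return sum(sample)

rng = random.Random(0)
n, lam = 1000, 0.1            # database size and sampling rate (assumed values)
m = round(lam * n)            # assumed relation between rate and sample size

# Toy database: each entry has the property independently with p = 1/2.
db = [rng.random() < 0.5 for _ in range(n)]

wor = sample_without_replacement(db, m, rng)
poi = sample_poisson(db, lam, rng)
wr = sample_with_replacement(db, m, rng)

for name, s in [("without replacement", wor), ("Poisson", poi), ("with replacement", wr)]:
    print(name, "sample size:", len(s), "query answer:", property_query(s))
```

The first and third methods return a fixed number of entries, while Poisson sampling returns a random number of entries whose expectation is `lam * n`.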

Paper Structure

This paper contains 13 sections, 16 theorems, 58 equations, 3 figures.

Key Result

Lemma

For a database distribution $\mu$, a query $F$, a sampling technique ${\cal T}$, and $S \subseteq A$ it holds

Figures (3)

  • Figure 1: $\delta$ values for property queries with $p=1/2$ for different $\varepsilon$ values. To ensure that the curves of $\varepsilon = 1$, $\varepsilon = 0.3$, and $\varepsilon = 0.1$ are better differentiated from each other, an upward correction of $0.001$ was made for $\varepsilon = 0.3$ and $0.002$ for $\varepsilon = 0.1$.
  • Figure 2: Ratio of the $\delta$ parameter with, resp. without subsampling given a database of size $n=1000$ for property queries with $p=1/2$ and different sampling rates $\lambda$. Here $\varepsilon$ takes the values $0.1$, $0.075$, $0.05$ and $0.025$.
  • Figure 3: Ratio of the $\delta$ parameter for Poisson subsampling given a database of size $n=100$ for property queries with $p=1/2$ and different sampling rates $\lambda$. Here $\varepsilon$ takes the values $0.1, 0.075, 0.05, 0.025$.

Theorems & Definitions (40)

  • definition
  • definition
  • definition
  • definition
  • definition
  • definition
  • lemma
  • proof
  • lemma
  • proof
  • ...and 30 more