Bounding User Contributions for User-Level Differentially Private Mean Estimation

V. Arvind Rameshwar, Anshoo Tandon

TL;DR

The paper studies how to release the sample mean of bounded samples under user-level $\varepsilon$-DP when users contribute different numbers of samples (data heterogeneity). It adopts a distribution-independent worst-case error metric and identifies the optimal bounding (clipping) strategy within a canonical class of processing strategies. The theoretical results characterize this optimal strategy in terms of the per-user contribution counts and the privacy budget, while experiments show that, even for i.i.d. samples, the proposed strategy achieves lower average-case error than the widely used bounding strategy of Amin et al. (2019). The work targets realistic, non-i.i.d. data settings and suggests extending the worst-case analysis to other statistics.

Abstract

We revisit the problem of releasing the sample mean of bounded samples in a dataset, privately, under user-level $\varepsilon$-differential privacy (DP). We aim to derive the optimal method of preprocessing data samples, within a canonical class of processing strategies, in terms of the error in estimation. Typical error analyses of such bounding (or clipping) strategies in the literature assume that the data samples are independent and identically distributed (i.i.d.), and sometimes also that all users contribute the same number of samples (data homogeneity) -- assumptions that do not accurately model real-world data distributions. Our main result in this work is a precise characterization of the preprocessing strategy that gives rise to the smallest worst-case error over all datasets -- a distribution-independent error metric -- while allowing for data heterogeneity. We also show via experimental studies that even for i.i.d. real-valued samples, our clipping strategy performs much better, in terms of average-case error, than the widely used bounding strategy of Amin et al. (2019).

Paper Structure

This paper contains 14 sections, 8 theorems, 24 equations, 3 figures.

Key Result

Theorem 2.1

For any $g: \mathsf{D}\to \mathbb{R}^d$, the mechanism $M^{\text{Lap}}_g: \mathsf{D}\to \mathbb{R}^d$ defined by $M^{\text{Lap}}_g(\mathcal{D}_1) = g(\mathcal{D}_1)+\mathbf{Z}$, where $\mathbf{Z} = (Z_1,\ldots,Z_d)$ is such that $Z_i\stackrel{\text{i.i.d.}}{\sim} \text{Lap}(\Delta_g/\varepsilon)$, …
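
The noise-addition step in Theorem 2.1 can be illustrated with a minimal Python sketch. It is not the authors' code: the function names `laplace_noise` and `laplace_mechanism` are hypothetical, and the sensitivity $\Delta_g$ is assumed to be supplied by the caller rather than computed from the dataset. Each coordinate of $g(\mathcal{D}_1)$ receives independent $\text{Lap}(\Delta_g/\varepsilon)$ noise, drawn here as the difference of two i.i.d. exponential variables.

```python
import random


def laplace_noise(scale, rng):
    """Draw one Lap(scale) sample as the difference of two i.i.d. exponentials."""
    if scale == 0.0:
        return 0.0  # zero sensitivity: no noise is needed
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(g_value, sensitivity, epsilon, rng=None):
    """Return g_value + Z, with Z_i i.i.d. Lap(sensitivity / epsilon).

    g_value     -- list of floats, the query output g(D) in R^d
    sensitivity -- assumed upper bound on the sensitivity Delta_g of g
    epsilon     -- privacy budget, must be positive
    """
    if sensitivity < 0 or epsilon <= 0:
        raise ValueError("need sensitivity >= 0 and epsilon > 0")
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in g_value]


# Usage: privatize a 2-dimensional query output with a hypothetical
# sensitivity of 0.1 and privacy budget epsilon = 1.
noisy = laplace_mechanism([0.42, 0.58], sensitivity=0.1, epsilon=1.0)
print(noisy)
```

The noise scale $\Delta_g/\varepsilon$ grows with the sensitivity, which is why bounding each user's contribution before computing the mean can reduce the noise that must be added.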

Figures (3)

  • Figure 1: The annulus $\mathsf{A}_{a_j^{(\ell)},b_j^{(\ell)}}$, for $d=2$, shown in blue. Here, the points $q_1$, $q_2$ equal $\mathsf{A}_{a_j^{(\ell)},b_j^{(\ell)}}(p_1)$ and $\mathsf{A}_{a_j^{(\ell)},b_j^{(\ell)}}(p_2)$, respectively.
  • Figure 2: Average-case errors using a geometric collection of $\{m_\ell\}$ values and uniform samples
  • Figure 3: Average-case errors using an extreme-valued collection of $\{m_\ell\}$ values and projected Gaussian samples

Theorems & Definitions (16)

  • Definition 2.1
  • Definition 2.2
  • Theorem 2.1
  • Proposition 3.1
  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof
  • Lemma 3.3
  • proof
  • ...and 6 more