Nebula: Efficient, Private and Accurate Histogram Estimation

Ali Shahin Shamsabadi, Peter Snyder, Ralph Giles, Aurélien Bellet, Hamed Haddadi

TL;DR

Nebula tackles private distributed histogram estimation under an adversarial setting by combining sampling, thresholding, and dummy data with a secret-sharing protocol that uses two non-colluding untrusted servers. It employs a Verifiable Oblivious PRF for randomness, tau-out-of-N secret sharing for client data, and dummy data injected via a truncated discrete Laplace mechanism to achieve $(\varepsilon,\delta)$-DP without trusted third parties. The approach extends to nested, high-dimensional marginal histograms (Nested-Nebula) and demonstrates strong utility, efficiency, and scalability across real datasets, including Census and Shakespeare. The work provides formal privacy and cryptographic analyses, empirical evaluations, and an open-source implementation, offering a practical path to private, scalable histogram estimation in distributed settings.

Abstract

We present \textit{Nebula}, a system for differentially private histogram estimation on data distributed among clients. \textit{Nebula} allows clients to independently decide whether to participate in the system, and locally encode their data so that an untrusted server only learns data values whose multiplicity exceeds a predefined aggregation threshold, with $(\varepsilon,\delta)$ differential privacy guarantees. Compared to existing systems, \textit{Nebula} uniquely achieves: \textit{i)} a strict upper bound on client privacy leakage; \textit{ii)} significantly higher utility than standard local differential privacy systems; and \textit{iii)} no requirement for trusted third parties, multi-party computation, or trusted hardware. We provide a formal evaluation of \textit{Nebula}'s privacy, utility and efficiency guarantees, along with an empirical assessment on three real-world datasets. On the United States Census dataset, clients can submit their data in just 0.0036 seconds and 0.0016 MB (\textbf{efficient}), under strong $(\varepsilon=1,\delta=10^{-8})$ differential privacy guarantees (\textbf{private}), enabling \textit{Nebula}'s untrusted aggregation server to estimate histograms with over 88\% better utility than existing local differential privacy deployments (\textbf{accurate}). Additionally, we describe a variant that allows clients to submit multi-dimensional data, with similar privacy, utility, and performance. Finally, we provide an implementation of \textit{Nebula}.
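
The sampling and thresholding idea described in the abstract can be illustrated with a small simulation. This is only a plaintext sketch, not the Nebula protocol: it omits the secret sharing, the oblivious PRF, and the dummy data, and the names `simulate_round`, `p_s`, and `tau` are chosen here for illustration. It also assumes a value is revealed once its count reaches `tau`.

```python
import random
from collections import Counter


def simulate_round(values, p_s, tau, seed=0):
    """Plaintext illustration of sampling plus thresholding.

    values: one data value per client (hypothetical input format).
    p_s:    probability with which each client submits.
    tau:    aggregation threshold (assumed here: reveal when count >= tau).
    """
    rng = random.Random(seed)
    # Each client independently decides whether to participate.
    submitted = [v for v in values if rng.random() < p_s]
    counts = Counter(submitted)
    # The server only learns values that were submitted often enough.
    return {v: c for v, c in counts.items() if c >= tau}


if __name__ == "__main__":
    clients = ["a"] * 5 + ["b"] * 2
    print(simulate_round(clients, p_s=1.0, tau=3))
```

In this toy run every client participates, so only the value held by at least three clients is reported and the rarer value stays hidden.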

Paper Structure (20 sections, 7 theorems, 7 equations, 9 figures, 4 tables)

This paper contains 20 sections, 7 theorems, 7 equations, 9 figures, 4 tables.

Key Result

Theorem 1

Consider $N$ clients generating a dataset $D=\{x_i\}_{i=1}^N$. Let $\varepsilon_{\text{Unre}}$ be the privacy budget used in the creation of dummy data (Algorithm alg:dummy). For $\varepsilon_{\text{Re}}>0$ and $\delta_{\text{Re}}\in(0,1)$, let ${p_s=\alpha (1-e^{-\varepsilon_{\text{Re}}})}$ and ${\
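
As a small worked example of the one formula that survives in the statement above, the sampling probability $p_s=\alpha(1-e^{-\varepsilon_{\text{Re}}})$ can be evaluated numerically. The function name and the example values are assumptions made here; the role and admissible range of $\alpha$ are not visible in the truncated statement, so the sketch takes it as a given scalar.

```python
import math


def sampling_probability(alpha, eps_re):
    """Evaluate p_s = alpha * (1 - exp(-eps_re)).

    alpha:  scalar taken as given (its definition is not shown in the excerpt).
    eps_re: the budget written as epsilon_Re in the theorem, required to be > 0.
    """
    if eps_re <= 0:
        raise ValueError("eps_re must be positive")
    return alpha * (1.0 - math.exp(-eps_re))


if __name__ == "__main__":
    # Illustrative values only, not taken from the paper.
    print(sampling_probability(alpha=1.0, eps_re=1.0))
```

For a fixed $\alpha$, a smaller $\varepsilon_{\text{Re}}$ gives a smaller submission probability.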

Figures (9)

  • Figure 1: Nebula's output to the aggregation server consists of a histogram $\mathcal{H}$ of multiplicities where $\mathcal{H}_i$ represents the number of submissions with the same tag, with multiplicity $i$ and $i \in [m]$. This histogram is obtained based on submissions that each client sent with probability $p_s$ (empty bar) and dummy data (hatched bar).
  • Figure 2: Ideal functionality for Nebula.
  • Figure 3: Utility of Nebula compared with local, shuffle and central differential privacy applied to the Shakespeare dataset, as a function of histogram bins, using an $\varepsilon=1$ DP privacy guarantee. The word-frequency estimate of Nebula is more accurate than local and shuffle DP, while removing the trust in the server that central DP models require.
  • Figure 4: Improving the utility of Nebula in estimating the histogram on the multi-attribute IPUMS and Foursquare datasets using multi-dimensional data encoding, Nested Nebula (Algorithm \ref{alg:client-nested-encoding}). IPUMS contains 5 attributes: S (Sex), M (Marriage status), R (Race), E (Education), A (Age). The Foursquare dataset contains 8 prefixes: $\mathbf{x}^{(1)}=[x_1]$, $\mathbf{x}^{(2)}=[x_1,x_2]$, $\cdots$, $\mathbf{x}^{(8)}=[x_1,\cdots,x_{8}]$. We compute the utility as the absolute error between the original histogram and the estimated histogram. Multi-dimensional data encoding significantly improves the absolute error of each marginal histogram (i.e., histogram of each sequence of joint attributes).
  • Figure 5: Original and estimated histograms obtained privately by Nebula on the Foursquare dataset. The private histogram estimated by Nebula is close to the histogram of the original data. Nebula also preserves the relative order across values.
  • ...and 4 more figures
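
The histogram of multiplicities described in the Figure 1 caption can be sketched as follows. This assumes one reading of the caption, namely that $\mathcal{H}_i$ counts how many distinct tags appear exactly $i$ times among the submissions; the function name and the string tags are illustrative only.

```python
from collections import Counter


def multiplicity_histogram(tags):
    """Map each multiplicity i to the number of distinct tags seen exactly i times.

    tags: iterable of submission tags (hypothetical plaintext stand-ins).
    """
    tag_counts = Counter(tags)            # tag -> how many submissions carry it
    return Counter(tag_counts.values())   # multiplicity -> number of such tags


if __name__ == "__main__":
    submissions = ["a", "a", "b", "c", "c", "d"]
    print(dict(multiplicity_histogram(submissions)))
```

Here two tags occur once and two tags occur twice, so the histogram has two entries with value 2 each.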

Theorems & Definitions (14)

  • Definition 1: Truncated Shifted Discrete Laplace Distribution
  • Theorem 1
  • Proof
  • Proposition 1
  • Proof
  • Lemma 2
  • Theorem 3
  • Proof
  • Theorem 4
  • Proof
  • ...and 4 more
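
To give a feel for the distribution named in Definition 1, here is a minimal sampler. The definition itself is not reproduced in this excerpt, so the form used below is an assumption: support $\{0,\dots,2s\}$ with probability mass proportional to $e^{-\varepsilon|k-s|}$ for a shift $s$. The paper's exact parameterization may differ, and the function names are hypothetical.

```python
import math
import random


def tsdl_pmf(eps, shift):
    """Assumed pmf: P(k) proportional to exp(-eps * |k - shift|) on {0, ..., 2*shift}."""
    support = list(range(0, 2 * shift + 1))
    weights = [math.exp(-eps * abs(k - shift)) for k in support]
    total = sum(weights)
    return support, [w / total for w in weights]


def sample_dummy_count(eps, shift, rng):
    """Draw one non-negative dummy count from the assumed distribution."""
    support, probs = tsdl_pmf(eps, shift)
    return rng.choices(support, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print([sample_dummy_count(eps=1.0, shift=5, rng=rng) for _ in range(10)])
```

Shifting and truncating keep every draw non-negative and bounded, which is what makes the noise usable as a count of dummy submissions.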