Nebula: Efficient, Private and Accurate Histogram Estimation

Ali Shahin Shamsabadi; Peter Snyder; Ralph Giles; Aurélien Bellet; Hamed Haddadi

TL;DR

Nebula tackles private distributed histogram estimation in an adversarial setting by combining sampling, thresholding, and dummy data with a secret-sharing protocol that uses two non-colluding untrusted servers. It employs a Verifiable Oblivious PRF for randomness, $\tau$-out-of-$N$ secret sharing for client data, and dummy data injected via a truncated discrete Laplace mechanism to achieve $(\varepsilon,\delta)$-DP without trusted third parties. The approach extends to nested, high-dimensional marginal histograms (Nested-Nebula) and demonstrates strong utility, efficiency, and scalability across real datasets, including Census and Shakespeare. The work provides formal privacy and cryptographic analyses, empirical evaluations, and an open-source implementation, offering a practical path to private, scalable histogram estimation in distributed settings.
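To make the $\tau$-out-of-$N$ idea concrete, here is a minimal sketch of textbook Shamir secret sharing over a prime field. It is not Nebula's actual construction: the field size, the function names `make_shares` and `reconstruct`, and the way the randomness is drawn are all assumptions for illustration, and the summary above does not specify how Nebula derives shares from the Verifiable Oblivious PRF.

```python
import random

# Illustrative field modulus (the Mersenne prime 2^127 - 1); an assumption,
# not a parameter taken from the paper.
PRIME = 2**127 - 1


def make_shares(secret, threshold, num_shares, rng):
    """Split `secret` into points on a random polynomial of degree threshold-1."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Lagrange-interpolate the shared polynomial at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With a threshold of 3, any three shares recover the secret, while interpolating from only two shares yields an unrelated field element. This is the property that lets a server learn a value only once enough matching submissions have arrived.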

Abstract

We present \textit{Nebula}, a system for differentially private histogram estimation on data distributed among clients. \textit{Nebula} allows clients to independently decide whether to participate in the system, and locally encode their data so that an untrusted server only learns data values whose multiplicity exceeds a predefined aggregation threshold, with $(\varepsilon,\delta)$ differential privacy guarantees. Compared to existing systems, \textit{Nebula} uniquely achieves: \textit{i)} a strict upper bound on client privacy leakage; \textit{ii)} significantly higher utility than standard local differential privacy systems; and \textit{iii)} no requirement for trusted third parties, multi-party computation, or trusted hardware. We provide a formal evaluation of \textit{Nebula}'s privacy, utility and efficiency guarantees, along with an empirical assessment on three real-world datasets. On the United States Census dataset, clients can submit their data in just 0.0036 seconds and 0.0016 MB (\textbf{efficient}), under strong $(\varepsilon=1,\delta=10^{-8})$ differential privacy guarantees (\textbf{private}), enabling \textit{Nebula}'s untrusted aggregation server to estimate histograms with over 88\% better utility than existing local differential privacy deployments (\textbf{accurate}). Additionally, we describe a variant that allows clients to submit multi-dimensional data, with similar privacy, utility, and performance. Finally, we provide an implementation of \textit{Nebula}.
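The interplay of voluntary participation and the aggregation threshold can be illustrated with a plaintext simulation. The sketch below leaves out all cryptography and dummy data, so it shows only the counting logic, not the privacy mechanism. The function name `thresholded_histogram`, the parameter names, and the choice to reveal values with count at least `threshold` are assumptions made for illustration.

```python
import random
from collections import Counter


def thresholded_histogram(values, p_submit, threshold, rng):
    """Simulate sampling followed by thresholding, with no cryptography.

    Each client submits its value independently with probability `p_submit`.
    Only values submitted at least `threshold` times are revealed.
    """
    submitted = [v for v in values if rng.random() < p_submit]
    counts = Counter(submitted)
    return {v: c for v, c in counts.items() if c >= threshold}
```

For example, with `p_submit=1.0` and `threshold=3`, a value held by five clients is reported with count 5, while values held by one or two clients never appear in the output.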

Paper Structure (20 sections, 7 theorems, 7 equations, 9 figures, 4 tables)

Introduction
Problem & Threat Model
Nebula Design
Local Data Preparation and Submission
Dummy Data Injection
Data Aggregation and Recovery
Privacy, Security, Utility and Communication Analysis
Privacy Analysis
Cryptographic Security
Communication Analysis
Utility Analysis
Nested-Nebula: a variant for high-dimensional marginal histograms
Experiments
Utility Comparison to Existing Works
Utility Improvements via Nested-Nebula
...and 5 more sections

Key Result

Theorem 1

Consider $N$ clients generating a dataset $D=\{x_i\}_{i=1}^N$. Let $\varepsilon_{\text{Unre}}$ be the privacy budget used in the creation of dummy data (Algorithm \ref{alg:dummy}). For $\varepsilon_{\text{Re}}>0$ and $\delta_{\text{Re}}\in(0,1)$, let $p_s=\alpha (1-e^{-\varepsilon_{\text{Re}}})$ and ...
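As a small worked example of the one formula that survives in the statement above, the following sketch evaluates $p_s=\alpha(1-e^{-\varepsilon_{\text{Re}}})$. The theorem text is cut off before $\alpha$ is defined, so `alpha` is treated here simply as a given number; the function name `sampling_probability` is hypothetical.

```python
import math


def sampling_probability(alpha, eps_re):
    """Evaluate p_s = alpha * (1 - exp(-eps_re)) for eps_re > 0.

    `alpha` is taken as given; its definition is not available in this excerpt.
    """
    if eps_re <= 0:
        raise ValueError("eps_re must be positive")
    return alpha * (1.0 - math.exp(-eps_re))
```

With `alpha = 1` and `eps_re = 1` this gives $1 - e^{-1} \approx 0.632$, and the value grows toward `alpha` as `eps_re` increases.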

Figures (9)

Figure 1: Nebula's output to the aggregation server consists of a histogram $\mathcal{H}$ of multiplicities where $\mathcal{H}_i$ represents the number of submissions with the same tag, with multiplicity $i$ and $i \in [m]$. This histogram is obtained based on submissions that each client sent with probability $p_s$ (empty bar) and dummy data (hatched bar).
Figure 2: Ideal functionality for Nebula.
Figure 3: Utility of Nebula compared with local, shuffle and central differential privacy applied to the Shakespeare database as a function of histogram bins using an $\varepsilon=1$ DP privacy guarantee. The word-frequency estimate of Nebula is more accurate than local and shuffle DP while removing the trust that central DP models place in the server.
Figure 4: Improving the utility of Nebula in estimating the histogram on the multi-attribute IPUMS dataset and the Foursquare dataset using multi-dimensional data encoding, Nested-Nebula (Algorithm \ref{alg:client-nested-encoding}). IPUMS contains 5 attributes--S: Sex; M: Marriage status; R: Race; E: Education; A: Age. The Foursquare dataset contains 8 prefixes: $\mathbf{x}^{(1)}=[x_1]$, $\mathbf{x}^{(2)}=[x_1,x_2]$, $\cdots$, $\mathbf{x}^{(8)}=[x_1,\cdots,x_{8}]$. We compute the utility as the absolute error between the original histogram and the estimated histogram. Multi-dimensional data encoding significantly improves the absolute error of each marginal histogram (i.e., histogram of each sequence of joint attributes).
Figure 5: Original and estimated histograms obtained privately by Nebula on the Foursquare dataset. The private histogram estimated by Nebula is close to the histogram of the original data. Nebula also preserves the relative order across values.
...and 4 more figures
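The prefix-based marginals and the absolute-error metric mentioned in the Figure 4 caption can be sketched as follows. This is an illustrative reading, not the paper's evaluation code: the helper names `prefix_histograms` and `absolute_error` are hypothetical, and summing the per-bin absolute differences (rather than averaging or normalising them) is an assumption.

```python
from collections import Counter


def prefix_histograms(records, max_len):
    """Return one histogram per prefix length k = 1..max_len.

    Each record is a sequence [x_1, ..., x_d]; the k-th histogram counts
    the prefixes (x_1, ..., x_k) across all records.
    """
    return [
        Counter(tuple(r[:k]) for r in records)
        for k in range(1, max_len + 1)
    ]


def absolute_error(original, estimated):
    """Sum of |original - estimated| over the union of bins (an assumption)."""
    keys = set(original) | set(estimated)
    return sum(abs(original.get(k, 0) - estimated.get(k, 0)) for k in keys)
```

For three records `("F", "single")`, `("F", "married")` and `("M", "single")`, the length-1 histogram counts `("F",)` twice and `("M",)` once, and the length-2 histogram has three bins of count one each.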

Theorems & Definitions (14)

Definition 1: Truncated Shifted Discrete Laplace Distribution
Theorem 1
proof
Proposition 1
proof
Lemma 2
Theorem 3
proof
Theorem 4
proof
...and 4 more
