Differentially Private Kernel Density Estimation

Erzhi Liu; Jerry Yao-Chieh Hu; Alex Reneau; Zhao Song; Han Liu

TL;DR

This paper tackles privately computing KDE sums $\sum_{x\in X} f(x,y)$ over a private dataset $X\subset\mathbb{R}^d$ by designing a refined DP data structure that answers queries $y$ through a low-cost decomposition of the sum. Building on the node-contaminated balanced tree, the authors store per-node sums and counts and decompose the 1D KDE into $O(\log n)$ components, each a combination of distance terms and counts. This gives $O(d\log n)$ query time in $d$ dimensions, with approximation ratio 1 and additive error $O(\epsilon^{-1} Rd^{1.5}\log^{1.5} n)$ per query. The method extends to $\ell_2$ and $\ell_p^p$ kernels via dimensionality-reduction strategies, improving both the privacy-utility tradeoff and efficiency relative to prior work. Empirical results corroborate the theoretical gains, showing faster queries and lower error than the previous best method, BL+24. The approach offers a scalable, privacy-preserving KDE framework for static datasets, with potential applications to synthetic data generation and private data analysis.
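One way to see why sums and counts suffice in one dimension is the following elementary identity. The symbols $C$ and $S$ (count and sum of the points on one side of the query) are introduced here for illustration and are not notation taken from the paper. For $X \subset [0,R]$ and a query $y$:

$$\sum_{x\in X} |x-y| \;=\; \sum_{x\in X,\, x\le y} (y-x) \;+\; \sum_{x\in X,\, x> y} (x-y) \;=\; \big(y\,C_{\le y} - S_{\le y}\big) + \big(S_{>y} - y\,C_{>y}\big).$$

Because $\|x-y\|_1 = \sum_{j=1}^{d} |x_j - y_j|$, the same identity can be applied coordinate by coordinate in $d$ dimensions.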

Abstract

We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE) that improves both the privacy-utility tradeoff and the efficiency of prior results. Specifically, we study the following problem (DP KDE): given a similarity function $f$ and a private dataset $X \subset \mathbb{R}^d$, preprocess $X$ so that for any query $y\in\mathbb{R}^d$ we can approximate $\sum_{x \in X} f(x, y)$ in a differentially private fashion. The best previous algorithm for $f(x,y) =\| x - y \|_1$ is the node-contaminated balanced binary tree of [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires $O(nd)$ space and time for preprocessing, where $n=|X|$. For any query point, the query time is $d \log n$, with an error guarantee of a $(1+\alpha)$-approximation and an additive error of $\epsilon^{-1} \alpha^{-0.5} d^{1.5} R \log^{1.5} n$. In this paper, we improve the best previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects:

- We reduce the query time by a factor of $\alpha^{-1} \log n$.
- We improve the approximation ratio from $1+\alpha$ to 1.
- We reduce the error dependence by a factor of $\alpha^{-0.5}$.

From a technical perspective, our method of constructing the search tree differs from previous work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. In prior work, the answer to each query is split into $\alpha^{-1} \log n$ numbers, each obtained by summing $\log n$ values from interval-tree counts. In contrast, we construct the tree differently and split the answer into $\log n$ numbers, each of which combines two distance values, two counting values, and $y$ itself. We believe our tree structure may be of independent interest.
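The idea of answering a 1D $\ell_1$ query from $O(\log n)$ noisy per-node sums and counts can be sketched as follows. This is a minimal illustration, not the authors' algorithm: the class name `NoisyDistanceTree1D`, the dyadic partition of $[0,R]$, the `depth` parameter, and the simple Laplace calibration (budget split evenly between counts and sums, add/remove neighbors) are all assumptions made for this example.

```python
import numpy as np


class NoisyDistanceTree1D:
    """Hypothetical sketch: dyadic tree over [0, R] with noisy counts and sums."""

    def __init__(self, xs, R, depth, eps=None, seed=0):
        xs = np.asarray(xs, dtype=float)  # assumed to lie in [0, R]
        self.R, self.depth = float(R), depth
        n_leaves = 1 << depth
        self.width = self.R / n_leaves
        # Leaf index of each point; x == R is clipped into the last leaf.
        leaf = np.minimum((xs / self.width).astype(int), n_leaves - 1)
        size = 2 * n_leaves  # heap layout: root at 1, children 2i and 2i+1
        self.cnt = np.zeros(size)
        self.sm = np.zeros(size)
        np.add.at(self.cnt, n_leaves + leaf, 1.0)
        np.add.at(self.sm, n_leaves + leaf, xs)
        for i in range(n_leaves - 1, 0, -1):  # exact aggregates, bottom-up
            self.cnt[i] = self.cnt[2 * i] + self.cnt[2 * i + 1]
            self.sm[i] = self.sm[2 * i] + self.sm[2 * i + 1]
        if eps is not None:
            # Assumed calibration: one point touches one node per level, so
            # the L1 sensitivity is `levels` for counts and `levels * R` for
            # sums; each family gets half of the privacy budget.
            rng = np.random.default_rng(seed)
            levels = depth + 1
            self.cnt += rng.laplace(0.0, 2 * levels / eps, size)
            self.sm += rng.laplace(0.0, 2 * levels * self.R / eps, size)

    def query(self, y):
        """Return (estimate of sum |x - y|, number of tree nodes used)."""
        total, terms = 0.0, 0
        stack = [(1, 0.0, self.R, 0)]  # (node, lo, hi, level)
        while stack:
            i, lo, hi, level = stack.pop()
            if hi <= y:  # node entirely left of y: sum of (y - x)
                total += y * self.cnt[i] - self.sm[i]
                terms += 1
            elif lo >= y:  # node entirely right of y: sum of (x - y)
                total += self.sm[i] - y * self.cnt[i]
                terms += 1
            elif level < self.depth:  # node straddles y: descend
                mid = 0.5 * (lo + hi)
                stack.append((2 * i, lo, mid, level + 1))
                stack.append((2 * i + 1, mid, hi, level + 1))
            # A leaf straddling y is skipped; each of its points is within
            # one leaf width of y, which bounds the discretization error.
        return total, terms
```

With `eps=None` no noise is added, which makes the decomposition easy to check by hand: for the points 1.5, 3.0, 7.25, 12.0 with `R=16.0`, `depth=4`, the query `y=5.0` returns the exact value 14.75 from five nodes. Only one node per level can straddle the query, so the number of noisy terms grows with the depth rather than with the number of points.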

Paper Structure (33 sections, 16 theorems, 28 equations, 3 figures, 3 tables, 4 algorithms)


Introduction
Related Work.
Organization.
Preliminaries
Differential Privacy
Backurs et al.'s DP Data Structure: Node-Contaminated Balanced Tree
High-Level Overview of Our DP Data Structure
A Refined Differentially Private Data Structure
Key Observation and New Data Structure for One Dimensional KDE Query
Time Complexity
Privacy Guarantees
Error Guarantee
One Dimensional Differentially Private Data Structure
High Dimensional Distance Query
Proof-of-Concept Experiments
...and 18 more sections

Key Result

Theorem 1.1

Given a dataset $X \subset \mathbb{R}^d$ with $|X|=n$, there is an algorithm that uses $O(nd)$ space to build a data structure which supports the following operations:

Figures (3)

Figure 1: Running Time for Different Size $n$
Figure 2: Relative Error for Different $\epsilon$
Figure 3: Performance for Different $\epsilon$

Theorems & Definitions (36)

Definition 1.1: Similarity Error between Two Data Structures
Theorem 1.1: Informal Version of the Main Theorem
Definition 2.1: Pure/Approximate Differential Privacy
Lemma 2.1: Advanced Composition Starting from Pure DP [Dwork et al., 2010]
Lemma 3.1
proof
Lemma 3.2: Init Time
proof
Lemma 3.3: Query Time
proof
...and 26 more
