Differentially Private Kernel Density Estimation

Erzhi Liu, Jerry Yao-Chieh Hu, Alex Reneau, Zhao Song, Han Liu

TL;DR

This paper addresses privately computing KDE sums $\sum_{x\in X} f(x,y)$ over a private dataset $X\subset\mathbb{R}^d$ by designing a refined DP data structure that answers any query $y$ through a low-cost decomposition of the sum. Building on the node-contaminated balanced binary tree of BL+24, the authors store per-node sums and counts and decompose the 1D KDE into $O(\log n)$ components, each a combination of distance terms and counts. For the $\ell_1$ kernel in $d$ dimensions, this gives $O(d\log n)$ query time with approximation ratio $1$ (no multiplicative error) and additive error $O(\epsilon^{-1} R d^{1.5}\log^{1.5} n)$ per query. The method extends to $\ell_2$ and $\ell_p^p$ kernels via dimensionality-reduction strategies, improving both the privacy-utility tradeoff and efficiency relative to prior work. Empirical results corroborate the theoretical gains, showing faster queries and lower error than the previous best method, BL+24. The approach offers a scalable, privacy-preserving KDE framework for static datasets, with potential applications to synthetic data generation and private data analysis.

Abstract

We introduce a refined differentially private (DP) data structure for kernel density estimation (KDE), offering not only improved privacy-utility tradeoff but also better efficiency over prior results. Specifically, we study the mathematical problem: given a similarity function $f$ (or DP KDE) and a private dataset $X \subset \mathbb{R}^d$, our goal is to preprocess $X$ so that for any query $y\in\mathbb{R}^d$, we approximate $\sum_{x \in X} f(x, y)$ in a differentially private fashion. The best previous algorithm for $f(x,y) =\| x - y \|_1$ is the node-contaminated balanced binary tree by [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. Their algorithm requires $O(nd)$ space and time for preprocessing with $n=|X|$. For any query point, the query time is $d \log n$, with an error guarantee of a $(1+\alpha)$-approximation plus an additive error of $\epsilon^{-1} \alpha^{-0.5} d^{1.5} R \log^{1.5} n$. In this paper, we improve the best previous result [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024] in three aspects:

  • We reduce query time by a factor of $\alpha^{-1} \log n$.
  • We improve the approximation ratio from $1+\alpha$ to $1$.
  • We reduce the error dependence by a factor of $\alpha^{-0.5}$.

From a technical perspective, our method of constructing the search tree differs from previous work [Backurs, Lin, Mahabadi, Silwal, and Tarnawski, ICLR 2024]. In prior work, for each query, the answer is split into $\alpha^{-1} \log n$ numbers, each derived from the summation of $\log n$ values in interval-tree counts. In contrast, we construct the tree differently, splitting the answer into $\log n$ numbers, where each is a smart combination of two distance values, two counting values, and $y$ itself. We believe our tree structure may be of independent interest.
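The idea of answering an $\ell_1$ query from per-node sums and counts can be illustrated in one dimension. Since $\sum_{x}|x-y| = \sum_{x\le y}(y-x) + \sum_{x>y}(x-y)$, any group of points lying entirely on one side of $y$ contributes $y\cdot c - s$ (left side) or $s - y\cdot c$ (right side), where $c$ is the group's count and $s$ its sum. The Python sketch below is a minimal illustration under stated assumptions, not the authors' construction: the class name `NoisyDyadicTree`, the fixed dyadic grid over $[0, R]$, the `depth` parameter, the even split of $\epsilon$ between counts and sums, and the add/remove-one neighboring relation used to calibrate the Laplace noise are all hypothetical choices made here for illustration. Points sharing the finest grid cell with $y$ are ignored, which adds a discretization error of at most $R/2^{\text{depth}}$ per such point.

```python
import math
import random


def laplace(rng, scale):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class NoisyDyadicTree:
    """Hypothetical sketch: noisy per-node counts and sums on a fixed dyadic grid over [0, R]."""

    def __init__(self, points, R, epsilon, depth=10, rng=None):
        self.R, self.depth = R, depth
        rng = rng or random.Random()
        # Level l has 2**l nodes; level 0 (the root) is allocated but never queried.
        self.counts = [[0.0] * (2 ** l) for l in range(depth + 1)]
        self.sums = [[0.0] * (2 ** l) for l in range(depth + 1)]
        for x in points:
            x = min(max(x, 0.0), R)  # clip so each point changes a node sum by at most R
            leaf = self._leaf(x)
            for l in range(1, depth + 1):
                i = leaf >> (depth - l)  # ancestor of the leaf at level l
                self.counts[l][i] += 1.0
                self.sums[l][i] += x
        if math.isfinite(epsilon):
            # Assumption: add/remove-one neighbors. One point touches `depth` nodes,
            # so the L1 sensitivity is depth (counts) and depth * R (sums);
            # epsilon is split evenly between the two families of statistics.
            count_scale = 2.0 * depth / epsilon
            sum_scale = 2.0 * depth * R / epsilon
            for l in range(1, depth + 1):
                for i in range(2 ** l):
                    self.counts[l][i] += laplace(rng, count_scale)
                    self.sums[l][i] += laplace(rng, sum_scale)

    def _leaf(self, x):
        # Index of the finest grid cell containing x, clamped to the valid range.
        cells = 2 ** self.depth
        return min(max(int(x / self.R * cells), 0), cells - 1)

    def query(self, y):
        """Estimate sum_x |x - y| from `depth` sibling nodes along y's root-to-leaf path."""
        leaf = self._leaf(y)
        total = 0.0
        for l in range(1, self.depth + 1):
            i = leaf >> (self.depth - l)  # node on y's path at level l
            sib = i ^ 1                   # its sibling lies entirely on one side of y
            c, s = self.counts[l][sib], self.sums[l][sib]
            total += (y * c - s) if sib < i else (s - y * c)
        return total
```

Each query reads one noisy count and one noisy sum per level, so it combines `depth` terms, mirroring the $O(\log n)$-term decomposition described above in simplified form. Passing `epsilon=math.inf` disables the noise, which is convenient for checking the decomposition against a brute-force sum; a $d$-dimensional $\ell_1$ query could be handled by one such structure per coordinate, with the privacy budget divided accordingly.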

Paper Structure

This paper contains 33 sections, 16 theorems, 28 equations, 3 figures, 3 tables, 4 algorithms.

Key Result

Theorem 1.1

Given a dataset $X \subset \mathbb{R}^d$ with $|X|=n$, there is an algorithm that uses $O(nd)$ space to build a data structure which supports the following operations:

Figures (3)

  • Figure 1: Running Time for Different Size $n$
  • Figure 2: Relative Error for Different $\epsilon$
  • Figure 3: Performance for Different $\epsilon$

Theorems & Definitions (36)

  • Definition 1.1: Similarity Error between Two Data Structures
  • Theorem 1.1: Informal Version of the Main Theorem
  • Definition 2.1: Pure/Approximate Differential Privacy
  • Lemma 2.1: Advanced Composition Starting from Pure DP [Dwork2010]
  • Lemma 3.1
  • proof
  • Lemma 3.2: Init Time
  • proof
  • Lemma 3.3: Query Time
  • proof
  • ...and 26 more