Privacy-Enhanced Database Synthesis for Benchmark Publishing (Technical Report)

Yunqing Ge, Jianbin Qin, Shuyuan Zheng, Yongrui Zhong, Bo Tang, Yu-Xuan Qiu, Rui Mao, Ye Yuan, Makoto Onizuka, Chuan Xiao

TL;DR

PrivBench addresses the challenge of publishing privacy-preserving benchmarks by synthesizing high-fidelity multi-relational databases under database-level DP. It introduces a novel SPN-based framework that builds differentially private SPNs per table, augments them with private fanout information to model FK references, and samples from the modified SPNs to generate synthetic data. Theoretical analysis proves DP guarantees and polynomial-time complexity, while extensive experiments on multiple datasets show PrivBench achieves superior data-distribution fidelity (low KL divergence) and query-workload fidelity (low Q-error) with competitive synthesis times, especially under tight privacy budgets. This work lays a foundation for practical, privacy-protecting benchmark publishing and potential data-trading scenarios.

Abstract

Benchmarking is crucial for evaluating a DBMS, yet existing benchmarks often fail to reflect the varied nature of user workloads. As a result, there is increasing momentum toward creating databases that incorporate real-world user data to more accurately mirror business environments. However, privacy concerns deter users from directly sharing their data, underscoring the importance of creating synthesized databases for benchmarking that also prioritize privacy protection. Differential privacy (DP)-based data synthesis has become a key method for safeguarding privacy when sharing data, but the focus has largely been on minimizing errors in aggregate queries or downstream ML tasks, with less attention given to benchmarking factors like query runtime performance. This paper studies differentially private database synthesis specifically for benchmark publishing scenarios, aiming to produce a synthetic database whose benchmarking factors closely resemble those of the original data. We introduce PrivBench, a synthesis framework based on sum-product networks (SPNs) that supports the synthesis of high-quality benchmark databases that maintain fidelity in both data distribution and query runtime performance while preserving privacy. We prove that PrivBench ensures database-level DP even when generating multi-relation databases with complex reference relationships. Our extensive experiments show that PrivBench efficiently synthesizes data that maintains privacy and excels in both data distribution similarity and query runtime similarity.

Paper Structure

This paper contains 30 sections, 10 theorems, 9 equations, 4 figures, 5 tables, 7 algorithms.

Key Result

Lemma 1

If $|\mathrm{attr}(T)| > 1$, $\mathsf{Planning}(T, \epsilon)$ satisfies table-level $(\epsilon \cdot \gamma_1 / \sigma(T))$-DP; otherwise, it satisfies table-level $0$-DP.
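
The case split in Lemma 1 can be illustrated with a small Python sketch. It only evaluates the privacy parameter stated in the lemma and is not the Planning procedure itself. The function name `planning_privacy_cost` and its parameter names are hypothetical, and because $\gamma_1$ and $\sigma(T)$ are not defined in this excerpt, they are treated here simply as given positive numbers.

```python
def planning_privacy_cost(num_attributes: int, epsilon: float,
                          gamma_1: float, sigma_t: float) -> float:
    """Return the table-level DP parameter that Lemma 1 states for Planning(T, epsilon).

    num_attributes stands for |attr(T)|. gamma_1 and sigma_t stand for the
    lemma's gamma_1 and sigma(T); their meaning is not given in this excerpt,
    so they are assumed to be plain positive numbers supplied by the caller.
    """
    if num_attributes < 1:
        raise ValueError("a table needs at least one attribute")
    if epsilon < 0 or gamma_1 < 0 or sigma_t <= 0:
        raise ValueError("epsilon and gamma_1 must be non-negative, sigma_t positive")
    if num_attributes > 1:
        # Multi-attribute case: epsilon * gamma_1 / sigma(T).
        return epsilon * gamma_1 / sigma_t
    # Single-attribute case: the lemma states table-level 0-DP.
    return 0.0


multi_attribute_cost = planning_privacy_cost(3, epsilon=1.0, gamma_1=0.5, sigma_t=2.0)
single_attribute_cost = planning_privacy_cost(1, epsilon=1.0, gamma_1=0.5, sigma_t=2.0)
```

With these illustrative inputs, the multi-attribute table is charged $1.0 \cdot 0.5 / 2.0 = 0.25$, while the single-attribute table is charged nothing.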

Figures (4)

  • Figure 1: Learning a single-relation SPN and generating data based on the SPN.
  • Figure 2: A database consisting of two private tables.
  • Figure 3: Fanout table construction.
  • Figure 4: An example of Query Heat Map for a given query. The edge values indicate the proportions of rows assigned to each subtree of a sum node, while the node values represent the selectivities for the given query.

Theorems & Definitions (13)

  • Definition 1: Table-Level DP [dwork2014algorithmic]
  • Definition 2: Database-Level DP [cai2023privlava]
  • Definition 3: Normalized Mutual Information
  • Lemma 1
  • Lemma 2
  • Theorem 1: Parallel composition under bounded DP
  • Lemma 3
  • Lemma 4
  • Theorem 2
  • Lemma 5
  • ...and 3 more