Data Plagiarism Index: Characterizing the Privacy Risk of Data-Copying in Tabular Generative Models

Joshua Ward; Chi-Hua Wang; Guang Cheng

TL;DR

This work addresses the privacy risks of data-copying in tabular generative models by introducing the Data Plagiarism Index (DPI), a local-copying measure that is grounded in a privacy threat model and paired with a Data Plagiarism Membership Inference Attack (DPI MIA). By formalizing data-copying, defining a proportion-based DPI over a target point's neighborhood, and linking it to MIAs, the authors enable direct auditing of privacy risk in high-dimensional, mixed-type tabular data. Empirical results on the Adult dataset show that high-fidelity generators tend to copy training data more, with notable fairness concerns as DPI identifies outlier privileged sub-populations being copied. DPI also offers a complementary attack signal to existing MIAs, supporting its use as a practical privacy and fairness auditing tool for synthetic data generation and informing future work on differential privacy integration and robustness.

Abstract

The promise of tabular generative models is to produce realistic synthetic data that can be shared and used safely, without dangerous leakage of information from the training set. In evaluating these models, a variety of methods have been proposed to measure the tendency to copy data from the training dataset when generating a sample. However, these methods either do not consider data-copying from a privacy threat perspective, are not motivated by recent results in the data-copying literature, or are difficult to make compatible with the high-dimensional, mixed-type nature of tabular data. This paper proposes a new similarity metric and Membership Inference Attack called Data Plagiarism Index (DPI) for tabular data. We show that DPI evaluates a new, intuitive definition of data-copying and characterizes the corresponding privacy risk. We show that the data-copying identified by DPI poses both privacy and fairness threats to common, high-performing architectures, underscoring the necessity for more sophisticated generative modeling techniques to mitigate this issue.

Paper Structure (25 sections, 3 equations, 7 figures, 1 table)

Introduction
Related Work
Measure Data-Copying in Generative Models
Similarity Metrics between Real and Synthetic Data
Membership Inference Attacks for Generative Models
Preliminaries
Formal definition of Data-Copying
Distance to Closest Records (DCR)
Membership Inference Attacks as Privacy Auditors
Measuring Data-Copying Misbehavior
Data Plagiarism Index (DPI)
Data Plagiarism Membership Inference Attack
Results
Experiment Setup
Data-Copying in Tabular Data Generators
...and 10 more sections

Figures (7)

Figure 1: A t-SNE plot of Tab-DDPM's training data with corresponding top 1% DPI Scores in red on the Adult dataset. DPI identifies an outlier region in the bottom right corresponding to an extreme privileged class (married, white, middle-aged, high capital gains, private industry, respondents making $>$50k in income). This provides evidence that Tab-DDPM copies the training data of outlier and privileged classes, creating serious fairness concerns for practitioners who use synthetic data in their downstream machine learning tasks. See Sec. \ref{subsec:Train_Data_High_DPI} for detailed discussions.
Figure 2: Data Plagiarism Index (DPI): We propose a novel privacy metric named Data Plagiarism Index (DPI). For each target data point (black point), we calculate the Data Plagiarism Index $\rho$ by first constructing a k-nearest neighborhood (blue circle) around the target data point in the space containing reference data points (green points) and synthetic data points (red points). The Data Plagiarism Index $\rho$ is then defined simply as the ratio of the number of synthetic data points to the number of reference data points in that neighborhood. See Sec. \ref{subsec:Data_Plag_Index} for full details.
Figure 3: Data Plagiarism Index Membership Inference Attack (DPI MIA). See Sec. \ref{subsec:DPIMIA} for full details.
Figure 4: Classifier AUCROC, Maximum Mean Discrepancy, and Wasserstein Distance plotted with corresponding DPI MIA AUCROC for various common architectures. Interestingly, Bayesian Network, Adversarial Random Forest, and Tab-DDPM outperform other models in these performance metrics but have higher privacy risk. See Sec. \ref{subsec:data_copy_tabu_data_generator} for full details.
Figure 5: MIA AUCROC Benchmarks by Training Set Size on Tab-DDPM. This shows that DPI MIA is more effective than the other existing MIAs described in Sec. \ref{subsec:ref_MIA}. See Sec. \ref{subsec:benchmarkDPI} for full details.
...and 2 more figures
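
The neighborhood construction described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `dpi_scores`, the use of Euclidean distance on purely numeric features, the default `k`, and the choice to return infinity when a neighborhood contains no reference points are all assumptions made for illustration. Mixed-type tabular data would first need an encoding step that is not shown here.

```python
import numpy as np


def dpi_scores(targets, reference, synthetic, k=5):
    """Sketch of a DPI-style score for each target point.

    For every target, take its k nearest neighbours in the pooled set of
    reference and synthetic points, then return
    (# synthetic neighbours) / (# reference neighbours).
    Hypothetical helper: names, metric, and edge-case handling are assumptions.
    """
    targets = np.asarray(targets, dtype=float)
    reference = np.asarray(reference, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)

    pool = np.vstack([reference, synthetic])
    if k > len(pool):
        raise ValueError("k cannot exceed the number of pooled points")

    # Mark which pooled rows are synthetic (True) or reference (False).
    is_synthetic = np.concatenate([
        np.zeros(len(reference), dtype=bool),
        np.ones(len(synthetic), dtype=bool),
    ])

    # Pairwise Euclidean distances, shape (n_targets, n_pool).
    dists = np.linalg.norm(targets[:, None, :] - pool[None, :, :], axis=2)

    # Indices of the k closest pooled points for each target.
    nearest = np.argsort(dists, axis=1, kind="stable")[:, :k]

    n_syn = is_synthetic[nearest].sum(axis=1)
    n_ref = k - n_syn

    # Assumption: a neighbourhood with no reference points scores infinity.
    return np.where(n_ref > 0, n_syn / np.maximum(n_ref, 1), np.inf)


# Toy data: three synthetic points crowd around the first target.
reference = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
synthetic = [[0.1, 0.1], [0.0, 0.2], [0.2, 0.0], [9.0, 9.0]]
targets = [[0.0, 0.0], [5.5, 5.0]]

scores = dpi_scores(targets, reference, synthetic, k=4)
print(scores)
```

In this toy example the first target sits in a neighborhood dominated by synthetic points, so its score is above 1. The second target is surrounded mostly by reference points, so its score is below 1. Under the reading of the caption used here, a high score suggests the generator has placed unusually many samples close to that target.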
