A Comparison of SynDiffix Multi-table versus Single-table Synthetic Data

Paul Francis

TL;DR

Comparing SynDiffix with 15 other commercial and academic synthetic data techniques using the SDNIST analysis framework, modified by us to accommodate multi-table synthetic data, shows that SynDiffix is many times more accurate than other approaches for low-dimension tables, but somewhat worse than the best single-table techniques for high-dimension tables.

Abstract

SynDiffix is a new open-source tool for structured data synthesis. It has anonymization features that allow it to generate multiple synthetic tables while maintaining strong anonymity. Compared to the more common single-table approach, multi-table leads to more accurate data, since only the features of interest for a given analysis need be synthesized. This paper compares SynDiffix with 15 other commercial and academic synthetic data techniques using the SDNIST analysis framework, modified by us to accommodate multi-table synthetic data. The results show that SynDiffix is many times more accurate than other approaches for low-dimension tables, but somewhat worse than the best single-table techniques for high-dimension tables.

Paper Structure (15 sections, 2 equations, 9 figures, 5 tables)

Introduction
Overview of SynDiffix
Setup
Changes to SDNIST
SDNIST measure results
Privacy
Univariate accuracy
Pairs accuracy (correlations)
3-marginal accuracy
Linear Regression (four features)
Propensity mean square error
Principal Component Analysis (all features)
Inconsistencies
Discussion and Conclusion
Additional data

Figures (9)

Figure 1: Improvement factor of SynDiffix over other techniques for each measure (number of measured columns). Techniques with insufficient anonymity or fewer than 24 columns in the synthetic table are not comparable and are therefore excluded. A negative improvement factor (left of the dashed line) means the other technique is more accurate than SynDiffix on that measure. Improvement factors greater than 300x are not shown. Note log scale.
Figure 2: Precision Improvement (PI) and Coverage where the attacker knows the quasi-identifiers of the target, finds a record with a unique and complete match of the quasi-identifiers, and infers an unknown attribute from that record. A PI below 0.0 indicates no privacy loss whatsoever. A PI below 0.5 indicates strong anonymity.
Figure 3: Absolute and composite error for univariate (single feature) counts. The composite error is the minimum of the absolute error and the percent relative error. The values give the median composite error Improvement Factor (IF) for SynDiffix. Box plots show the 0th, 25th, 50th, 75th, and 100th percentiles plus outliers. Note log scale.
Figure 4: Comparison of SynDiffix and Ananos for univariate counts and pairwise correlation. These figures are taken verbatim from the SDNIST summary reports.
Figure 5: Accuracy of pairwise correlations and 3-marginals. The left plot gives the difference between the original and synthetic data for the Kendall Tau correlation coefficient, and corresponding improvement factors (right y-axis). The right plot gives the sampling rate over the original data that would be required to match the 3-marginal accuracy of the synthetic data. The right y-axis is the improvement factor of the 3-marginal accuracy.
...and 4 more figures
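
The composite error defined in the Figure 3 caption can be illustrated with a minimal sketch. The function name, the exact form of the percent relative error (absolute error divided by the original count, times 100), and the handling of a zero original count are assumptions made for illustration; they are not taken from the paper or from SDNIST.

```python
def composite_error(original_count: float, synthetic_count: float) -> float:
    """Return the minimum of the absolute error and the percent relative error.

    Hypothetical helper: the name and the zero-count fallback are assumptions.
    """
    absolute_error = abs(synthetic_count - original_count)
    if original_count == 0:
        # Relative error is undefined for a zero original count;
        # falling back to the absolute error is an assumption of this sketch.
        return absolute_error
    # Assumed definition: relative error expressed as a percentage of the original count.
    percent_relative_error = 100.0 * absolute_error / abs(original_count)
    return min(absolute_error, percent_relative_error)


if __name__ == "__main__":
    # Large count: the percent relative error is the smaller of the two.
    print(composite_error(10000, 10050))
    # Small count: the absolute error is the smaller of the two.
    print(composite_error(10, 12))
```

Taking the minimum keeps small counts from being penalized by a large relative error and large counts from being penalized by a large absolute error.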
