DEREC-SIMPRO: unlock Language Model benefits to advance Synthesis in Data Clean Room

Tung Sum Thomas Kwok, Chi-hua Wang, Guang Cheng

TL;DR

Results show that DEREC improves synthetic data fidelity and that multi-table synthesizers outperform their single-table counterparts in collaboration settings. Together, the DEREC-SIMPRO pipeline offers a robust solution for generalizing data collaboration, promoting a more efficient, data-driven society.

Abstract

Data collaboration via a Data Clean Room offers value but raises privacy concerns, which can be addressed with synthetic data and multi-table synthesizers. Common multi-table synthesizers perform poorly when subjects occur repeatedly in both tables. This is an urgent yet unresolved problem, since it is common for both tables to contain repeating subjects. To improve performance in this scenario, we present DEREC, a 3-step pre-processing pipeline that makes multi-table synthesizers more broadly applicable. We also introduce SIMPRO, a set of 3-aspect evaluation metrics that leverage conditional distributions and large-scale simultaneous hypothesis testing to provide comprehensive feedback on synthetic data fidelity at both the column and table levels. Results show that using DEREC improves fidelity and that multi-table synthesizers outperform their single-table counterparts in collaboration settings. Together, the DEREC-SIMPRO pipeline offers a robust solution for generalizing data collaboration, promoting a more efficient, data-driven society.
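
To make the repeating-subject scenario concrete, the following is a minimal sketch in plain Python. It is not the authors' DEREC implementation. It assumes each party's table is a list of row dictionaries that share a subject key. The helper names `has_repeating_subjects` and `build_parent_child`, the `user_id` key, and the toy rows are all hypothetical. The sketch checks whether subjects repeat in both tables and, if so, derives a parent table of unique subjects so that both original tables can act as child tables. Restricting the parent to subjects present in both tables is also an assumption.

```python
from collections import Counter


def has_repeating_subjects(rows, key):
    """Return True if any subject key value appears in more than one row."""
    counts = Counter(row[key] for row in rows)
    return any(n > 1 for n in counts.values())


def build_parent_child(table_a, table_b, key):
    """Hypothetical helper: build a parent table of unique shared subjects.

    Both input tables are filtered to the shared subjects and returned as
    child tables that reference the parent through `key`.
    """
    shared = {r[key] for r in table_a} & {r[key] for r in table_b}
    parent = [{key: s} for s in sorted(shared)]
    child_a = [r for r in table_a if r[key] in shared]
    child_b = [r for r in table_b if r[key] in shared]
    return parent, child_a, child_b


# Toy data (invented for illustration): subject 1 repeats in both tables.
party1 = [
    {"user_id": 1, "purchase": "book"},
    {"user_id": 1, "purchase": "pen"},
    {"user_id": 2, "purchase": "lamp"},
]
party2 = [
    {"user_id": 1, "visit": "mon"},
    {"user_id": 1, "visit": "tue"},
    {"user_id": 3, "visit": "wed"},
]

both_repeat = (
    has_repeating_subjects(party1, "user_id")
    and has_repeating_subjects(party2, "user_id")
)
parent, child1, child2 = build_parent_child(party1, party2, "user_id")
```

In this toy case neither original table can serve as a parent, because `user_id` is not unique in either one. The derived parent table restores a one-to-many relationship from the parent to each child.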

Paper Structure

This paper contains 32 sections, 6 equations, 16 figures, 2 tables, 1 algorithm.

Figures (16)

  • Figure 1: Proposed DEREC 3-step pre-processing pipeline and SIMPRO 3-aspect synthetic data evaluation metrics. (1) Party 1 and Party 2 contribute their data to the Data Clean Room. (2) Construct parent and child columns with DEREC (Section \ref{sub:derec}). (3) Generate synthetic data with a language model-based multi-table synthesizer. (4) Audit the quality of the synthetic data with the SIMPRO evaluation metrics (Section \ref{sub:simpro}): (4.1) compute the statistical similarity between original and synthetic data (Section \ref{subsub:statistical-similarity}); (4.2 & 4.3) compare the quality of synthetic data generated by different models (Section \ref{sub:conditional_distribution_indicator}) using statistical equality test p-values (4.2) and distance metrics (4.3). (5) Augment the first (second) party's real data with the second (first) party's synthetic data. (6) Send the augmented tabular data back to both parties to share the data.
  • Figure 2: Unique subject
  • Figure 3: Repeating subjects
  • Figure 5: Common multi-table structures before data synthesis (left subgroup) are transformed into the parent-child structure (right subgroup) through the three-step procedure (Sections \ref{subsub:detect}, \ref{subsub:recreate}, \ref{subsub:connect}).
  • Figure 6: The Statistical Similarity evaluation metrics (Section \ref{subsub:statistical-similarity}) reveal significant differences in overall synthetic data distribution compared to the three baselines.
  • ...and 11 more figures