CuTS: Customizable Tabular Synthetic Data Generation

Mark Vero, Mislav Balunović, Martin Vechev

TL;DR

CuTS addresses the challenge of sharing tabular data under privacy and bias concerns by making synthetic data generation customizable. It pre-trains a generative model on the original data and then fine-tunes it on a differentiable loss derived from declarative specifications, which can cover privacy, logical constraints, statistics, and downstream objectives. Evaluated on four datasets, CuTS outperforms specialized state-of-the-art approaches on several tasks; on the Adult dataset, at the same fairness level, it achieves 2.3% higher downstream accuracy than the state of the art in fair synthetic data generation. It also remains effective when diverse constraints are combined. By enabling broad, programmable customization while preserving utility, CuTS has practical implications for privacy-conscious data sharing in high-stakes domains.

Abstract

Privacy, data quality, and data sharing concerns pose a key limitation for tabular data applications. While generating synthetic data resembling the original distribution addresses some of these issues, most applications would benefit from additional customization on the generated data. However, existing synthetic data approaches are limited to particular constraints, e.g., differential privacy (DP) or fairness. In this work, we introduce CuTS, the first customizable synthetic tabular data generation framework. Customization in CuTS is achieved via declarative statistical and logical expressions, supporting a wide range of requirements (e.g., DP or fairness, among others). To ensure high synthetic data quality in the presence of custom specifications, CuTS is pre-trained on the original dataset and fine-tuned on a differentiable loss automatically derived from the provided specifications using novel relaxations. We evaluate CuTS over four datasets and on numerous custom specifications, outperforming state-of-the-art specialized approaches on several tasks while being more general. In particular, at the same fairness level, we achieve 2.3% higher downstream accuracy than the state-of-the-art in fair synthetic data generation on the Adult dataset.
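
To make the idea of fine-tuning on a differentiable loss derived from a logical specification more concrete, here is a minimal sketch using the constraint mentioned in the Figure 1 caption (no generated person younger than 25 with a Doctorate degree). It is not the relaxation used by CuTS. It assumes a hypothetical generator that emits, per row, independent logits over an age bucket and an education level, and it replaces the hard "and" with a product of probabilities. The bucket names, the `violation_loss` function, and the independence assumption are all illustrative.

```python
import math

# Hypothetical category layouts; the real Adult encoding may differ.
AGE_BUCKETS = ["<25", "25-44", "45+"]
EDUCATION = ["HS", "Bachelors", "Doctorate"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def violation_loss(age_logits, edu_logits):
    """Mean soft violation of: NOT (age < 25 AND education == Doctorate).

    For each row, the hard indicator of a violation is relaxed to
    p(age < 25) * p(education == Doctorate), which is smooth in the logits
    (assumption: the two features are sampled independently per row).
    """
    total = 0.0
    for a, e in zip(age_logits, edu_logits):
        p_young = softmax(a)[AGE_BUCKETS.index("<25")]
        p_doctorate = softmax(e)[EDUCATION.index("Doctorate")]
        total += p_young * p_doctorate
    return total / len(age_logits)


if __name__ == "__main__":
    # One row that almost surely violates the constraint, one that almost surely does not.
    age = [[10.0, 0.0, 0.0], [0.0, 0.0, 10.0]]
    edu = [[0.0, 0.0, 10.0], [0.0, 0.0, 10.0]]
    print(violation_loss(age, edu))
```

Minimizing such a penalty with gradient descent would push probability mass away from the forbidden combination. In practice it would be weighted against a term that keeps the fine-tuned model close to the original data distribution.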

Paper Structure (69 sections, 10 equations, 3 figures, 23 tables, 1 algorithm)

This paper contains 69 sections, 10 equations, 3 figures, 23 tables, 1 algorithm.

Figures (3)

  • Figure 1: An overview of CuTS. The data owner writes a program that lists specifications for the synthetic data. For example, they might want to make sure that the model does not generate people younger than 25 with a Doctorate degree. Additionally, they might require that the synthetic data is differentially private and unbiased. To achieve this, CuTS pre-trains a differentially private generative model, and then fine-tunes it to adhere to the given specifications. Finally, the generative model can be used to sample a synthetic dataset with the desired properties.
  • Figure 2: CuTS obfuscating the distribution using statistical manipulations, while only losing $\approx 1\%$ accuracy.
  • Figure 3: A CuTS program on the Adult dataset containing example commands for each supported constraint type.
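
The Figure 1 caption mentions requiring the synthetic data to be unbiased. As a rough illustration of how such a statistical requirement could become a smooth training signal, the sketch below computes a demographic parity gap from soft (probabilistic) group memberships and outcomes. Demographic parity is chosen here only for illustration; the text above does not say which fairness measure CuTS uses, and `demographic_parity_gap` and its arguments are hypothetical names.

```python
def demographic_parity_gap(p_positive, p_protected):
    """Soft demographic parity gap |P(y=1 | s=1) - P(y=1 | s=0)|.

    p_positive[i]  : probability that row i has the positive outcome
    p_protected[i] : probability that row i belongs to the protected group
    Using probabilities instead of hard 0/1 values keeps both group rates
    smooth functions of the inputs.
    """
    mass_1 = sum(p_protected)
    mass_0 = sum(1.0 - s for s in p_protected)
    if mass_1 == 0.0 or mass_0 == 0.0:
        raise ValueError("both groups need non-zero probability mass")
    rate_1 = sum(y * s for y, s in zip(p_positive, p_protected)) / mass_1
    rate_0 = sum(y * (1.0 - s) for y, s in zip(p_positive, p_protected)) / mass_0
    return abs(rate_1 - rate_0)


if __name__ == "__main__":
    # Two rows in the protected group, two outside it.
    outcomes = [0.8, 0.6, 0.2, 0.4]
    groups = [1.0, 1.0, 0.0, 0.0]
    print(demographic_parity_gap(outcomes, groups))
```

A gap of zero means both groups receive the positive outcome at the same rate, so a term like this could be added to a fine-tuning loss alongside other specification terms.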