DP-2Stage: Adapting Language Models as Differentially Private Tabular Data Generators

Tejumade Afonja, Hui-Po Wang, Raouf Kerkouche, Mario Fritz

TL;DR

This work tackles the challenge of generating synthetic tabular data under differential privacy (DP) by leveraging pre-trained GPT-2 language models. It introduces DP-2Stage, a two-stage fine-tuning framework: Stage 1 learns the table structure non-privately from pseudo data, and Stage 2 learns the private content under DP. Two pseudo-data options are considered: data sampled from a uniform distribution (DP-2Stage-U) and out-of-distribution data (DP-2Stage-O). Empirical results show that DP-2Stage improves utility (F1) by 12–25% and fidelity (Hist) by 1–3% over direct DP fine-tuning, while DP-2Stage-U achieves up to 21x faster inference. The results highlight the importance of decoupling table structure learning from private content learning to make DP training of LLMs more effective. The authors release their code to foster reproducibility and further research.

Abstract

Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs), even those at the scale of GPT-2, have demonstrated great potential in synthesizing tabular data. However, their application under DP constraints remains largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs face difficulties in generating coherent text when fine-tuned with DP, as privacy budgets are inefficiently allocated to non-private elements like table structures. To overcome this, we propose DP-2Stage, a two-stage fine-tuning framework for differentially private tabular data generation. The first stage involves non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on a private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuned LLMs in DP contexts. We release our code and setup at https://github.com/tejuafonja/DP-2Stage.
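To make "noisy supervision signals" concrete, the following is a minimal sketch of one differentially private gradient aggregation step, assuming the DP fine-tuning stage uses DP-SGD-style per-example clipping plus Gaussian noise. The function name `dp_sgd_gradient` and its parameters are hypothetical and are not taken from the released code.

```python
import math
import random


def dp_sgd_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Hypothetical helper: clip each per-example gradient, sum, add noise, average.

    per_example_grads: list of equal-length lists of floats (one per example).
    clip_norm:         maximum L2 norm allowed for each per-example gradient.
    noise_multiplier:  Gaussian noise std is noise_multiplier * clip_norm.
    rng:               a random.Random instance (passed in for reproducibility).
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale the gradient down only if its norm exceeds clip_norm.
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        for i, x in enumerate(grad):
            total[i] += x * scale
    sigma = noise_multiplier * clip_norm
    # Noise is added once to the clipped sum, not to each example.
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy]


grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-example gradients
noiseless = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=0.0,
                            rng=random.Random(0))
noisy = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.0,
                        rng=random.Random(0))
```

Because the noise scale is tied to the clipping norm rather than to the signal, every token in the training sequence, including table-structure tokens, is learned from perturbed gradients. This is the cost that a non-private first stage is intended to avoid for structure.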

Paper Structure

This paper contains 60 sections, 4 theorems, 14 equations, 3 figures, 10 tables.

Key Result

Theorem A.4

mironov2019r. Let $\mathrm{SGM}_{q,\sigma}$ be the Sampled Gaussian mechanism for some function $f$, under the assumption $\Delta_2 f \leq 1$ for any adjacent $E$, $E' \in {\mathcal{E}}$. Then $\mathrm{SGM}_{q,\sigma}$ satisfies ($\alpha,\rho$)-RDP if $\rho \leq \frac{1}{\alpha-1}\log\max(A_\alpha, B_\alpha)$, where $A_\alpha \overset{\Delta}{=}\mathbb{E}_{z\sim\mu_0}[(\mu(z)/\mu_0(z))^\alpha]$ and $B_\alpha \overset{\Delta}{=}\mathbb{E}_{z\sim\mu}[(\m
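As a worked illustration of the quantity $A_\alpha$ in the theorem above, the sketch below evaluates it for integer $\alpha \geq 2$ using the binomial expansion $A_\alpha = \sum_{k=0}^{\alpha} \binom{\alpha}{k}(1-q)^{\alpha-k} q^k \exp\big((k^2-k)/(2\sigma^2)\big)$. It assumes $\mu_0 = \mathcal{N}(0,\sigma^2)$ and $\mu = (1-q)\,\mathcal{N}(0,\sigma^2) + q\,\mathcal{N}(1,\sigma^2)$, and it assumes, following the cited reference, that $A_\alpha \geq B_\alpha$, so only $A_\alpha$ is computed. The function name `sgm_rdp_integer_alpha` is hypothetical.

```python
import math


def sgm_rdp_integer_alpha(q, sigma, alpha):
    """Return log(A_alpha) / (alpha - 1) for integer alpha >= 2.

    q:     sampling rate in [0, 1].
    sigma: noise standard deviation (sensitivity assumed to be 1).
    alpha: integer Renyi order, at least 2.
    """
    if not isinstance(alpha, int) or alpha < 2:
        raise ValueError("alpha must be an integer >= 2")
    if not 0.0 <= q <= 1.0 or sigma <= 0.0:
        raise ValueError("need 0 <= q <= 1 and sigma > 0")
    log_terms = []
    for k in range(alpha + 1):
        weight = math.comb(alpha, k) * (1.0 - q) ** (alpha - k) * q ** k
        if weight > 0.0:
            # E_{z ~ mu_0}[(mu_1(z) / mu_0(z))^k] = exp((k^2 - k) / (2 sigma^2))
            log_terms.append(math.log(weight) + (k * k - k) / (2.0 * sigma ** 2))
    # Log-sum-exp keeps the sum stable when individual terms are large.
    m = max(log_terms)
    log_a = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return log_a / (alpha - 1)


rho_full_batch = sgm_rdp_integer_alpha(q=1.0, sigma=2.0, alpha=4)
rho_subsampled = sgm_rdp_integer_alpha(q=0.01, sigma=2.0, alpha=4)
```

With $q = 1$ the expression reduces to the familiar Gaussian-mechanism value $\alpha/(2\sigma^2)$, which serves as a sanity check. Smaller sampling rates give a smaller $\rho$ at the same noise level.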

Figures (3)

  • Figure 1: Overview of DP-2Stage. In stage 1, the pre-trained LLM is fine-tuned on the respective pseudo data. In stage 2, the model from stage 1 is further fine-tuned on the real private data.
  • Figure 2: Illustration of column shuffling. The order of entries is permuted at every iteration, which has been shown to effectively prevent the model from relying on spurious dependencies in Non-DP settings (borisov2023language). However, we find that it complicates DP training due to gradient perturbations, often resulting in higher perplexity compared to Non-DP models, as shown in \ref{fig:motivation}.
  • Figure 3: DP-2Stage (Ours) vs. Standard DP fine-tuning on the Adult dataset with $\varepsilon=1,\delta=10^{-5}$. DP-2Stage-O is the stage 2 model whose stage 1 was fine-tuned on out-of-distribution pseudo data (Airline dataset). DP-2Stage-U is the stage 2 model whose stage 1 was fine-tuned on pseudo data sampled independently from a Uniform distribution, with statistics derived from the Adult dataset. Perplexity results are displayed from left to right for all tokens, values, keys, and non-functional tokens (cf. \ref{ssec:lm-tb}). The top plot shows the model trained without column shuffling, and the bottom plot shows the model trained with column shuffling. Total and Value Perplexity use a fixed y-axis across the top and bottom plots for ease of comparison. The perplexity values are higher with column shuffling (bottom) than without (top).
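The column shuffling and uniform pseudo data described in the captions above can be sketched as follows. This is a minimal illustration under stated assumptions: the "key is value" text format is assumed from the style of the cited borisov2023language, the column statistics are illustrative placeholders and not the actual Adult statistics used by the authors, and the names `serialize_row` and `sample_uniform_pseudo_rows` are hypothetical.

```python
import random


def serialize_row(row, rng=None):
    """Turn a row dict into text such as "age is 39, income is <=50K".

    If rng is given, the column order is shuffled first (column shuffling).
    """
    items = list(row.items())
    if rng is not None:
        rng.shuffle(items)
    return ", ".join(f"{key} is {value}" for key, value in items)


def sample_uniform_pseudo_rows(column_stats, n_rows, rng):
    """Sample each column independently and uniformly from its own range.

    column_stats maps a column name either to a (low, high) integer range
    or to a list of categories.
    """
    rows = []
    for _ in range(n_rows):
        row = {}
        for name, stats in column_stats.items():
            if isinstance(stats, tuple):
                low, high = stats
                row[name] = rng.randint(low, high)  # inclusive on both ends
            else:
                row[name] = rng.choice(stats)
        rows.append(row)
    return rows


# Illustrative placeholder statistics, not the real dataset statistics.
column_stats = {
    "age": (17, 90),
    "education": ["Bachelors", "HS-grad", "Masters"],
    "income": ["<=50K", ">50K"],
}
rng = random.Random(0)
pseudo_rows = sample_uniform_pseudo_rows(column_stats, n_rows=5, rng=rng)
fixed_order = serialize_row({"age": 39, "income": "<=50K"})
shuffled = [serialize_row(row, rng) for row in pseudo_rows]
```

Because every column is sampled independently, such pseudo rows carry the table's keys and value formats but no cross-column correlations from real records, so a stage 1 model trained on them can learn structure without seeing private content.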

Theorems & Definitions (10)

  • Definition 3.1: Serialization
  • Definition 3.2
  • Definition 3.3
  • Definition A.1
  • Definition A.2
  • Definition A.3
  • Theorem A.4
  • Theorem A.5
  • Theorem A.6
  • Theorem A.7