When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation

Joshua Ward, Bochao Gu, Chi-Hua Wang, Guang Cheng

TL;DR

The paper reveals a privacy vulnerability in LLM-based tabular data generation: memorized digit strings can leak into synthetic outputs. It introduces LevAtt, a Levenshtein-distance-based no-box membership inference attack that targets the string representation of rows, and shows substantial leakage under both in-context learning and supervised fine-tuning, scaling with model size and data volume. To mitigate this risk, the authors propose two defenses, Digit Modifier and Tendency-based Logit Processor (TLP); TLP provides effective privacy protection while preserving data fidelity and downstream utility. The work highlights the need for dedicated privacy auditing of LLM-based tabular generators and points to future defenses that integrate with generation-time dynamics and offer provable privacy guarantees.

Abstract

Large Language Models (LLMs) have recently demonstrated remarkable performance in generating high-quality tabular synthetic data. In practice, two primary approaches have emerged for adapting LLMs to tabular data generation: (i) fine-tuning smaller models directly on tabular datasets, and (ii) prompting larger models with examples provided in context. In this work, we show that popular implementations from both regimes exhibit a tendency to compromise privacy by reproducing memorized patterns of numeric digits from their training data. To systematically analyze this risk, we introduce a simple No-box Membership Inference Attack (MIA) called LevAtt that assumes adversarial access to only the generated synthetic data and targets the string sequences of numeric digits in synthetic observations. Using this approach, our attack exposes substantial privacy leakage across a wide range of models and datasets, and in some cases, is even a perfect membership classifier on state-of-the-art models. Our findings highlight a unique privacy vulnerability of LLM-based synthetic data generation and the need for effective defenses. To this end, we propose two methods, including a novel sampling strategy that strategically perturbs digits during generation. Our evaluation demonstrates that this approach can defeat these attacks with minimal loss of fidelity and utility of the synthetic data.
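To make the attack idea concrete, here is a minimal sketch of a Levenshtein-based no-box membership score. It is not the authors' implementation. The comma-joined row encoding (`encode_row`), the scoring rule (negative minimum edit distance to any synthetic row), and the example rows are all assumptions made for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def encode_row(row: Sequence[object]) -> str:
    # Hypothetical serialization: cell values joined by commas.
    return ",".join(str(v) for v in row)


def membership_score(candidate: Sequence[object],
                     synthetic_rows: Sequence[Sequence[object]]) -> float:
    # Assumed scoring rule: a candidate whose string is close to some
    # synthetic row is scored as more likely to be a training member.
    target = encode_row(candidate)
    return -float(min(levenshtein(target, encode_row(s))
                      for s in synthetic_rows))


# Made-up rows for illustration only (not data from the paper).
synthetic = [(31, 52013.77, 0.4182), (45, 78120.05, 0.9931)]
member = (29, 52013.77, 0.4182)      # shares long digit runs with a synthetic row
non_member = (38, 61250.40, 0.2657)  # shares no long digit runs

print(membership_score(member, synthetic))
print(membership_score(non_member, synthetic))
```

The attacker only needs the synthetic rows, which matches the no-box setting described in the abstract. How the score is thresholded or calibrated is left open here.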

Paper Structure

This paper contains 34 sections, 8 equations, 10 figures, 4 tables, 2 algorithms.

Figures (10)

  • Figure 1: Diagram of Levenshtein Attack. We simply encode rows of tabular data into a string representation from which to attack. LevAtt finds signal in the highly constrained and often duplicated sequences of digits in synthetic tabular data generated by LLMs. In bold and underline: copied sequences of such patterns. Although these rows would be relatively far apart in Euclidean distance, the repeated sequences in their string representations are the source of LevAtt's adversarial advantage.
  • Figure 2: ROC plot for various No-box MIAs against TabPFN-V2 with 128 in-context samples from the MoneyBall dataset. LevAtt (blue) is able to achieve perfect classification for all in-context samples whereas MIAs that target the feature space of tabular data fail to capture the privacy leakage.
  • Figure 3: Correlation plot for No-box MIA AUC-ROC across the ICL experiment. While the feature-space targeting DCR, Density Estimate, and MC are nearly perfectly correlated, LevAtt is much less correlated. This highlights that while privacy leakage over tabular string representations and the feature space are related, LevAtt finds unique adversarial advantage.
  • Figure 4: LevAtt AUC-ROC for various datasets generated by RealTabFormer with increasing synthetic dataset sizes relative to the training set.
  • Figure 5: LevAtt performance on RealTabFormer at various training digit sequence lengths.
  • ...and 5 more figures
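
The figure captions above report attack strength as AUC-ROC. As a rough illustration of that metric, the sketch below computes AUC-ROC directly from membership scores: it is the probability that a randomly chosen member scores higher than a randomly chosen non-member, with ties counted as one half. The function name and the example scores are hypothetical and are not results from the paper.

```python
from typing import Sequence


def auc_roc(member_scores: Sequence[float],
            non_member_scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0   # member ranked above non-member
            elif m == n:
                wins += 0.5   # ties count as half
    return wins / (len(member_scores) * len(non_member_scores))


# Made-up scores: higher means "more likely a training member".
perfect = auc_roc([0.0, -1.0, -2.0], [-5.0, -7.0, -9.0])
chance = auc_roc([1.0, 2.0], [1.0, 2.0])
print(perfect, chance)
```

A value of 1.0 corresponds to the perfect membership classifier mentioned in the Figure 2 caption, and 0.5 corresponds to random guessing.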