TabGen-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation

Liancheng Fang, Aiwei Liu, Hengrui Zhang, Henry Peng Zou, Weizhi Zhang, Philip S. Yu

TL;DR

TabGen-ICL generates high-quality synthetic tabular data without fine-tuning a large language model. It treats in-context example selection as a residual-driven retrieval task: at each iteration it selects real samples that compensate for the remaining gap between the LLM's generated distribution and the real data, measured by a distance $d$ such as $\text{JSD}$ or $\text{KSD}$, with the residual kept to a manageable size $n$. Across five real-world datasets it reduces error on fidelity metrics by $3.5\%$–$42.2\%$ relative to random example selection, remains robust in data-scarce settings, and preserves privacy by avoiding direct copying of training data. The results show that a prompted, fixed LLM with residual-aware example selection and iterative refinement can produce competitive synthetic tabular data without expensive fine-tuning, which is relevant to privacy-preserving data sharing and low-resource scenarios.
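
As an illustration of the kind of distance $d$ mentioned above, the following is a minimal sketch of a Jensen-Shannon divergence between the category frequencies of one real column and one synthetic column. It assumes that $\text{JSD}$ denotes the Jensen-Shannon divergence and that it is applied per categorical column; the function name `jsd` and the example columns are hypothetical and are not taken from the paper's code.

```python
import math
from collections import Counter


def jsd(real_col, synth_col):
    """Jensen-Shannon divergence (base 2, so in [0, 1]) between the
    category frequencies of two columns. Illustrative helper only."""
    p_counts, q_counts = Counter(real_col), Counter(synth_col)
    cats = set(p_counts) | set(q_counts)
    p = {c: p_counts[c] / len(real_col) for c in cats}
    q = {c: q_counts[c] / len(synth_col) for c in cats}
    # Mixture distribution M = (P + Q) / 2
    m = {c: 0.5 * (p[c] + q[c]) for c in cats}

    def kl(a, b):
        # KL(a || b); terms with a[c] == 0 contribute nothing
        return sum(a[c] * math.log2(a[c] / b[c]) for c in cats if a[c] > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    real = ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN"]
    synth = ["INLAND", "INLAND", "INLAND", "NEAR BAY"]
    print(jsd(real, synth))
```

A value of 0 means the two columns have identical category frequencies; larger values indicate a larger gap for the selection step to compensate.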

Abstract

Large language models (LLMs) have achieved encouraging results in tabular data generation. However, existing approaches require fine-tuning, which is computationally expensive. This paper explores an alternative: prompting a fixed LLM with in-context examples. We observe that using randomly selected in-context examples hampers the LLM's performance, resulting in sub-optimal generation quality. To address this, we propose TabGen-ICL, a novel in-context learning framework that enhances the in-context learning ability of LLMs for tabular data generation. TabGen-ICL operates iteratively, retrieving a subset of real samples that represents the residual between the currently generated samples and the true data distribution. This approach serves two purposes: locally, it provides more effective in-context learning examples for the LLM in each iteration; globally, it progressively narrows the gap between generated and real data. Extensive experiments on five real-world tabular datasets demonstrate that TabGen-ICL significantly outperforms the random selection strategy. Specifically, it reduces the error rate on fidelity metrics by $3.5\%$–$42.2\%$. We demonstrate for the first time that prompting a fixed LLM can yield high-quality synthetic tabular data. The code is available at https://github.com/fangliancheng/TabGEN-ICL.
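
The iterative procedure described in the abstract can be sketched roughly as follows. This is a toy illustration under strong assumptions, not the authors' implementation: the table is reduced to a single numeric column, the distance is a two-sample Kolmogorov-Smirnov statistic, the residual is chosen by a simple greedy heuristic, and `mock_llm` is a hypothetical stand-in for a real LLM call. All function names and parameters are invented for this example.

```python
import random
from bisect import bisect_right


def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of a and b."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def select_residual(real, generated, n):
    """Greedily pick n real values whose addition to `generated` brings it
    closest to `real` (an assumed heuristic, not the paper's exact rule)."""
    pool, chosen = list(real), []
    for _ in range(min(n, len(pool))):
        best = min(pool, key=lambda x: ks_distance(generated + chosen + [x], real))
        chosen.append(best)
        pool.remove(best)
    return chosen


def mock_llm(examples, batch_size, rng):
    """Hypothetical stand-in for prompting an LLM: returns noisy
    variations of the in-context examples."""
    return [rng.choice(examples) + rng.gauss(0, 0.1) for _ in range(batch_size)]


def generate(real, rounds=5, batch_size=20, n=5, seed=0):
    rng = random.Random(seed)
    examples = rng.sample(real, n)  # first prompt: random in-context examples
    generated = []
    for _ in range(rounds):
        generated += mock_llm(examples, batch_size, rng)
        # Next prompt uses the residual between generated and real data
        examples = select_residual(real, generated, n)
    return generated


if __name__ == "__main__":
    real = [float(i) for i in range(50)]
    synthetic = generate(real)
    print(len(synthetic), ks_distance(synthetic, real))
```

The point of the sketch is the control flow: each batch is generated from examples that target the regions the previous batches under-covered, rather than from a fresh random draw.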

Paper Structure

This paper contains 39 sections, 3 equations, 5 figures, 6 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of samples generated with different in-context learning examples. Plots show the latitude and longitude coordinates of California housing, with the solid line representing the state boundary. (a) 2000 samples generated by the LLM with only the table header as input, without any in-context examples. (b) 2000 samples generated by the LLM, given in-context examples sampled from the real dataset. (c) 2000 synthetic samples generated by the LLM, given in-context examples whose latitude and longitude fall in a fixed range. (d) 2000 samples from the ground truth training table.
  • Figure 2: Overview of the TabGen-ICL framework. Synthetic samples are generated in batches. At each prompt iteration, TabGen-ICL retrieves a subset of real samples that acts as the residual between the currently generated samples and the real data. These residual samples are used as in-context examples to prompt the LLM in the next iteration. The full prompt template is available in the Appendix.
  • Figure 3: Quality comparison under data scarcity. TabGen-ICL and CLLM achieve the highest quality scores under the few-shot setting, while TabSyn and GReaT fail to generate meaningful data.
  • Figure 4: Privacy comparison: Distributions of the DCR scores between the synthetic dataset and the training/holdout datasets. TabGen-ICL and Curated-LLM (CLLM) are both employed with GPT-4o-mini.
  • Figure 5: Visual comparison: 2D scatter plot of the Longitude and Latitude attributes of the California dataset. Real represents the original training dataset. All sets are downsampled to 3000 rows for better visualization. TabGen-ICL generates spatially coherent synthetic data that closely matches the distribution of the original dataset.
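
The DCR scores referenced in Figure 4 can be illustrated with a small sketch. It assumes that DCR means "distance to closest record", that rows are numeric tuples on comparable scales, and that the L1 distance is used; the paper's exact definition is not given in this section, and the helper names below are hypothetical.

```python
def dcr(synthetic, reference):
    """For each synthetic row, the minimum L1 distance to any reference row."""
    return [
        min(sum(abs(s - r) for s, r in zip(row, ref)) for ref in reference)
        for row in synthetic
    ]


def closer_to_train_share(synthetic, train, holdout):
    """Fraction of synthetic rows that lie strictly closer to the training
    set than to the holdout set."""
    d_train = dcr(synthetic, train)
    d_hold = dcr(synthetic, holdout)
    return sum(t < h for t, h in zip(d_train, d_hold)) / len(synthetic)


if __name__ == "__main__":
    train = [(0.0, 1.0), (2.0, 2.0)]
    holdout = [(10.0, 9.0), (5.0, 5.0)]
    synthetic = [(0.0, 0.0), (10.0, 10.0), (4.0, 4.0)]
    print(dcr(synthetic, train))
    print(closer_to_train_share(synthetic, train, holdout))
```

Under this reading, a generator that copies training rows would produce DCR values to the training set that are systematically smaller than those to the holdout set, whereas similar distributions for both suggest the synthetic rows are not memorized copies.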

Theorems & Definitions (3)

  • Definition 1: LLM Generation Distribution
  • Definition 2: Residual
  • Remark 1