TabGen-ICL: Residual-Aware In-Context Example Selection for Tabular Data Generation
Liancheng Fang, Aiwei Liu, Hengrui Zhang, Henry Peng Zou, Weizhi Zhang, Philip S. Yu
TL;DR
TabGen-ICL tackles the problem of generating high-quality synthetic tabular data without fine-tuning large language models. It treats in-context learning as a residual-driven retrieval task: at each iteration it selects the real samples that best compensate for the remaining gap between the LLM's generated distribution and the true one, guided by a distance $d$ such as $\text{JSD}$ or $\text{KSD}$ and a manageable residual size $n$. Empirical results across five real-world datasets show error-rate reductions of $3.5\%$ to $42.2\%$ on fidelity metrics relative to random example selection, as well as robustness in data-scarce settings, while preserving privacy by avoiding direct copying of training data. The work demonstrates that a prompted LLM, augmented with residual-aware in-context selection and iterative refinement, can generate competitive synthetic tabular data without expensive fine-tuning, with potential practical impact on privacy-preserving data sharing and low-resource scenarios.
Abstract
Large language models (LLMs) have achieved encouraging results in tabular data generation. However, existing approaches require fine-tuning, which is computationally expensive. This paper explores an alternative: prompting a fixed LLM with in-context examples. We observe that randomly selected in-context examples hamper the LLM's performance, resulting in sub-optimal generation quality. To address this, we propose TabGen-ICL, a novel in-context learning framework that enhances the in-context learning ability of LLMs for tabular data generation. TabGen-ICL operates iteratively, retrieving a subset of real samples that represents the residual between the distribution of the currently generated samples and the true data distribution. This approach serves two purposes: locally, it provides more effective in-context examples for the LLM in each iteration; globally, it progressively narrows the gap between generated and real data. Extensive experiments on five real-world tabular datasets demonstrate that TabGen-ICL significantly outperforms the random selection strategy. Specifically, it reduces the error rate on fidelity metrics by $3.5\%$ to $42.2\%$. We demonstrate for the first time that prompting a fixed LLM can yield high-quality synthetic tabular data. The code is available at https://github.com/fangliancheng/TabGEN-ICL.
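The iterative loop described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation. It assumes a single numerical column, a histogram-based Jensen-Shannon divergence, and a simple weighted-sampling heuristic for choosing residual examples. The function `mock_generate` is a hypothetical stand-in for the LLM call (it only resamples the examples with noise), and all function names and parameters here are illustrative.

```python
import numpy as np


def hist_prob(values, bins):
    """Normalized histogram of `values` over fixed bin edges."""
    counts, _ = np.histogram(values, bins=bins)
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)


def jsd(p, q):
    """Jensen-Shannon divergence with log base 2, so it lies in [0, 1]."""
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # b > 0 wherever a > 0, because b is the mixture m
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def select_residual(real, generated, n, bins, rng):
    """Assumed heuristic: sample n real values, favoring bins where the
    generated data is under-represented relative to the real data."""
    deficit = np.clip(hist_prob(real, bins) - hist_prob(generated, bins), 0.0, None)
    bin_idx = np.clip(np.digitize(real, bins) - 1, 0, len(bins) - 2)
    weights = deficit[bin_idx] + 1e-9  # keep every weight strictly positive
    weights /= weights.sum()
    return rng.choice(real, size=n, replace=False, p=weights)


def mock_generate(examples, k, rng, noise=0.1):
    """Hypothetical stand-in for prompting an LLM with in-context examples."""
    return rng.choice(examples, size=k, replace=True) + rng.normal(0.0, noise, size=k)


def run(real, n=20, k=50, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    bins = np.histogram_bin_edges(real, bins=10)
    # First round: no generated data yet, so fall back to random examples.
    examples = rng.choice(real, size=n, replace=False)
    generated = mock_generate(examples, k, rng)
    history = [jsd(hist_prob(real, bins), hist_prob(generated, bins))]
    for _ in range(iters):
        # Later rounds: pick examples that represent the current residual.
        examples = select_residual(real, generated, n, bins, rng)
        generated = np.concatenate([generated, mock_generate(examples, k, rng)])
        history.append(jsd(hist_prob(real, bins), hist_prob(generated, bins)))
    return generated, history


# Toy bimodal "real" column, used only for illustration.
data_rng = np.random.default_rng(42)
real = np.concatenate([data_rng.normal(-2.0, 0.5, 700), data_rng.normal(3.0, 1.0, 300)])
generated, history = run(real, n=20, k=50, iters=10)
print([round(h, 4) for h in history])
```

The printed list tracks the divergence between the real and accumulated generated samples after each round. In an actual pipeline the mock generator would be replaced by a prompt to the LLM, and the selection rule and distance would follow the paper's definitions rather than this heuristic.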
