AIGT: AI Generative Table Based on Prompt
Mingming Zhang, Zhiqing Xiao, Guoshan Lu, Sai Wu, Weiqiang Wang, Xing Fu, Can Yi, Junbo Zhao
TL;DR
Tabular data are pervasive in enterprise contexts but pose privacy and data-sharing challenges for model development. The authors propose AIGT, a prompt-enhanced language-model framework that uses table metadata and a long-token partitioning mechanism to synthesize high-quality tabular data. Generation is autoregressive, $p(t)=\prod_{i=1}^{N} p(w_i \mid w_1,\ldots,w_{i-1})$, with the goal that the synthetic data distribution match the real one, $p(D_{syn})\approx p(D)$. The approach is pre-trained on a large table corpus (STabs) and fine-tuned on diverse downstream tasks, achieving state-of-the-art performance on 14 of 20 public datasets plus two industrial Alipay datasets, as measured by ML efficiency, distance to closest record (DCR), and augmentation gains. This work advances privacy-preserving data sharing and real-world risk-control applications by enabling scalable, metadata-driven tabular data synthesis with improved quality and utility.
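As a small illustration of the autoregressive factorization above, the following Python sketch computes $\log p(t)$ as a sum of conditional log-probabilities. The toy conditional table `TOY` and the helper names are hypothetical and stand in for a real language model; they are not taken from AIGT.

```python
import math
from typing import Callable, Optional, Sequence

def sequence_log_prob(
    tokens: Sequence[str],
    cond_prob: Callable[[Sequence[str], str], float],
) -> float:
    """Chain rule: log p(t) = sum_i log p(w_i | w_1, ..., w_{i-1})."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(cond_prob(tokens[:i], tok))
    return total

# Hypothetical toy model (assumption): the conditional depends only on
# the previous token. A real LLM would condition on the full prefix.
TOY = {
    (None, "age"): 0.5,
    ("age", "is"): 1.0,
    ("is", "39"): 0.25,
}

def toy_cond_prob(prefix: Sequence[str], token: str) -> float:
    prev: Optional[str] = prefix[-1] if prefix else None
    return TOY.get((prev, token), 1e-9)  # tiny floor for unseen pairs

row_tokens = ["age", "is", "39"]
log_p = sequence_log_prob(row_tokens, toy_cond_prob)
prob = math.exp(log_p)  # 0.5 * 1.0 * 0.25
```

Summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long token sequences.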
Abstract
Tabular data, which accounts for over 80% of enterprise data assets, is vital in various fields. With growing concerns about privacy protection and data-sharing restrictions, generating high-quality synthetic tabular data has become essential. Recent advances show that large language models (LLMs) can effectively generate realistic tabular data by leveraging semantic information and overcoming the challenges of high-dimensional data that arise from one-hot encoding. However, current methods do not fully utilize the rich information available in tables. To address this, we introduce AI Generative Table (AIGT) based on prompt enhancement, a novel approach that uses metadata, such as table descriptions and schemas, as prompts to generate ultra-high-quality synthetic data. To overcome the token limit constraints of LLMs, we propose long-token partitioning algorithms that enable AIGT to model tables of any scale. AIGT achieves state-of-the-art performance on 14 out of 20 public datasets and two real industry datasets within the Alipay risk control system.
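The two ideas in the abstract, metadata prompts and partitioning under a token limit, can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the authors' algorithm: the "column is value" serialization, the whitespace token counter, the greedy split, and all function names are hypothetical.

```python
from typing import Dict, List

def build_prompt(description: str, schema: Dict[str, str]) -> str:
    """Turn table metadata (description + schema) into a text prompt."""
    cols = ", ".join(f"{name} ({dtype})" for name, dtype in schema.items())
    return f"Table: {description}. Columns: {cols}."

def serialize(row: Dict[str, object], columns: List[str]) -> str:
    """Serialize selected columns as 'column is value' text (assumed format)."""
    return ", ".join(f"{c} is {row[c]}" for c in columns)

def count_tokens(text: str) -> int:
    """Whitespace split as a stand-in for a real tokenizer (assumption)."""
    return len(text.split())

def partition_columns(row: Dict[str, object], max_tokens: int) -> List[List[str]]:
    """Greedily group columns so each group's text fits the token budget."""
    partitions: List[List[str]] = []
    current: List[str] = []
    used = 0
    for col, val in row.items():
        cost = count_tokens(f"{col} is {val},")
        if current and used + cost > max_tokens:
            partitions.append(current)  # close the full group
            current, used = [], 0
        current.append(col)
        used += cost
    if current:
        partitions.append(current)
    return partitions

prompt = build_prompt("Adult census income", {"age": "int", "income": "category"})
row = {"age": 39, "education": "Bachelors", "occupation": "Sales", "income": "<=50K"}
parts = partition_columns(row, max_tokens=6)
first_chunk = serialize(row, parts[0])
```

With a budget of six whitespace tokens, the four-column row splits into two groups of two columns each; each group would then be paired with the metadata prompt and modeled separately. A production version would count tokens with the model's own tokenizer and would need some way to keep the partitions of one row consistent with each other, which this sketch does not address.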
