AIGT: AI Generative Table Based on Prompt

Mingming Zhang, Zhiqing Xiao, Guoshan Lu, Sai Wu, Weiqiang Wang, Xing Fu, Can Yi, Junbo Zhao

TL;DR

Tabular data are pervasive in enterprise contexts but pose privacy and data-sharing challenges for model development. The authors propose AIGT, a prompt-enhanced language-model framework that leverages table metadata and a long-token partitioning mechanism to synthesize high-quality tabular data, with autoregressive generation modeled by $p(t)=\prod_{i=1}^{N} p(w_i|w_1,...,w_{i-1})$ and the target constraint $p(D_{syn})\approx p(D)$. The approach is pre-trained on a large table corpus (STabs) and fine-tuned on diverse downstream tasks, achieving state-of-the-art performance on 14 of 20 public datasets plus two industrial Alipay datasets, as measured by ML efficiency, DCR, and augmentation gains. This work advances privacy-preserving data sharing and real-world risk-control applications by enabling scalable, metadata-driven tabular data synthesis with improved quality and utility.
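
The autoregressive factorization above can be illustrated with a minimal sketch. It assumes a hypothetical `cond_prob(token, prefix)` callable standing in for the language model; the uniform toy model and the example tokens are illustrative only and do not come from the paper.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Chain rule: log p(t) = sum_i log p(w_i | w_1, ..., w_{i-1})."""
    total = 0.0
    for i, tok in enumerate(tokens):
        # tokens[:i] is the prefix w_1 .. w_{i-1} that conditions token i
        total += math.log(cond_prob(tok, tokens[:i]))
    return total

def uniform_model(token, prefix):
    """Hypothetical stand-in model: uniform over a 4-token vocabulary."""
    return 0.25

row_tokens = ["age", "is", "39", ","]
log_p = sequence_log_prob(row_tokens, uniform_model)
```

Working in log space avoids numerical underflow when the product runs over long serialized rows.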

Abstract

Tabular data, which accounts for over 80% of enterprise data assets, is vital in various fields. With growing concerns about privacy protection and data-sharing restrictions, generating high-quality synthetic tabular data has become essential. Recent advancements show that large language models (LLMs) can effectively generate realistic tabular data by leveraging semantic information and overcoming the challenges of high-dimensional data that arise from one-hot encoding. However, current methods do not fully utilize the rich information available in tables. To address this, we introduce AI Generative Table (AIGT) based on prompt enhancement, a novel approach that utilizes metadata, such as table descriptions and schemas, as prompts to generate ultra-high-quality synthetic data. To overcome the token limit constraints of LLMs, we propose long-token partitioning algorithms that enable AIGT to model tables of any scale. AIGT achieves state-of-the-art performance on 14 out of 20 public datasets and two real industry datasets within the Alipay risk control system.
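
As a rough illustration of using metadata as a prompt, the sketch below prepends a table description and schema to a serialized row. The function names, the prompt template, and the "column is value" encoding are assumptions made for this example; the abstract does not specify AIGT's exact format.

```python
def build_prompt(description, schema):
    """Compose a metadata prompt from a table description and a column schema."""
    cols = ", ".join(f"{name} ({dtype})" for name, dtype in schema.items())
    return f"Table: {description} Columns: {cols}."

def serialize_row(row):
    """Encode a row as 'column is value' clauses (an assumed textual encoding)."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())

# Hypothetical table metadata and a single example row
schema = {"income": "categorical", "age": "integer", "education": "categorical"}
prompt = build_prompt("Census records of adults.", schema)
row = {"income": ">50K", "age": 39, "education": "Bachelors"}

# Text that would be fed to the language model: prompt first, then the row
text = f"{prompt} {serialize_row(row)}"
```

Keeping the prompt and the row in separate functions makes it straightforward to reuse the same prompt for every row of a table.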

Paper Structure

This paper contains 35 sections, 1 equation, 7 figures, 9 tables.

Figures (7)

  • Figure 1: The architecture of the proposed AIGT. First, AIGT is pre-trained on the collected pre-training corpus; then it is fine-tuned on each downstream table to learn the complex relationships between features; finally, sample rows are composed using the trained language model.
  • Figure 2: Training strategies for AIGT. No loss is computed on the prompt tokens, the label feature is fixed in the first position, and the other features are randomly permuted.
  • Figure 3: The process of the long-token partitioning algorithm in AIGT. The table is divided by columns into sub-tables whose column sets overlap, and the data from the sub-tables are used to train the model and to generate samples.
  • Figure 4: Distance to closest record (DCR) distribution for the California Housing dataset. “Original” denotes the DCR of the original test set with respect to the original training set. The results indicate that none of the methods copies samples from the training set.
  • Figure 5: Ablation experiments related to training strategies. The y-axis is the average metric values across all datasets with LightGBM.
  • ...and 2 more figures
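
The procedures described in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `order_features` and `partition_columns` are hypothetical names, and the partition width is counted in columns here, whereas the actual constraint is a token budget.

```python
import random

def order_features(columns, label, rng):
    """Put the label column first and randomly permute the remaining columns."""
    others = [c for c in columns if c != label]
    rng.shuffle(others)
    return [label] + others

def partition_columns(columns, width, overlap):
    """Split columns into consecutive groups of at most `width` columns,
    where neighbouring groups share `overlap` columns."""
    if not 0 <= overlap < width:
        raise ValueError("need 0 <= overlap < width")
    step = width - overlap
    groups = []
    start = 0
    while True:
        groups.append(columns[start:start + width])
        if start + width >= len(columns):
            break  # the last group reaches the final column
        start += step
    return groups

columns = ["label", "f1", "f2", "f3", "f4", "f5", "f6"]
ordered = order_features(columns, "label", random.Random(0))
groups = partition_columns(columns, width=4, overlap=2)
```

With `width=4` and `overlap=2`, the seven columns fall into three sub-tables, and each adjacent pair shares two columns. How the shared columns are used to link sub-tables during generation is not stated in the captions, so this sketch stops at the split.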