GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction
Tung Sum Thomas Kwok, Chi-Hua Wang, Guang Cheng
TL;DR
GReaTER addresses bottlenecks in realistic multi-modal and multi-table tabular data synthesis by introducing a Data Semantic Enhancement System, which differentiates and clarifies categorical labels, and a Cross-table Connecting Method, which restructures relationships across tables using dimensionality reduction. The approach yields higher-fidelity synthetic data than the prior GReaT/DEREC baselines, as demonstrated on a multi-table CTR dataset and evaluated with distributional similarity metrics. Both semantic enrichment and cross-table collaboration contribute substantially to fidelity, with the understandability-based transformation and threshold-based independence yielding notable gains. The work highlights the value of semantically rich input and cross-table modeling, and argues for employing more powerful LLMs in future work to further improve in-context learning and synthesis quality.
Abstract
Tabular data synthesis involves not only multi-table synthesis but also the generation of multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, handling numerical and categorical data separately has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Despite this, the framework's performance is constrained by two issues: (1) tabular data entries lack sufficient semantic meaning, limiting the LLM's ability to leverage pre-trained knowledge for in-context learning, and (2) effective relationships for collaboration are difficult to establish across complex multi-table datasets. To address these, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves the LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method that establishes efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.
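To make the idea of semantic enhancement through mapping concrete, the following is a minimal sketch and not the authors' implementation. It assumes that opaque categorical codes are replaced by descriptive labels before each row is serialized into a sentence for the LLM. The column names, the contents of `LABEL_MAP`, the helper names `enhance_row` and `serialize_row`, and the "column is value" sentence template are all hypothetical choices made for illustration.

```python
from typing import Dict, Mapping

# Hypothetical mapping from opaque category codes to descriptive labels.
# In practice such a mapping would come from dataset documentation.
LABEL_MAP: Dict[str, Dict[str, str]] = {
    "gender": {"0": "female", "1": "male"},
    "device": {"2": "mobile phone", "3": "tablet"},
}


def enhance_row(row: Mapping[str, object],
                label_map: Mapping[str, Mapping[str, str]]) -> Dict[str, str]:
    """Replace coded values with descriptive labels where a mapping exists.

    Values without a mapping (e.g., numeric columns) are kept as strings.
    """
    enhanced = {}
    for col, val in row.items():
        key = str(val)
        enhanced[col] = label_map.get(col, {}).get(key, key)
    return enhanced


def serialize_row(row: Mapping[str, str]) -> str:
    """Encode a whole row as one sentence (assumed 'column is value' template)."""
    return ", ".join(f"{col} is {val}" for col, val in row.items())


raw = {"gender": 1, "device": 2, "age": 34}
text = serialize_row(enhance_row(raw, LABEL_MAP))
print(text)
```

Under these assumptions, the serialized row carries words such as "male" and "mobile phone" rather than bare codes, which is the kind of input a pre-trained language model can relate to its existing knowledge.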
