GReaTER: Generate Realistic Tabular data after data Enhancement and Reduction

Tung Sum Thomas Kwok, Chi-Hua Wang, Guang Cheng

TL;DR

GReaTER addresses bottlenecks in realistic multi-modal and multi-table tabular data synthesis by introducing a Data Semantic Enhancement System that differentiates and clarifies categorical labels, and a Cross-table Connecting Method that restructures relationships across tables with dimensionality reduction. The approach yields higher-fidelity synthetic data than the prior GReaT/DEREC baselines, demonstrated on a multi-table CTR dataset and evaluated with distributional similarity metrics. Key findings show that both semantic enrichment and cross-table collaboration contribute substantially to fidelity, with the understandability-based transformation and threshold-based independence yielding notable gains. The work highlights the value of semantically rich input and cross-table modeling, and argues for employing more powerful LLMs in future work to further improve in-context learning and synthesis quality.

Abstract

Tabular data synthesis involves not only multi-table synthesis but also generating multi-modal data (e.g., strings and categories), which enables diverse knowledge synthesis. However, separating numerical and categorical data has limited the effectiveness of tabular data generation. The GReaT (Generate Realistic Tabular Data) framework uses Large Language Models (LLMs) to encode entire rows, eliminating the need to partition data types. Despite this, the framework's performance is constrained by two issues: (1) tabular data entries lack sufficient semantic meaning, limiting the LLM's ability to leverage pre-trained knowledge for in-context learning, and (2) complex multi-table datasets struggle to establish effective relationships for collaboration. To address these, we propose GReaTER (Generate Realistic Tabular Data after data Enhancement and Reduction), which includes: (1) a data semantic enhancement system that improves the LLM's understanding of tabular data through mapping, enabling better in-context learning, and (2) a cross-table connecting method to establish efficient relationships across complex tables. Experimental results show that GReaTER outperforms the GReaT framework.
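
To make the row-encoding and mapping ideas above concrete, the following is a minimal sketch, not the authors' implementation. It serializes a row as comma-separated "column: value" pairs, first with raw category codes and then with column-specific labels. The function name `encode_row` and every label in `mappings` are hypothetical, invented purely for illustration; the sample row is the one quoted in the Figure 2 caption.

```python
def encode_row(row, mappings=None):
    """Serialize one table row as 'column: value' pairs joined by commas."""
    mappings = mappings or {}
    parts = []
    for column, value in row.items():
        # Swap a bare category code for a column-specific label when one exists.
        label = mappings.get(column, {}).get(value, value)
        parts.append(f"{column}: {label}")
    return ", ".join(parts)


row = {"Name": "Grace", "Lunch": 1, "Dinner": 2, "Access Device": 1, "Genre": 1}

# Hypothetical label mappings (assumptions, not taken from the paper).
mappings = {
    "Lunch": {1: "sandwich", 2: "salad"},
    "Dinner": {1: "steak", 2: "pasta"},
    "Access Device": {1: "mobile phone", 2: "laptop"},
    "Genre": {1: "comedy", 2: "drama"},
}

plain = encode_row(row)               # repeated '1's are indistinguishable
enhanced = encode_row(row, mappings)  # each code becomes a distinct phrase
print(plain)
print(enhanced)
```

In the plain encoding the three '1' values share the same surface form, whereas the mapped encoding gives each category its own wording, which is the kind of input the semantic enhancement step aims to produce.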

Paper Structure

This paper contains 34 sections, 5 equations, 12 figures, 1 algorithm.

Figures (12)

  • Figure 1: Overview of GReaTER: (1) Extract the parent table for the multi-table synthesizer based on contextual variables. (2) Improve the data semantic level so that the textual encoder transforms tabular data into semantically meaningful sentences for training the LLM. (3) Join the remaining two child tables together while reducing the engaged subject bias from direct flattening.
  • Figure 2: A simple example of GReaT implementation: The row observation is textually encoded in the form 'Name: Grace, Lunch: 1, Dinner: 2, Access Device: 1, Genre: 1', but the LLM would struggle to differentiate between the repeating '1's and would tokenize the different '1's into the same embeddings. To make our work easier to follow, we use this example throughout the paper.
  • Figure 3: Numerical categories are transformed into unique objects to help the LLM understand the data. Two transformation mappings are proposed, one focusing only on differentiability and the other also taking data understandability into account.
  • Figure 4: Logic flow of the Cross-table Connection Method: (0) Flattening two tables creates (0.1) the dimensionality problem: a $2\times 5$ table flattened with a $2 \times 7$ table leads to a $13\times 4$ table, and (0.2) engaged subject bias: an engaged subject like 'Yin' dominates the distribution with 8 out of 13 observations. (1) Determine columns with low correlation with all other features (two methods are proposed). (2) Separate these columns from the table. (3) Keep only the unique items and append the columns back to the table via bootstrap sampling.
  • Figure 5: Correlation heatmap before and after column removal
  • ...and 7 more figures
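
Steps (1) to (3) of the Figure 4 caption can be sketched roughly as follows. This is one possible reading of the caption under stated assumptions, not the paper's algorithm: it assumes numeric columns, uses absolute Pearson correlation with a fixed threshold as the low-correlation test (the caption says two methods are proposed but does not detail them here), interprets "keep only the unique items" as deduplicating the remaining rows, and reattaches the separated columns by sampling with replacement. All function names, the threshold value, and the toy data are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def low_correlation_columns(table, threshold):
    """Step 1: columns whose |correlation| with every other column is below threshold."""
    names = list(table)
    selected = []
    for a in names:
        others = [abs(pearson(table[a], table[b])) for b in names if b != a]
        if others and max(others) < threshold:
            selected.append(a)
    return selected


def connect(table, threshold, seed=0):
    """Steps 2-3: separate independent columns, dedupe the rest, bootstrap them back."""
    rng = random.Random(seed)
    independent = low_correlation_columns(table, threshold)
    kept = [c for c in table if c not in independent]
    n_rows = len(next(iter(table.values())))
    seen, rows = set(), []
    for i in range(n_rows):
        key = tuple(table[c][i] for c in kept)
        if key not in seen:  # keep only unique rows of the remaining columns
            seen.add(key)
            rows.append(dict(zip(kept, key)))
    for row in rows:
        for c in independent:
            # Bootstrap: draw with replacement from the column's original values.
            row[c] = rng.choice(table[c])
    return independent, rows


# Toy flattened table (invented data): 'noise' is uncorrelated with the rest.
flattened = {
    "age": [20, 20, 30, 30, 40, 40],
    "spend": [2, 2, 3, 3, 4, 4],
    "noise": [1, 2, 1, 2, 1, 2],
}
independent, reduced = connect(flattened, threshold=0.3)
print(independent, len(reduced))
```

On this toy input the six flattened rows shrink to three unique rows once the uncorrelated column is set aside, which illustrates how the separation step can counter the row inflation caused by direct flattening.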