TabuLa: Harnessing Language Models for Tabular Data Synthesis
Zilong Zhao, Robert Birke, Lydia Chen
TL;DR
Tabula addresses privacy-driven constraints on sharing tabular data with an LLM-based synthesizer that does not start from NLP-pretrained weights. It combines token sequence compression with a novel Middle Padding strategy and trains a randomly initialized DistilGPT-2 architecture directly on tabular data. Across six datasets, Tabula converges faster and achieves higher synthetic data utility than five SOTA LLM-based methods. The work shows that random initialization can be more effective for tabular synthesis than starting from NLP-pretrained weights, and that iteratively fine-tuning the model across tabular synthesis tasks further accelerates convergence and improves generalization. These findings position Tabula as a practical, efficient foundation model for generating high-quality synthetic tabular data and for adapting quickly to new tabular synthesis tasks in real-world settings.
Abstract
Tabular data synthesis is crucial for addressing privacy and security concerns in industries reliant on tabular data. While recent advancements adopt large language models (LLMs) for realistic tabular data generation, their long training times and limited reusability hinder practical applications. In this paper, we propose Tabula, a tabular data synthesizer that leverages the structure of LLMs. Unlike state-of-the-art (SOTA) LLM-based tabular data synthesizers that rely on pre-trained LLMs, Tabula discards the pre-trained weights originally designed for natural language tasks, focusing instead on a tailored approach for tabular data. In addition, Tabula introduces a token sequence compression strategy that significantly reduces training time while maintaining data quality, alongside a novel token padding method that improves sequence alignment across training batches. Experiments on six datasets show that Tabula achieves superior synthetic data utility compared to current SOTA methods. Additionally, the results demonstrate that a Tabula model trained on tabular datasets serves effectively as a foundation model for synthesizing new tabular datasets. Furthermore, the proposed padding method outperforms the conventional left and right padding strategies. Finally, the results show that Tabula reduces training time per epoch by 46.2% on average compared to SOTA LLM approaches while achieving higher data utility. Our code is available at https://github.com/zhao-zilong/Tabula
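The abstract contrasts the proposed padding method with conventional left and right padding but does not spell out the mechanism. The sketch below is one plausible reading, not the authors' implementation: it assumes that the Middle Padding strategy pads each column's token subsequence to that column's maximum length, so that a given column starts at the same position in every row. The helper names, the `PAD` id, and the token ids are all hypothetical.

```python
from typing import List

PAD = 0  # hypothetical padding token id

# A row is a list of columns; each column is a list of token ids.
Row = List[List[int]]


def flatten(row: Row) -> List[int]:
    """Concatenate the per-column token lists into one sequence."""
    return [tok for col in row for tok in col]


def right_pad(rows: List[Row]) -> List[List[int]]:
    """Conventional right padding: pad at the end of the whole sequence."""
    flat = [flatten(r) for r in rows]
    width = max(len(f) for f in flat)
    return [f + [PAD] * (width - len(f)) for f in flat]


def left_pad(rows: List[Row]) -> List[List[int]]:
    """Conventional left padding: pad at the start of the whole sequence."""
    flat = [flatten(r) for r in rows]
    width = max(len(f) for f in flat)
    return [[PAD] * (width - len(f)) + f for f in flat]


def middle_pad(rows: List[Row]) -> List[List[int]]:
    """Assumed middle padding: pad each column to its own maximum length,
    so every column occupies the same positions in every row."""
    n_cols = len(rows[0])
    col_width = [max(len(r[c]) for r in rows) for c in range(n_cols)]
    return [
        [
            tok
            for c, col in enumerate(r)
            for tok in col + [PAD] * (col_width[c] - len(col))
        ]
        for r in rows
    ]


# Two rows with three columns each; token ids are made up for illustration.
rows = [
    [[11], [21, 22, 23], [31]],
    [[12, 13], [24], [32]],
]

print(right_pad(rows))
print(left_pad(rows))
print(middle_pad(rows))
```

Under this assumption, left and right padding leave the second and third columns at different offsets in the two rows, whereas the per-column variant places each column at a fixed offset, which is one way to read "sequence alignment across training batches".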
