Large Scale Transfer Learning for Tabular Data via Language Modeling

Josh Gardner, Juan C. Perdomo, Ludwig Schmidt

TL;DR

TabuLa-8B is a language model for tabular prediction, obtained by fine-tuning Llama 3-8B on T4 (the Tremendous TabLib Trawl), a large filtered tabular corpus, using a row-causal attention mask and a text serialization scheme. T4 provides over 4 million tables with over 2.1 billion rows and 100 billion tokens to enable large-scale transfer learning for tabular data. Across 329 datasets, TabuLa-8B reaches zero-shot accuracy more than 15 percentage points above random guessing and, in the 1- to 32-shot setting, is 5-15 percentage points more accurate than XGBoost and TabPFN without any fine-tuning on the target datasets. The model, code, and data are released openly. The work highlights both the promise and the practical considerations of tabular foundation models, and outlines directions for future research and for safer, more transparent deployment.

Abstract

Tabular data -- structured, heterogeneous, spreadsheet-style data with rows and columns -- is widely used in practice across many domains. However, while recent foundation models have reduced the need for developing task-specific datasets and predictors in domains such as language modeling and computer vision, this transfer learning paradigm has not had similar impact in the tabular domain. In this work, we seek to narrow this gap and present TabuLa-8B, a language model for tabular prediction. We define a process for extracting a large, high-quality training dataset from the TabLib corpus, proposing methods for tabular data filtering and quality control. Using the resulting dataset, which comprises over 2.1B rows from over 4M unique tables, we fine-tune a Llama 3-8B large language model (LLM) for tabular data prediction (classification and binned regression) using a novel packing and attention scheme for tabular prediction. Through evaluation across a test suite of 329 datasets, we find that TabuLa-8B has zero-shot accuracy on unseen tables that is over 15 percentage points (pp) higher than random guessing, a feat that is not possible with existing state-of-the-art tabular prediction models (e.g. XGBoost, TabPFN). In the few-shot setting (1-32 shots), without any fine-tuning on the target datasets, TabuLa-8B is 5-15 pp more accurate than XGBoost and TabPFN models that are explicitly trained on equal, or even up to 16x more data. We release our model, code, and data along with the publication of this paper.
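The abstract mentions "binned regression", that is, treating a numeric target as a classification problem over value ranges. The following is a minimal sketch of that idea, not the authors' implementation: the use of equal-frequency (quantile) bins, the default of four bins, and the helper names `make_bin_edges` and `to_bin_label` are all assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import quantiles


def make_bin_edges(values, n_bins=4):
    """Return the n_bins - 1 interior cut points at equal-frequency quantiles.

    Assumption: quantile bins with n_bins=4; the abstract does not specify
    how bins are chosen.
    """
    return quantiles(values, n=n_bins, method="inclusive")


def to_bin_label(value, edges):
    """Map a numeric value to the index of the bin it falls into."""
    return bisect_right(edges, value)


# Hypothetical numeric target column.
prices = [10.0, 12.0, 15.0, 20.0, 22.0, 30.0, 45.0, 80.0]
edges = make_bin_edges(prices)
labels = [to_bin_label(p, edges) for p in prices]
```

Once the target is a bin index, the regression task can be scored with classification accuracy, which is the metric the abstract reports.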

Paper Structure (68 sections, 1 equation, 18 figures, 1 table)

Figures (18)

  • Figure 1: TabuLa-8B outperforms state-of-the-art tabular baselines on 0- to 32-shot tasks from five tabular benchmarks.
  • Figure 2: (a) Illustration of the row-causal tabular mask (RCTM) representing a batch during training. Each triangular block represents potentially many rows from a single table (detail shown at left). Shaded groups within this block represent tokens from one row in the table. This structure implicitly trains the model for few-shot learning by permitting it to attend to previous rows from the table, but not to rows in other tables. (b) Serialization of tabular data into text. The model is trained to produce the tokens following the <|endinput|> token.
  • Figure 3: Sketch of the dataset generation pipeline. 627M tables from TabLib (Eggert et al., 2023) are filtered by applying rules at the table, row, and column level. Then, for each table, we identify valid and high-quality prediction targets in an unsupervised manner and use the results for training TabuLa-8B.
  • Figure 4: Zero- and few-shot accuracy across five tabular benchmarks. For each benchmark, we evaluate on all tasks, but in the figures above we only display the subset of tasks where $k$ shots fit into the 8192-token context window of TabuLa-8B. Complete results are in the supplementary section containing the per-task results table. The final plot (lower right) shows curves separately over decontaminated vs. potentially-contaminated evaluation tasks (see the results section on contamination); we find no impact on our overall findings due to contamination (and performance on tasks that may be in our training set is lower on average, across all models).
  • Figure 5: Summary metrics for the T4 dataset.
  • ...and 13 more figures
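
The two ideas in the Figure 2 caption, serializing rows into text and masking attention so that tokens see only earlier tokens from the same table, can be sketched as follows. This is a rough illustration under stated assumptions rather than the paper's code: the prompt template in `serialize_row` is hypothetical (only the `<|endinput|>` token comes from the caption), and `row_causal_mask` builds a plain nested-list mask from per-token table ids instead of a tensor suited to a real training loop.

```python
def serialize_row(features, target_name, target_value):
    """Turn one table row into a prompt/completion string.

    Assumption: the "The <column> is <value>." template is hypothetical.
    The text after <|endinput|> is what the model is trained to produce.
    """
    body = " ".join(f"The {k} is {v}." for k, v in features.items())
    return f"Predict the {target_name}: {body}<|endinput|>{target_value}"


def row_causal_mask(table_ids):
    """Build a boolean attention mask for a packed token sequence.

    table_ids[i] is the id of the table that token i came from.
    mask[i][j] is True when token i may attend to token j: j must not come
    after i (causal) and must belong to the same table as i.
    """
    n = len(table_ids)
    return [
        [j <= i and table_ids[i] == table_ids[j] for j in range(n)]
        for i in range(n)
    ]


# Two tables packed into one sequence: 3 tokens from table 0, 2 from table 1.
mask = row_causal_mask([0, 0, 0, 1, 1])
example = serialize_row({"age": 41, "job": "nurse"}, "income", "high")
```

Printed as a matrix, `mask` shows one lower-triangular block per table, matching the triangular blocks described in the caption. Because later rows of a table can attend to earlier rows of that same table, those earlier rows act as in-context examples.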