TabSTAR: A Tabular Foundation Model for Tabular Data with Text Fields

Alan Arazi, Eilam Shapira, Roi Reichart

TL;DR

TabSTAR addresses the challenge of learning from tabular data with free-text features by introducing semantically target-aware representations and an unfrozen text encoder to enable end-to-end cross-dataset transfer. The model verbalizes features and targets, fuses semantic and numerical information, and uses a Transformer-based interaction module with shared prediction heads to support any number of classes without dataset-specific parameters. Empirically, TabSTAR achieves state-of-the-art performance on classification tasks with text features and exhibits scaling laws with the size of pretraining data, suggesting strong potential for further gains with larger corpora and model families. The work also analyzes design choices (encoder unfreezing, pretraining scale, and numerical verbalization) to guide future improvements and discusses practical considerations like inference cost and memory for real-world deployment.

Abstract

While deep learning has achieved remarkable success across many domains, it has historically underperformed on tabular learning tasks, which remain dominated by gradient boosting decision trees. However, recent advancements are paving the way for Tabular Foundation Models, which can leverage real-world knowledge and generalize across diverse datasets, particularly when the data contains free-text. Although incorporating language model capabilities into tabular tasks has been explored, most existing methods utilize static, target-agnostic textual representations, limiting their effectiveness. We introduce TabSTAR: a Tabular Foundation Model with Semantically Target-Aware Representations. TabSTAR is designed to enable transfer learning on tabular data with textual features, with an architecture free of dataset-specific parameters. It unfreezes a pretrained text encoder and takes as input target tokens, which provide the model with the context needed to learn task-specific embeddings. TabSTAR achieves state-of-the-art performance for both medium- and large-sized datasets across known benchmarks of classification tasks with text features, and its pretraining phase exhibits scaling laws in the number of datasets, offering a pathway for further performance improvements.
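
To make the idea of verbalized features and target tokens more concrete, here is a minimal sketch of how one table row and the possible class values might be turned into text elements before they reach a text encoder. The string templates, the function names (`verbalize_feature`, `verbalize_targets`, `build_sequence`), and the toy row are all assumptions chosen for illustration; they are not taken from the TabSTAR implementation, and the numerical fusion step is not shown.

```python
from typing import Dict, List, Union

Value = Union[str, int, float]


def verbalize_feature(name: str, value: Value) -> str:
    """Render one cell as text: the column name followed by its value.

    The "name: value" template is a hypothetical choice for this sketch.
    """
    return f"{name}: {value}"


def verbalize_targets(target_name: str, classes: List[str]) -> List[str]:
    """Render every possible class as its own text element.

    Because classes are expressed as text rather than as output indices,
    the number of classes can vary from dataset to dataset without adding
    dataset-specific parameters.
    """
    return [f"Target. {target_name}: {c}" for c in classes]


def build_sequence(
    row: Dict[str, Value], target_name: str, classes: List[str]
) -> List[str]:
    """Concatenate target elements and feature elements for one example."""
    targets = verbalize_targets(target_name, classes)
    features = [verbalize_feature(k, v) for k, v in row.items()]
    return targets + features


# Hypothetical toy example: one numerical and one free-text feature.
row = {"age": 45, "report": "Patient reports chest pain"}
seq = build_sequence(row, "diagnosis", ["healthy", "sick"])
for element in seq:
    print(element)
```

In a full model, each of these strings would be embedded by the text encoder, and the target elements would let the encoder condition its representations on the task at hand.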

Paper Structure

This paper contains 86 sections, 8 figures, 29 tables.

Figures (8)

  • Figure 1: The TabSTAR architecture illustrated with our toy dataset. The model processes numerical features, textual features, and all possible target values for classification.
  • Figure 2: Comparison of normalized scores with 95% CIs between TabSTAR and baseline models in classification tasks, evaluated on up to 10,000 examples (left) and above 10,000 (right).
  • Figure 3: Comparison of normalized scores with 95% CIs between TabSTAR and baseline models in regression tasks, evaluated on up to 10,000 examples (left) and above 10,000 (right).
  • Figure 4: Performance as a function of the number of encoder layers unfrozen: Validation loss during TabSTAR's pretraining (left) and normalized scores with 95% CIs on the downstream tasks (right). Unfreezing even a single encoder layer significantly improves the performance of TabSTAR.
  • Figure 5: Average performance on downstream tasks as a function of the number of pretraining datasets (in log scale). We use AUROC for classification (left), and $R^2$ for regression (right).
  • ...and 3 more figures