Generalization Can Emerge in Tabular Foundation Models From a Single Table

Junwei Ma, Nour Shaheen, Alex Labach, Amine Mhedhbi, Frank Hutter, Anthony L. Caterini, Valentin Thomas

TL;DR

The paper addresses cross-domain generalization in tabular in-context learning and questions the necessity of large pre-training corpora. It shows that a transformer trained from scratch with a simple self-supervised objective on a single real table can generalize to diverse benchmarks, and it analyzes which data properties and pre-training task structures most drive this transfer. The study uses $N_{train}=88$ datasets and $N_{eval}=107$ tasks across benchmarks, and reports $R^2=0.67$ for meta-predicted generalization. Key findings indicate that the number of features and, crucially, the number of pre-training tasks (task diversity) are the primary drivers of generalization, more so than dataset size. This work suggests a data-efficient path to Tabular Foundation Models and provides guidance on designing pre-training regimes that maximize task variety and feature coverage.

Abstract

Deep tabular modelling increasingly relies on in-context learning where, during inference, a model receives a set of $(x,y)$ pairs as context and predicts labels for new inputs without weight updates. We challenge the prevailing view that broad generalization here requires pre-training on large synthetic corpora (e.g., TabPFN priors) or a large collection of real data (e.g., TabDPT training datasets), discovering that a relatively small amount of data suffices for generalization. We find that simple self-supervised pre-training on just a single real table can produce surprisingly strong transfer across heterogeneous benchmarks. By systematically pre-training and evaluating on many diverse datasets, we analyze what aspects of the data are most important for building a Tabular Foundation Model (TFM) generalizing across domains. We then connect this to the pre-training procedure shared by most TFMs and show that the number and quality of tasks one can construct from a dataset is key to downstream performance.
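
To make "tasks one can construct from a dataset" concrete, here is a minimal sketch of one way a single table might be turned into many in-context tasks. It assumes, purely for illustration, that a task is defined by one target column plus a random non-empty subset of the remaining columns, with rows split into context and query sets. The names `sample_task` and `count_tasks` and the toy table are hypothetical; this is not the authors' pre-training procedure.

```python
import random


def sample_task(table, n_context, rng):
    """Draw one in-context task from a table given as a list of equal-length rows.

    Assumption (not from the paper): a task is one target column plus a random
    non-empty subset of the other columns used as features.
    """
    n_cols = len(table[0])
    target = rng.randrange(n_cols)
    others = [c for c in range(n_cols) if c != target]
    features = sorted(rng.sample(others, rng.randint(1, len(others))))

    # Shuffle row indices, then split them into context rows and query rows.
    rows = list(range(len(table)))
    rng.shuffle(rows)

    def to_xy(r):
        return [table[r][c] for c in features], table[r][target]

    return {
        "target": target,
        "features": features,
        "context": [to_xy(r) for r in rows[:n_context]],
        "query": [to_xy(r) for r in rows[n_context:]],
    }


def count_tasks(n_cols):
    """Number of distinct (target, non-empty feature subset) pairs for n_cols columns."""
    return n_cols * (2 ** (n_cols - 1) - 1)


# Hypothetical toy table: 4 rows, 3 columns.
toy_table = [
    [1.0, 0.2, 5.0],
    [2.0, 0.1, 3.0],
    [3.0, 0.4, 8.0],
    [4.0, 0.3, 1.0],
]
task = sample_task(toy_table, n_context=2, rng=random.Random(0))
```

Under this assumed construction, the number of distinct tasks grows exponentially with the number of columns (`count_tasks(3)` is 9, `count_tasks(10)` is 5110), while adding rows only changes which context and query rows a task can draw.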

Paper Structure

This paper contains 8 sections, 4 figures.

Figures (4)

  • Figure 1: Transfer from a single pre‑training dataset. Left: Training only on vectorized MNIST (treated as a table) and evaluating on California Housing. Middle and Right: Training on the Colleges dataset and evaluating on the full CC-18 and CTR-23 evaluation suites, respectively.
  • Figure 2: Universality of the dataset quality. Figure 2a: We compute the rank of training sets for each evaluation dataset and then plot a histogram of the Spearman correlation between the ranks on all pairs of evaluation datasets. Most evaluation datasets have highly correlated ranks. Figure 2b: We group the training and evaluation datasets into distinct domains and plot the density map for average ranks (lower is better). Training and evaluation pairs from the same domain do not appear to transfer better than ones from different domains.
  • Figure 3: Not all cells in a table are created equal: in a pre-training dataset for tabular ICL models, the number of features matters much more than the number of instances.
  • Figure 4: Downstream AUC as a function of the number of unique tasks used during training. More tasks during pre-training consistently lead to better transfer. This demonstrates that both the number of tasks and the quality of tasks drive generalization performance.
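
The rank-agreement analysis described for Figure 2a can be sketched as follows. This is a minimal, dependency-free illustration, assuming a score matrix with one row per evaluation dataset and one column per pre-training table, where higher scores are better. The function names and the toy scores are hypothetical and do not come from the paper.

```python
from itertools import combinations
from statistics import mean


def average_ranks(values):
    """Rank values so that 1 is the best (highest); ties share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def pairwise_spearman(scores):
    """Spearman correlation (Pearson on ranks) for every pair of evaluation datasets."""
    ranks = [average_ranks(row) for row in scores]
    return [pearson(ranks[i], ranks[j]) for i, j in combinations(range(len(ranks)), 2)]


# Hypothetical scores: 3 evaluation datasets (rows) x 4 pre-training tables (columns).
toy_scores = [
    [0.81, 0.74, 0.90, 0.65],
    [0.70, 0.66, 0.88, 0.60],
    [0.55, 0.62, 0.79, 0.50],
]
correlations = pairwise_spearman(toy_scores)
```

A histogram of `correlations` concentrated near 1 would mean that the same pre-training tables tend to rank well regardless of which evaluation dataset is used.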