Generalization Can Emerge in Tabular Foundation Models From a Single Table

Junwei Ma; Nour Shaheen; Alex Labach; Amine Mhedhbi; Frank Hutter; Anthony L. Caterini; Valentin Thomas

TL;DR

The paper addresses cross-domain generalization in tabular in-context learning and questions the necessity of large pre-training corpora. It shows that a transformer trained from scratch with a simple self-supervised objective on a single real table can generalize to diverse benchmarks. It then analyzes which data properties and pre-training task structures most drive this transfer, using $N_{train}=88$ pre-training datasets and $N_{eval}=107$ evaluation tasks across benchmarks, and reports $R^2=0.67$ for meta-predicted generalization. Key findings indicate that the number of features and, crucially, the number of pre-training tasks (task diversity) are primary drivers of generalization, more so than dataset size. This work suggests a data-efficient path to Tabular Foundation Models and provides guidance on how to design pre-training regimes that maximize task variety and feature coverage.

Abstract

Deep tabular modelling increasingly relies on in-context learning where, during inference, a model receives a set of $(x,y)$ pairs as context and predicts labels for new inputs without weight updates. We challenge the prevailing view that broad generalization here requires pre-training on large synthetic corpora (e.g., TabPFN priors) or a large collection of real data (e.g., TabDPT training datasets), discovering that a relatively small amount of data suffices for generalization. We find that simple self-supervised pre-training on just a \emph{single} real table can produce surprisingly strong transfer across heterogeneous benchmarks. By systematically pre-training and evaluating on many diverse datasets, we analyze what aspects of the data are most important for building a Tabular Foundation Model (TFM) generalizing across domains. We then connect this to the pre-training procedure shared by most TFMs and show that the number and quality of \emph{tasks} one can construct from a dataset are key to downstream performance.
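To make the notion of constructing \emph{tasks} from a single table concrete, here is a minimal sketch under stated assumptions. The recipe is illustrative and not taken from the paper: one column is chosen as the target, a random non-empty subset of the remaining columns serves as features, and the rows are split into a context set of $(x,y)$ pairs and a query set. The names `sample_task` and `count_column_tasks` are hypothetical.

```python
import random


def sample_task(table, rng, n_context):
    """Build one self-supervised task from a table (list of equal-length rows).

    Assumed recipe (not the paper's confirmed procedure): pick a target
    column, keep a random non-empty subset of the other columns as features,
    then split the shuffled rows into context and query sets.
    """
    n_cols = len(table[0])
    target = rng.randrange(n_cols)
    others = [c for c in range(n_cols) if c != target]
    k = rng.randint(1, len(others))  # number of feature columns to keep
    features = sorted(rng.sample(others, k))

    rows = list(table)  # shallow copy so the caller's row order is untouched
    rng.shuffle(rows)
    pairs = [([row[c] for c in features], row[target]) for row in rows]
    return {
        "target": target,
        "features": features,
        "context": pairs[:n_context],  # (x, y) pairs shown to the model
        "query": pairs[n_context:],    # (x, y) pairs whose y is to be predicted
    }


def count_column_tasks(n_cols):
    """Count distinct (target, non-empty feature subset) choices."""
    return n_cols * (2 ** (n_cols - 1) - 1)


# Toy table with 10 rows and 4 numeric columns.
toy_table = [[float(i), float(i % 3), float(i * i), float(i % 2)] for i in range(10)]
task = sample_task(toy_table, random.Random(0), n_context=7)
```

Under this assumed recipe, the number of distinct column choices grows exponentially with the number of columns, which is one way to see how a single wide table could yield many pre-training tasks. The paper's own task construction and analysis may differ.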
