TabDPT: Scaling Tabular Foundation Models on Real Data

Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C. Cresswell, Keyvan Golestan, Guangwei Yu, Anthony L. Caterini, Maksims Volkovs

TL;DR

TabDPT tackles the challenge of generalizing tabular foundation models across heterogeneous real-world datasets by combining ICL-based retrieval with self-supervised learning (SSL) on real data. The method uses a row-based transformer encoder with a shared backbone for classification and regression, trained on SSL targets with retrieval-based context that matches the retrieval used at inference time. Empirical results on CC18 and CTR23 show TabDPT achieving state-of-the-art or competitive performance, with evidence that real data and retrieval-based pre-training yield faster convergence and better generalization to unseen tasks, and that model and data scaling follow power-law trends similar to those of LLMs. The work provides open-source implementations and datasets, highlighting practical impact for scalable, cross-task tabular modeling and suggesting directions for metadata integration and time-series extension.

Abstract

Tabular data is one of the most ubiquitous sources of information worldwide, spanning a wide variety of domains. This inherent heterogeneity has slowed the development of Tabular Foundation Models (TFMs) capable of fast generalization to unseen datasets. In-Context Learning (ICL) has recently emerged as a promising solution for TFMs, enabling dynamic adaptation to new tasks without additional tuning. While many studies have attempted to re-purpose large language models for tabular ICL, they have had limited success, so recent works have focused on developing tabular-specific foundation models. In this work, we propose an approach to combine ICL-based retrieval with self-supervised learning to train tabular foundation models. We also investigate the utility of real vs. synthetic data for model pre-training, and show that real data can contain useful signal not easily captured in synthetic training. Specifically, we show that incorporating real data during the pre-training phase can lead to significantly faster training and better downstream generalization to unseen data. Our resulting model, TabDPT, achieves top performance on both regression (CTR23) and classification (CC18) benchmarks. Importantly, we also demonstrate that with our pre-training procedure, scaling both model and data size leads to consistent performance improvements that follow power laws. This echoes scaling laws in LLMs and other foundation models, and suggests that Internet-scale TFMs may be achievable. We open-source our full pipeline: inference code including trained model weights can be found at github.com/layer6ai-labs/TabDPT-inference, and the training code to reproduce experiments can be found at github.com/layer6ai-labs/TabDPT-training.
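
The abstract refers to ICL-based retrieval without specifying the mechanism. As a rough illustration only, the sketch below assumes plain Euclidean k-nearest-neighbour retrieval over numeric training rows to pick the context for one query row. The function name `retrieve_context` and the toy data are hypothetical and are not taken from the TabDPT code base; the actual retrieval procedure may differ.

```python
import math


def retrieve_context(query_row, train_rows, train_targets, k):
    """Return the k training rows (and their targets) closest to query_row.

    Assumption: Euclidean distance on already-numeric features.
    This is an illustrative stand-in, not the paper's implementation.
    """
    # Rank training-row indices by distance to the query row.
    order = sorted(
        range(len(train_rows)),
        key=lambda i: math.dist(query_row, train_rows[i]),
    )
    nearest = order[:k]
    return [train_rows[i] for i in nearest], [train_targets[i] for i in nearest]


# Toy data, made up for illustration.
train_rows = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1]]
train_targets = [0, 0, 1, 0]

# The two nearest rows become the in-context examples for this query.
ctx_x, ctx_y = retrieve_context([0.1, 0.1], train_rows, train_targets, k=2)
```

Under this assumption, the retrieved rows and targets would be passed to the model as context alongside the query, so each prediction conditions on a small, local neighbourhood rather than on the full training table.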

Paper Structure

This paper contains 26 sections, 2 equations, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: Scaling behavior for our foundation tabular models. Increasing model or pre-training data size (number of cells) leads to consistent improvements predictable by power laws (fitted solid lines).
  • Figure 2: (a) We sample $B$ tables from different datasets to construct $X \in \mathbb{R}^{B \times N \times F_{\max}}$ and $y \in \mathbb{R}^{B \times N}$. (b) $X$ and $y$ are partitioned into context $\{X_\text{ctx}, y_\text{ctx}\}$ and query $X_\text{qy}$ inputs and passed through embedding functions (indicated by rectangle/triangle). Embeddings of $X_\text{ctx}$ and $y_\text{ctx}$ are summed together, concatenated with context embedding of $X_\text{qy}$, and passed through a transformer encoder to get classification $\hat{y}_\text{cls}$ or regression $\hat{y}_\text{reg}$ prediction for the query. Loss between this prediction and query targets $y_\text{qy}$ is used to update the model.
  • Figure 3: (a) Pairwise win-rate comparison. A win is counted for the method that achieves the higher classification/regression accuracy/$R^2$ on a given dataset. (b) Inference runtime vs performance. TabDPT models are ordered by context size. Non-TFM baseline runtimes are the total of hyperparameter optimization and inference.
  • Figure 4: (a) Ablation of key components in training (Tr) and inference (Inf). A higher blue bar and a higher green bar indicate greater reduction in AUC and $R^2$ respectively. (b) Test loss curves on CC18 when training with and without SSL on real data as well as synthetic data only.
  • Figure : One Training Step of TabDPT
  • ...and 5 more figures
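
The data layout described in the Figure 2 caption can be sketched in a few lines. The example below is a minimal illustration, assuming that tables with fewer than $F_{\max}$ features are zero-padded on the right and that the context/query split simply takes the first rows as context. Both choices, along with the helper names `pad_features` and `split_context_query`, are assumptions made for illustration; the caption does not state how padding or partitioning is done, and the embedding and transformer steps are omitted.

```python
def pad_features(table, f_max):
    """Right-pad every row with zeros so it has exactly f_max features.

    Assumption: zero-padding; the caption only states the F_max dimension.
    """
    return [row + [0.0] * (f_max - len(row)) for row in table]


def split_context_query(X, y, n_ctx):
    """Split each table's N rows into context (first n_ctx) and query (rest).

    Assumption: a positional split, used here only to show the shapes.
    """
    X_ctx = [t[:n_ctx] for t in X]
    y_ctx = [t[:n_ctx] for t in y]
    X_qy = [t[n_ctx:] for t in X]
    y_qy = [t[n_ctx:] for t in y]
    return X_ctx, y_ctx, X_qy, y_qy


# Two toy tables (B = 2), each with N = 3 rows but different feature counts.
tables = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.5, 0.1, 0.9], [0.2, 0.3, 0.4], [0.7, 0.8, 0.6]],
]
targets = [[0.0, 1.0, 0.0], [2.5, 3.5, 1.5]]

F_MAX = 4
X = [pad_features(t, F_MAX) for t in tables]  # nested lists of shape B x N x F_max
X_ctx, y_ctx, X_qy, y_qy = split_context_query(X, targets, n_ctx=2)
```

Here `X_ctx` and `y_ctx` play the role of $\{X_\text{ctx}, y_\text{ctx}\}$, `X_qy` is the query input, and `y_qy` holds the query targets against which the loss would be computed.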