Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data

Anurag Garg, Muhammad Ali, Noah Hollmann, Lennart Purucker, Samuel Müller, Frank Hutter

TL;DR

The paper improves tabular foundation models by bridging synthetic and real-world data through a two-stage continued pre-training of TabPFNv2 on curated real-world OpenML and Kaggle tables, aided by an L2-SP regularizer and a small learning rate. The resulting model, Real-TabPFN, raises mean normalized ROC-AUC from $0.954$ to $0.976$ across 29 AMLB datasets and outperforms all baselines without hyperparameter tuning. Two findings stand out: a larger context during continued pre-training helps, and OpenML and Kaggle data are complementary, with their combination beating both purely synthetic and single-source corpora. The results highlight the value of real-world data for in-context learning on tabular tasks, and the authors release open-source weights for broader adoption and further research.
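The L2-SP regularizer mentioned above penalizes the squared distance between the current weights and the pre-trained starting point, rather than the distance to zero used by ordinary weight decay. This keeps the continued pre-training from drifting far from the synthetic-data model. The sketch below illustrates the idea in plain Python on flat lists of floats. The function names, the `strength` coefficient, and the choice to regularize every parameter are assumptions made for illustration; this summary does not state the paper's exact formulation or hyperparameters.

```python
def l2_sp_penalty(weights, pretrained_weights, strength):
    """Return (strength / 2) * sum((w - w0)^2) over all parameters.

    `weights` are the current parameters, `pretrained_weights` the
    starting point (here: the synthetically pre-trained model).
    `strength` is a hypothetical coefficient, not a value from the paper.
    """
    if len(weights) != len(pretrained_weights):
        raise ValueError("weight vectors must have the same length")
    squared_distance = sum(
        (w - w0) ** 2 for w, w0 in zip(weights, pretrained_weights)
    )
    return 0.5 * strength * squared_distance


def regularized_loss(task_loss, weights, pretrained_weights, strength):
    """Add the L2-SP penalty to a task loss computed elsewhere."""
    return task_loss + l2_sp_penalty(weights, pretrained_weights, strength)
```

The penalty is zero when the weights equal the pre-trained ones and grows quadratically as they move away, so a larger `strength` ties the model more tightly to its starting point.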

Abstract

Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.

Paper Structure

This paper contains 13 sections, 1 equation, 6 figures.

Figures (6)

  • Figure 1: Per-dataset normalized ROC AUC comparison of TabPFN (default) and Real-TabPFN (ours) on the 29 datasets from the OpenML AutoML Benchmark. "Wilcoxon p" refers to the p-value of the two-sided Wilcoxon signed-rank test.
  • Figure 2: Distribution of dataset sizes (number of rows and features) from various sources. The prevalence of smaller datasets in broad corpora like CommonCrawl and GitTables contrasts with the larger datasets from OpenML and Kaggle.
  • Figure 3: Mean normalized ROC AUC comparison of Real-TabPFN with the default and tuned versions of all baselines on the AutoML Benchmark. Scores were normalized per dataset, with 1.0 representing the best and 0.0 the worst performance with respect to all baselines.
  • Figure 4: Increase in normalized ROC AUC as the continued-pre-training context grows. Gains are shown relative to the base TabPFNv2 model, which was pre-trained on synthetic data with a context size of 2,048.
  • Figure 5: Increase in normalized ROC AUC as the training data source is varied. Gains are shown relative to the base TabPFNv2 model, which was pre-trained on synthetic data.
  • ...and 1 more figures
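
The per-dataset normalization described in the Figure 3 caption (1.0 for the best model, 0.0 for the worst) can be read as min-max scaling of raw ROC AUC scores within each dataset, followed by averaging across datasets. The sketch below follows that reading. The function names, the model labels in the usage note, and the handling of ties (all models get 1.0 when every score is equal) are assumptions for illustration, not details confirmed by the paper.

```python
def normalize_scores(scores):
    """Min-max scale one dataset's raw scores to [0, 1].

    `scores` maps a model name to its raw ROC AUC on a single dataset.
    The best model maps to 1.0 and the worst to 0.0.
    """
    best = max(scores.values())
    worst = min(scores.values())
    if best == worst:
        # Assumption: if all models tie, treat each of them as best.
        return {model: 1.0 for model in scores}
    return {
        model: (score - worst) / (best - worst)
        for model, score in scores.items()
    }


def mean_normalized(per_dataset_scores):
    """Average the normalized scores of each model across datasets.

    `per_dataset_scores` is a non-empty list of dicts, one per dataset,
    all containing the same model names.
    """
    totals = {}
    for scores in per_dataset_scores:
        for model, value in normalize_scores(scores).items():
            totals[model] = totals.get(model, 0.0) + value
    return {model: total / len(per_dataset_scores) for model, total in totals.items()}
```

For example, with hypothetical models `"a"`, `"b"`, and `"c"`, calling `mean_normalized([{"a": 1.0, "b": 0.5, "c": 0.75}, {"a": 0.5, "b": 1.0, "c": 0.5}])` averages each model's rank-like position over two datasets. Because the scale depends on which baselines are included, normalized values are only comparable within the same set of models.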