Real-TabPFN: Improving Tabular Foundation Models via Continued Pre-training With Real-World Data
Anurag Garg, Muhammad Ali, Noah Hollmann, Lennart Purucker, Samuel Müller, Frank Hutter
TL;DR
The paper improves tabular foundation models by bridging synthetic and real-world data through two-stage continued pre-training of TabPFNv2 on curated real-world OpenML and Kaggle tables, using an L2-SP regularizer and a small learning rate. Real-TabPFN raises mean normalized ROC-AUC from $0.954$ to $0.976$ across 29 AMLB datasets and outperforms all baselines without hyperparameter tuning. Two insights stand out: a larger context during continued pre-training helps, and OpenML and Kaggle data are complementary, with the combination beating purely synthetic or single-source corpora. The findings highlight the value of real-world data for in-context learning on tabular tasks, and the authors release open-source weights for broader adoption and further research.
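To make the L2-SP regularizer mentioned above concrete, here is a minimal sketch in plain Python. It assumes the standard L2-SP form, a penalty of (alpha / 2) * ||theta - theta_0||^2 that pulls the weights back toward their pre-trained starting point theta_0. The function names, the flat-list parameter layout, and all numeric values are hypothetical and for illustration only; they are not taken from the Real-TabPFN code.

```python
def l2_sp_penalty(params, anchor, alpha):
    """Return (alpha / 2) * ||params - anchor||^2 for flat lists of floats."""
    return 0.5 * alpha * sum((p - a) ** 2 for p, a in zip(params, anchor))


def sgd_step_with_l2_sp(params, task_grads, anchor, alpha, lr):
    """One SGD step on (task loss + L2-SP penalty).

    The penalty contributes alpha * (p - a) to each gradient, which pulls
    every weight back toward its pre-trained value a.
    """
    return [
        p - lr * (g + alpha * (p - a))
        for p, g, a in zip(params, task_grads, anchor)
    ]


# Hypothetical values, chosen only to illustrate the mechanics.
pretrained = [0.5, -1.0, 2.0]   # weights after synthetic pre-training (the anchor)
current = [0.7, -1.0, 1.5]      # weights partway through continued pre-training
task_grads = [0.0, 0.0, 0.0]    # zero task gradient isolates the penalty's effect

penalty = l2_sp_penalty(current, pretrained, alpha=0.1)
updated = sgd_step_with_l2_sp(current, task_grads, pretrained, alpha=0.1, lr=1.0)
```

With a zero task gradient, each updated weight moves a fraction of the way back toward the anchor, and a weight that already equals its anchor stays put. A small learning rate plays a similar role by limiting how far any single step can drift from the pre-trained solution.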
Abstract
Foundation models for tabular data, like TabPFN, achieve strong performance on small datasets when pre-trained solely on synthetic data. We show that this performance can be significantly boosted by a targeted continued pre-training phase. Specifically, we demonstrate that leveraging a small, curated collection of large, real-world datasets for continued pre-training yields superior downstream predictive accuracy compared to using broader, potentially noisier corpora like CommonCrawl or GitTables. Our resulting model, Real-TabPFN, achieves substantial performance gains on 29 datasets from the OpenML AutoML Benchmark.
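Gains across many benchmark datasets are usually summarized by one aggregate, such as the mean normalized ROC-AUC quoted in the TL;DR. The sketch below shows one common convention, assumed here rather than taken from the paper: min-max scale the ROC-AUC of all compared methods within each dataset, then average per method across datasets. The method names, dataset names, and scores are hypothetical.

```python
def normalize_per_dataset(scores):
    """Min-max scale {method: roc_auc} for one dataset into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All methods tie on this dataset; treat every method as best.
        return {method: 1.0 for method in scores}
    return {method: (s - lo) / (hi - lo) for method, s in scores.items()}


def mean_normalized(results):
    """Average normalized scores per method.

    results maps dataset name -> {method: roc_auc}; every dataset is
    assumed to report the same set of methods.
    """
    normalized = [normalize_per_dataset(s) for s in results.values()]
    methods = normalized[0].keys()
    return {m: sum(n[m] for n in normalized) / len(normalized) for m in methods}


# Hypothetical ROC-AUC values for three methods on two datasets.
results = {
    "dataset_1": {"model_a": 0.90, "model_b": 0.80, "model_c": 0.85},
    "dataset_2": {"model_a": 0.70, "model_b": 0.60, "model_c": 0.70},
}
summary = mean_normalized(results)
```

Normalizing within each dataset keeps easy datasets, where every method scores near 1, from drowning out differences on harder ones.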
