TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, Mihir Manium, Rosen Yu, Felix Jablonski, Shi Bin Hoo, Anurag Garg, Jake Robertson, Magnus Bühler, Vladyslav Moroshan, Lennart Purucker, Clara Cornu, Lilly Charlotte Wehrhahn, Alessandro Bonetto, Bernhard Schölkopf, Sauraj Gambhir, Noah Hollmann, Frank Hutter

TL;DR

TabPFN-2.5 advances tabular foundation modeling by scaling to up to 50,000 samples and 2,000 features while achieving state-of-the-art forward-pass performance on TabArena and matching AutoGluon 1.4 when tuned. It introduces deeper architectures, richer priors, and new calibration/inference modules, plus a distillation engine that converts TabPFN-2.5 into fast, deployable MLPs or tree ensembles without sacrificing much accuracy. The paper shows strong results on both public benchmarks and internal datasets, demonstrates faster inference with multi-GPU setups and FlashAttention, and highlights strong causal-inference performance via RealCause and meta-learners. Collectively, TabPFN-2.5 solidifies tabular foundation models as practical, scalable building blocks for production systems and points toward millions-of-rows capabilities through retrieval, fine-tuning, and architectural innovations.

Abstract

The first tabular foundation model, TabPFN, and its successor TabPFNv2 have impacted tabular AI substantially, with dozens of methods building on them and hundreds of applications across different use cases. This report introduces TabPFN-2.5, the next generation of our tabular foundation model, built for datasets with up to 50,000 data points and 2,000 features, a 20x increase in data cells compared to TabPFNv2. TabPFN-2.5 is now the leading method for the industry-standard benchmark TabArena (which contains datasets with up to 100,000 training data points), substantially outperforming tuned tree-based models and matching the accuracy of AutoGluon 1.4, a complex four-hour tuned ensemble that even includes the previous TabPFNv2. Remarkably, default TabPFN-2.5 has a 100% win rate against default XGBoost on small to medium-sized classification datasets (<=10,000 data points, 500 features) and an 87% win rate on larger datasets up to 100K samples and 2K features (85% for regression). For production use cases, we introduce a new distillation engine that converts TabPFN-2.5 into a compact MLP or tree ensemble, preserving most of its accuracy while delivering orders-of-magnitude lower latency and plug-and-play deployment. This new release will immediately strengthen the performance of the many applications and methods already built on the TabPFN ecosystem.
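
The "20x increase in data cells" can be checked with simple arithmetic. The sketch below is illustrative only: the TabPFN-2.5 limits are taken from the abstract, while the TabPFNv2 limits (10,000 rows and 500 features) are an assumption, chosen to match the small to medium-sized range quoted above. The names `TABPFN_V2_LIMITS`, `TABPFN_25_LIMITS`, and `data_cells` are hypothetical and not part of any TabPFN API.

```python
# Recommended maximum dataset sizes.
# TabPFN-2.5 values come from the abstract; the TabPFNv2 values are an
# assumption (10,000 rows x 500 features), not stated explicitly here.
TABPFN_V2_LIMITS = {"rows": 10_000, "features": 500}
TABPFN_25_LIMITS = {"rows": 50_000, "features": 2_000}


def data_cells(limits: dict) -> int:
    """Return the number of table cells (rows x features) for a size limit."""
    return limits["rows"] * limits["features"]


cells_v2 = data_cells(TABPFN_V2_LIMITS)
cells_25 = data_cells(TABPFN_25_LIMITS)
scale_factor = cells_25 // cells_v2

print(f"TabPFNv2:   {cells_v2:,} cells")
print(f"TabPFN-2.5: {cells_25:,} cells")
print(f"Increase:   {scale_factor}x")
```

Under these assumed limits, the row count grows 5x and the feature count 4x, which together give the 20x figure.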

Paper Structure

This paper contains 45 sections, 20 figures, 2 tables.

Figures (20)

  • Figure 1: TabPFN-2.5 performance on the standard TabArena-lite benchmark [erickson2025tabarena], TabPFNv2 classification subset. TabPFN-2.5 outperforms any other model in a forward pass, and marks a strong leap from TabPFNv2. When fine-tuned on real data, Real-TabPFN-2.5 shows even stronger performance. The horizontal dotted line stands for AutoGluon 1.4 extreme mode tuned for 4 hours, an ensemble of models including TabPFNv2.
  • Figure 1: Summary of TabPFN model variants. Max Rows and Features are the recommended maximum sizes. The models also fit larger datasets but are not built and evaluated for these settings.
  • Figure 2: TabPFN-2.5 clearly outperforms TabPFNv2. We show normalized performance for each dataset of the TabPFNv2 subset of TabArena. TabPFN-2.5 often performs much better and is never much worse.
  • Figure 3: TabArena-Lite results on classification (left) and regression (right), restricted to datasets with fewer than 10K training samples and 500 features. Note that tuning for TabPFN-2.5 is only based on 60 random configs compared to 200 for the baselines. The vertical dotted line stands for AutoGluon 1.4 extreme mode tuned for 4 hours, an ensemble of models including TabPFNv2 [autogluon_tabular].
  • Figure 4: TabArena-Lite results on classification (left) and regression (right), evaluated on all datasets, going up to 100K training rows and 2K features. Note that tuning for TabPFN-2.5 is only based on 60 random configs compared to 200 for the baselines, and that we removed the "dt-pfn" option from our tuning search space for the 4 largest datasets in the benchmark to reduce the tuning time. The vertical dotted line stands for AutoGluon 1.4 extreme mode tuned for 4 hours, an ensemble of models including TabPFNv2 [autogluon_tabular].
  • ...and 15 more figures