TabArena: A Living Benchmark for Machine Learning on Tabular Data

Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter

TL;DR

TabArena introduces a living benchmark for tabular IID data, addressing the stagnation of static benchmarks by combining continuous curation protocols, a public leaderboard, and a team of maintainers. It initializes the platform with 51 carefully selected datasets and 16 models (including tabular foundation models), using robust evaluation designs that emphasize cross-validation ensembles and hyperparameter exploration. Key findings show that post-hoc ensembling drives peak performance, deep learning methods catch up with adequate time budgets, and tabular foundation models shine on small datasets; ensembles across models push the state of the art further. The work also discusses data curation challenges, reproducibility, and the societal and environmental trade-offs of large-scale benchmarking, laying the groundwork for a scalable, open, and responsible living benchmark ecosystem.

Abstract

With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
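
To make "ensembling of hyperparameter configurations" concrete, the following is a minimal sketch of greedy ensemble selection with replacement in the spirit of Caruana et al. (2004). It is an illustration only, not TabArena's implementation: the function name `greedy_ensemble_selection`, the use of mean squared error as the validation metric, and the default of 25 iterations are all assumptions made for this example.

```python
import numpy as np


def greedy_ensemble_selection(val_preds, y_val, n_iterations=25):
    """Greedy ensemble selection with replacement (illustrative sketch).

    val_preds: array of shape (n_models, n_samples) holding each model's
               predictions on a held-out validation set.
    y_val:     array of shape (n_samples,) with the validation targets.
    Returns an array of shape (n_models,) with weights that sum to 1.
    """
    val_preds = np.asarray(val_preds, dtype=float)
    y_val = np.asarray(y_val, dtype=float)
    n_models = val_preds.shape[0]

    counts = np.zeros(n_models, dtype=int)   # how often each model was picked
    running_sum = np.zeros_like(y_val)       # sum of the picked predictions

    for step in range(1, n_iterations + 1):
        # Validation MSE of the ensemble if each candidate were added next.
        # MSE is an assumption; any validation metric could be used instead.
        candidate_means = (running_sum + val_preds) / step
        errors = np.mean((candidate_means - y_val) ** 2, axis=1)
        best = int(np.argmin(errors))
        counts[best] += 1
        running_sum += val_preds[best]

    return counts / counts.sum()


if __name__ == "__main__":
    y = np.array([0.0, 1.0, 2.0, 3.0])
    preds = np.stack([y + 1.0, y - 1.0, y + 5.0])  # three toy "configurations"
    weights = greedy_ensemble_selection(preds, y, n_iterations=4)
    print(weights)
```

Because the weights are fitted on validation predictions, a model that overfits the validation set can receive more weight than its test performance justifies, which is the effect the abstract reports for some deep learning models in cross-model ensembles.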

Paper Structure

This paper contains 45 sections, 2 equations, 23 figures, and 19 tables.

Figures (23)

  • Figure 1: TabArena-v0.1 Leaderboard. We evaluate models under default parameters, tuning, and weighted ensembling (Caruana et al., 2004) of hyperparameters. Since TabICL and TabPFNv2 are not applicable to all datasets, we evaluate them on subsets of the benchmark in Figure 4.
  • Figure 2: Data Curation Results. The figure shows why and how many datasets we filter based on our criteria. We filter datasets that are duplicates, not from a tabular domain, not a real predictive task, tiny, have quality or license issues, and are not IID.
  • Figure 3: Characteristics of Datasets in TabArena. On the left, we show the number of datasets per task type, license, source of the dataset, and age group. On the right, we show the number of features (columns) and samples (rows), as well as the percentage of categorical features per dataset.
  • Figure 4: Leaderboard for TabPFNv2-compatible (left) and TabICL-compatible (right) datasets. For TabPFNv2, we obtain $33$ datasets ($\leq 10$K training samples, $\leq 500$ features). For TabICL, we obtain $36$ classification datasets ($\leq 100$K, $\leq 500$). Everything but the datasets is identical to Figure 1.
  • Figure 5: (Left) Pareto front of improvability and inference time. We report the median inference time per 1000 samples across all datasets. (Right) Improvability tuning trajectories. Time is shown as the tuning time with points from left to right marking ensembles of increasing numbers of random configurations (1, 2, 5, 10, 25, 50, 100, 150, 201). The trajectories are sampled 20 times from all trials and averaged. The right-most highlighted points use all configurations.
  • ...and 18 more figures
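
The compatibility constraints in the Figure 4 caption can be sketched as simple predicates over dataset metadata. This is a hedged illustration only: the `DatasetInfo` class, its field names, and the example datasets are hypothetical, and reading the TabICL limits as "training samples" and "features" is an assumption based on the parallel wording used for TabPFNv2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical metadata record; not a TabArena data structure."""
    name: str
    n_train_samples: int
    n_features: int
    is_classification: bool


def tabpfnv2_compatible(d: DatasetInfo) -> bool:
    # Caption: at most 10K training samples and at most 500 features.
    return d.n_train_samples <= 10_000 and d.n_features <= 500


def tabicl_compatible(d: DatasetInfo) -> bool:
    # Caption: classification datasets with limits of 100K and 500
    # (assumed here to mean training samples and features).
    return (
        d.is_classification
        and d.n_train_samples <= 100_000
        and d.n_features <= 500
    )


if __name__ == "__main__":
    # Made-up examples, not datasets from the benchmark.
    examples = [
        DatasetInfo("small_clf", 5_000, 20, True),
        DatasetInfo("medium_clf", 50_000, 100, True),
        DatasetInfo("small_reg", 8_000, 30, False),
        DatasetInfo("wide_clf", 2_000, 900, True),
    ]
    print([d.name for d in examples if tabpfnv2_compatible(d)])
    print([d.name for d in examples if tabicl_compatible(d)])
```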