TabArena: A Living Benchmark for Machine Learning on Tabular Data
Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter
TL;DR
TabArena introduces a living benchmark for tabular IID data. It addresses the stagnation of static benchmarks through continuous curation protocols, a public leaderboard, and a team of maintainers. The platform is initialized with 51 carefully selected datasets and 16 models (including tabular foundation models), evaluated with robust designs that emphasize cross-validation ensembles and hyperparameter exploration. Key findings: post-hoc ensembling drives peak performance, deep learning methods catch up with gradient-boosted trees under larger time budgets with ensembling, and tabular foundation models excel on smaller datasets. Ensembles across models push the state of the art further. The work also discusses data curation challenges, reproducibility, and the societal and environmental trade-offs of large-scale benchmarking, laying the groundwork for a scalable, open, and responsible living benchmark ecosystem.
Abstract
With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is greater than ever. However, current benchmarks are static: their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of the validation method and of ensembling hyperparameter configurations on benchmarking models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state of the art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.
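To make "ensembling of hyperparameter configurations" concrete, the following is a minimal sketch of one common post-hoc ensembling scheme: greedy forward selection with replacement over the validation predictions of already-trained configurations. The abstract does not state which ensembling algorithm TabArena uses, so the choice of algorithm, the squared-error metric, and all names here (`mse`, `greedy_ensemble_weights`, `val_preds`, `y_val`, `n_iters`) are illustrative assumptions, not the benchmark's implementation.

```python
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def greedy_ensemble_weights(
    val_preds: List[List[float]],  # val_preds[i][j]: configuration i, validation row j
    y_val: List[float],
    n_iters: int = 10,
) -> List[float]:
    """Hypothetical helper: greedy forward selection with replacement.

    At each step, add the configuration whose inclusion gives the lowest
    validation error of the averaged ensemble. The returned weights are the
    selection frequencies and sum to 1.
    """
    n_models, n_rows = len(val_preds), len(y_val)
    counts = [0] * n_models
    running_sum = [0.0] * n_rows  # sum of predictions of all members picked so far
    for step in range(1, n_iters + 1):
        best_idx, best_err = 0, float("inf")
        for i, preds in enumerate(val_preds):
            # Ensemble prediction if configuration i were added at this step.
            candidate = [(s + p) / step for s, p in zip(running_sum, preds)]
            err = mse(candidate, y_val)
            if err < best_err:  # ties keep the earlier configuration
                best_idx, best_err = i, err
        counts[best_idx] += 1
        running_sum = [s + p for s, p in zip(running_sum, val_preds[best_idx])]
    return [c / n_iters for c in counts]
```

Because the weights are fitted on validation predictions, a configuration that overfits the validation set can receive more weight than its test performance justifies, which is the kind of overrepresentation the abstract warns about.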
