A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets
Assaf Shmuel, Oren Glickman, Teddy Lazebnik
TL;DR
The paper asks when deep learning (DL) can beat traditional ML on tabular data, and answers with a large-scale benchmark across 111 diverse datasets and 20 model configurations (DL, tree ensembles, and classical ML). It finds that tree-based ensembles generally outperform DL on tabular tasks, though DL does better in specific regimes, notably small datasets with high kurtosis. A meta-learning approach predicts whether DL will have the advantage with 86.1% accuracy (AUC 0.78) and yields interpretable rules via logistic and symbolic regression. The work provides actionable guidance for model selection on tabular data and contributes a set of dataset meta-features and results that complement prior benchmarks. Overall, the findings underscore that DL is not a one-size-fits-all solution for tabular datasets and that meta-learned guidance can inform practical modeling choices.
Abstract
The analysis of tabular datasets is highly prevalent both in scientific research and real-world applications of Machine Learning (ML). Unlike many other ML tasks, Deep Learning (DL) models often do not outperform traditional methods in this area. Previous comparative benchmarks have shown that DL performance is frequently equivalent or even inferior to models such as Gradient Boosting Machines (GBMs). In this study, we introduce a comprehensive benchmark aimed at better characterizing the types of datasets where DL models excel. Although several important benchmarks for tabular datasets already exist, our contribution lies in the variety and depth of our comparison: we evaluate 20 different models on 111 datasets, covering both regression and classification tasks. The datasets vary in scale, and some contain categorical variables while others do not. Importantly, our benchmark contains a sufficient number of datasets where DL models perform best, allowing for a thorough analysis of the conditions under which DL models excel. Building on the results of this benchmark, we train a model that predicts scenarios where DL models outperform alternative methods with 86.1% accuracy (AUC 0.78). We present insights derived from this characterization and compare these findings to previous benchmarks.
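
To make the meta-learning idea concrete, the following is a minimal sketch of how dataset-level meta-features such as size and kurtosis could feed a logistic model that scores the chance that DL wins. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice of three meta-features, and the weights in the usage note are hypothetical, and the benchmark's actual meta-features and fitted coefficients are not given in this section.

```python
import math
from statistics import fmean


def excess_kurtosis(values):
    """Population excess kurtosis of one numeric column (0.0 for a normal distribution)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        return 0.0  # constant column: treated as zero by convention (an assumption)
    return m4 / (m2 ** 2) - 3.0


def meta_features(columns):
    """Summarize a dataset given as a list of equal-length numeric columns.

    The three features below are illustrative stand-ins, not the paper's full set.
    """
    n_rows = len(columns[0])
    return {
        "log_n_rows": math.log10(n_rows),
        "n_columns": len(columns),
        "mean_kurtosis": fmean(excess_kurtosis(c) for c in columns),
    }


def dl_advantage_probability(features, weights, bias):
    """Logistic model: sigmoid(bias + sum of weight * feature)."""
    z = bias + sum(weights[name] * features[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-z))
```

A usage example with placeholder weights, chosen only so that their signs match the reported direction (smaller datasets and higher kurtosis favoring DL); in practice the weights would be fitted on the benchmark results:

```python
features = meta_features([[1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 0.0, 0.0, 0.0, 10.0]])
hypothetical_weights = {"log_n_rows": -0.5, "mean_kurtosis": 0.8}  # NOT from the paper
score = dl_advantage_probability(features, hypothetical_weights, bias=0.0)
```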
