A Comprehensive Benchmark of Machine and Deep Learning Across Diverse Tabular Datasets

Assaf Shmuel; Oren Glickman; Teddy Lazebnik

TL;DR

The paper asks when deep learning (DL) can beat traditional ML on tabular data and answers with a large-scale benchmark of 20 model configurations (DL, tree ensembles, and classical ML) across 111 diverse datasets. Tree-based ensembles generally outperform DL on tabular tasks, though DL gains appear in specific regimes, notably small datasets with high kurtosis. A meta-learning model predicts whether DL will outperform the alternatives with 86.1% accuracy (AUC 0.78) and yields interpretable rules via logistic and symbolic regression. The work provides actionable guidance for model selection on tabular data and contributes a rich set of dataset meta-features and results that complement prior benchmarks. Overall, the findings underscore that DL is not a one-size-fits-all solution for tabular datasets and that meta-learned guidance can inform practical modeling choices.

Abstract

The analysis of tabular datasets is highly prevalent both in scientific research and real-world applications of Machine Learning (ML). Unlike many other ML tasks, Deep Learning (DL) models often do not outperform traditional methods in this area. Previous comparative benchmarks have shown that DL performance is frequently equivalent or even inferior to models such as Gradient Boosting Machines (GBMs). In this study, we introduce a comprehensive benchmark aimed at better characterizing the types of datasets where DL models excel. Although several important benchmarks for tabular datasets already exist, our contribution lies in the variety and depth of our comparison: we evaluate 111 datasets with 20 different models, including both regression and classification tasks. These datasets vary in scale and include both those with and without categorical variables. Importantly, our benchmark contains a sufficient number of datasets where DL models perform best, allowing for a thorough analysis of the conditions under which DL models excel. Building on the results of this benchmark, we train a model that predicts scenarios where DL models outperform alternative methods with 86.1% accuracy (AUC 0.78). We present insights derived from this characterization and compare these findings to previous benchmarks.
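The dataset-level meta-features named in this summary (row count, column count, kurtosis of the features, and the number of PCA components needed to retain 99% of the variance) can be illustrated with a minimal sketch. This is not the authors' code: the function name `meta_features` is hypothetical, and it assumes a purely numeric matrix with at least two rows and at least one non-constant column. It also assumes that "X-kurtosis" means the excess kurtosis averaged over columns, which the text here does not confirm.

```python
import numpy as np


def meta_features(X):
    """Illustrative meta-features for a numeric matrix X (rows = samples).

    Hypothetical helper: the definitions below are assumptions, not the
    benchmark's exact feature definitions.
    """
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape

    # Standardize each column; constant columns keep std = 1 to avoid 0/0.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std

    # Excess kurtosis per column (0 for a normal distribution), then averaged.
    mean_kurtosis = float(((Z ** 4).mean(axis=0) - 3.0).mean())

    # PCA spectrum: eigenvalues of the covariance matrix, largest first.
    cov = np.atleast_2d(np.cov(Z, rowvar=False))
    eigvals = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    explained = np.cumsum(eigvals) / eigvals.sum()

    # Smallest number of components whose cumulative variance reaches 99%.
    n_pca_99 = min(int(np.searchsorted(explained, 0.99)) + 1, n_cols)

    return {
        "n_rows": n_rows,
        "n_cols": n_cols,
        "mean_kurtosis": mean_kurtosis,
        "n_pca_99": n_pca_99,
    }


# Toy usage: two perfectly correlated columns collapse to one component.
toy = np.column_stack([np.arange(10.0), 2 * np.arange(10.0)])
features = meta_features(toy)
```

Features of this kind could then feed a classifier, such as the logistic regression mentioned in the TL;DR, that predicts whether a DL model is likely to win on a given dataset.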

Paper Structure (15 sections, 3 equations, 7 figures, 25 tables)

Introduction
Experimental setup
Datasets
Machine learning and deep learning models
Evaluation strategy
Meta-analysis profiling
Results
Model Ranking
Meta-Analysis Profiling
Discussion
Supplementary Material
Description of models
Meta-learning features
Additional results
Computer resources

Figures (7)

Figure 1: Critical difference diagram for regression tasks based on RMSE (lower is better). AutoGluon is the best-performing model.
Figure 2: Critical difference diagram for classification tasks based on accuracy (higher is better). AutoGluon is the best-performing model.
Figure 3: The effect of various factors on the probability that DL outperforms ML. The heatmaps show the predictions of the logistic regression models; the scatter points show the observed datasets. Panels: (a) number of columns and number of rows; (b) numerical and categorical feature counts; (c) X-kurtosis and row count; (d) number of PCA components needed to retain 99% of the variance and number of rows.
Figure 4: Critical difference diagram for regression tasks based on MAE (lower is better). AutoGluon is the best-performing model.
Figure 5: Critical difference diagram for regression tasks based on $R^2$ (higher is better). AutoGluon is the best-performing model.
...and 2 more figures
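
Critical difference diagrams like those above conventionally place each model at its average rank across datasets. The sketch below shows that ranking step only, under stated assumptions: the function name `average_ranks` and the toy scores are hypothetical, tied models share the mean of the ranks they occupy, and the statistical test that determines the critical difference is omitted because its details are not given here.

```python
def average_ranks(scores, higher_is_better=True):
    """Average rank of each model across datasets (rank 1 = best).

    scores: dict mapping model name -> list of per-dataset scores,
            all lists having the same length.
    Tied models share the mean of the ranks they jointly occupy.
    """
    models = list(scores)
    n_datasets = len(scores[models[0]])
    totals = {m: 0.0 for m in models}

    for d in range(n_datasets):
        column = [scores[m][d] for m in models]
        for m in models:
            v = scores[m][d]
            if higher_is_better:
                n_better = sum(1 for o in column if o > v)
            else:
                n_better = sum(1 for o in column if o < v)
            n_tied = sum(1 for o in column if o == v)  # counts the model itself
            totals[m] += n_better + (n_tied + 1) / 2

    return {m: totals[m] / n_datasets for m in models}


# Hypothetical accuracy scores for three models on three datasets.
toy_accuracy = {
    "model_a": [0.9, 0.8, 0.7],
    "model_b": [0.8, 0.8, 0.6],
    "model_c": [0.7, 0.9, 0.5],
}
ranks = average_ranks(toy_accuracy, higher_is_better=True)
```

For error metrics such as RMSE or MAE, the same function would be called with `higher_is_better=False`.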
