A Closer Look at Deep Learning Methods on Tabular Datasets

Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, De-Chuan Zhan

TL;DR

The paper presents TALENT, a large-scale, up-to-date benchmark for deep tabular learning that uses 300+ datasets to compare classical, deep, and pretrained foundation models. It shows that top performance concentrates in a small shortlist of methods: pretrained models (notably TabPFN v2 and TabICL) often lead, tree ensembles remain robust baselines, and ensembling improves both tree-based and neural approaches. A dynamics-aware, heterogeneity-focused analysis links meta-features to training trajectories and indicates that feature diversity and the mix of numerical and categorical attributes largely shape which methods are favored. The two-level design, comprising TALENT-tiny and TALENT-extension, enables fast, reproducible evaluation as well as stress testing in high-dimensional, many-class, and large-scale regimes, yielding actionable guidance for selecting and improving deep tabular learning systems.

Abstract

Tabular data is prevalent across diverse domains in machine learning. With the rapid progress of deep tabular prediction methods, especially pretrained (foundation) models, there is a growing need to evaluate these methods systematically and to understand their behavior. We present an extensive study on TALENT, a collection of 300+ datasets spanning broad ranges of size, feature composition (numerical/categorical mixes), domains, and output types (binary, multi-class, regression). Our evaluation shows that ensembling benefits both tree-based and neural approaches. Traditional gradient-boosted trees remain very strong baselines, yet recent pretrained tabular models now match or surpass them on many tasks, narrowing, but not eliminating, the historical advantage of tree ensembles. Despite architectural diversity, top performance concentrates within a small subset of models, providing practical guidance for method selection. To explain these outcomes, we quantify dataset heterogeneity by learning from meta-features and early training dynamics to predict later validation behavior. This dynamics-aware analysis indicates that heterogeneity, such as the interplay of categorical and numerical attributes, largely determines which family of methods is favored. Finally, we introduce a two-level design beyond the 300 common-size datasets: a compact TALENT-tiny core (45 datasets) for rapid, reproducible evaluation, and a TALENT-extension suite targeting high-dimensional, many-class, and very large-scale settings for stress testing. In summary, these results offer actionable insights into the strengths, limitations, and future directions for improving deep tabular learning.

Paper Structure

This paper contains 47 sections, 6 equations, 21 figures, and 5 tables.

Figures (21)

  • Figure 1: Performance–efficiency–size comparison of representative tabular methods on Talent for (a) binary classification, (b) multi-class classification, (c) regression, and (d) all tasks. The performance is measured by the average rank of all methods (lower is better). The efficiency is measured by the average training time in seconds (lower is better). The model size is measured based on the average size of all models (the larger the radius, the larger the model).
  • Figure 2: Advantages of the proposed benchmark. (a) shows the number of datasets for each of the three tabular prediction tasks. (b) shows the histogram of datasets across various domains, as well as the types of attributes. (c) shows how the number of datasets varies with dataset size ($N\times d$). (d) shows the histogram of the number of categorical features in datasets with categorical features. (e) shows the histogram of the imbalance rate for classification datasets. (f) shows the histogram of the number of classes for multi-class classification datasets.
  • Figure 3: Critical difference diagram of all methods, computed with the Wilcoxon-Holm test at a significance level of 0.05. The lower the rank value, the better the performance.
  • Figure 4: Box plots of the relative performance improvements of tabular methods over the MLP baseline across binary classification, multi-class classification, and regression tasks. The relative improvement is calculated for each dataset, where larger values indicate stronger performance relative to the MLP baseline. The box plots show the median, interquartile range (IQR), and outliers for each method. Methods with narrower IQRs are more stable across datasets, while wider distributions indicate more variable performance.
  • Figure 5: PAMA (Probability of Achieving the Best Accuracy) of various methods in binary classification (a), multi-class classification (b), regression (c), and all tasks (d). Each bar segment denotes a tabular method, and its width is the percentage of datasets on which that method achieves the best performance for the given task type. The wider the segment, the more often the method achieves the best performance on that task type.
  • ...and 16 more figures
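
The aggregate measures named in the captions above (average rank in Figure 1, relative improvement over the MLP baseline in Figure 4, and PAMA in Figure 5) can be illustrated with a minimal sketch. This is not the authors' code. The score table, the method names, the tie handling (tied methods share the average rank, and all tied winners count as best), and the normalization of the relative improvement by the baseline score are assumptions made for illustration; the paper may define these details differently.

```python
# Toy score table: rows are datasets, columns are methods, higher is better.
# All numbers and method names are invented for illustration only;
# they are NOT results from the paper.
METHODS = ["MLP", "GBDT", "PretrainedModel"]
SCORES = [
    [0.80, 0.85, 0.90],
    [0.70, 0.75, 0.72],
    [0.60, 0.66, 0.66],
    [0.90, 0.88, 0.93],
]


def ranks_with_ties(row):
    """Rank one dataset's scores (1 = best); tied methods share the average rank."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        k = i
        # Extend k over the run of methods tied with the one at position i.
        while k + 1 < len(order) and row[order[k + 1]] == row[order[i]]:
            k += 1
        shared = (i + k) / 2 + 1  # mean of the 1-based positions i+1 .. k+1
        for pos in range(i, k + 1):
            ranks[order[pos]] = shared
        i = k + 1
    return ranks


def average_rank(scores):
    """Mean rank of each method over all datasets (lower is better)."""
    per_dataset = [ranks_with_ties(row) for row in scores]
    n_methods = len(scores[0])
    return [sum(r[j] for r in per_dataset) / len(scores) for j in range(n_methods)]


def pama(scores):
    """Fraction of datasets on which each method attains the best score.

    Assumption: every method tied for the best score counts as a winner,
    so the fractions can sum to more than 1.
    """
    n_methods = len(scores[0])
    return [
        sum(1 for row in scores if row[j] == max(row)) / len(scores)
        for j in range(n_methods)
    ]


def relative_improvement(scores, baseline=0):
    """Per-dataset improvement of each method relative to the baseline column.

    Assumption: improvement is (score - baseline) / baseline for a
    higher-is-better metric with a nonzero baseline score.
    """
    return [[(v - row[baseline]) / row[baseline] for v in row] for row in scores]


if __name__ == "__main__":
    for name, rank, win in zip(METHODS, average_rank(SCORES), pama(SCORES)):
        print(f"{name}: average rank = {rank}, PAMA = {win}")
```

Average rank rewards methods that are consistently near the top, while PAMA only counts outright wins, so a method can have a good average rank and still a low PAMA. Reading the two together, as the figures do, separates "reliably strong" from "most often best".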