TabReD: Analyzing Pitfalls and Filling the Gaps in Tabular Deep Learning Benchmarks

Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, Artem Babenko

TL;DR

This work analyzes existing tabular benchmarks and introduces TabReD, a collection of eight industry-grade tabular datasets. It reassesses a large number of tabular ML models and techniques on TabReD and demonstrates that evaluation on time-based data splits leads to a different ranking of methods than evaluation on random splits, which are common in current benchmarks.

Abstract

Advances in machine learning research drive progress in real-world applications. To ensure this progress, it is important to understand the potential pitfalls on the way from a novel method's success on academic benchmarks to its practical deployment. In this work, we analyze existing tabular benchmarks and find two common characteristics of tabular data in typical industrial applications that are underrepresented in the datasets usually used for evaluation in the literature. First, in real-world deployment scenarios, the distribution of data often changes over time. To account for this distribution drift, time-based train/test splits should be used in evaluation. However, popular tabular datasets often lack the timestamp metadata needed for such evaluation. Second, a considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines. This can affect the absolute and relative numbers of predictive, uninformative, and correlated features compared to academic datasets. In this work, we aim to understand how recent research advances in tabular deep learning transfer to these underrepresented conditions. To this end, we introduce TabReD, a collection of eight industry-grade tabular datasets. We reassess a large number of tabular ML models and techniques on TabReD. We demonstrate that evaluation on time-based data splits leads to a different ranking of methods than evaluation on random splits, which are common in current benchmarks. Furthermore, simple MLP-like architectures and GBDT show the best results on the TabReD datasets, while other methods are less effective in the new setting.
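The difference between the two evaluation protocols mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the helper names (`time_based_split`, `random_split`), the `ts` field, the 20% test fraction, and the toy data are all assumptions made for illustration.

```python
import random
from datetime import date, timedelta


def time_based_split(rows, timestamp_key, test_fraction=0.2):
    """Train on the earliest rows and test on the most recent ones."""
    ordered = sorted(rows, key=lambda r: r[timestamp_key])
    n_test = round(len(ordered) * test_fraction)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows before splitting, ignoring timestamps entirely."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data: one row per day.
start = date(2024, 1, 1)
rows = [{"ts": start + timedelta(days=i), "x": i} for i in range(10)]

train_t, test_t = time_based_split(rows, "ts")
train_r, test_r = random_split(rows)

# With the time-based split, every test row is strictly later than every
# training row, so the test set reflects any drift that occurred over time.
latest_train = max(r["ts"] for r in train_t)
earliest_test = min(r["ts"] for r in test_t)
```

With a random split, test rows are interleaved in time with training rows, so the evaluation does not expose the model to the temporal drift it would face after deployment.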

Paper Structure (25 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Comparison of tabular DL algorithmic improvements on TabReD and on a popular benchmark. We plot the average relative percentage improvement over the MLP baseline on both benchmarks. Ensembling and numerical embeddings transfer successfully to TabReD. However, the success of retrieval-based models and improved training methods does not translate to the proposed setting.
  • Figure 2: Comparison of performance on out-of-time and random in-domain test sets. The first row contains regression datasets, where the metric is RMSE (lower is better). The second row contains binary classification datasets, where the metric is AUC-ROC (higher is better). In addition to the overall performance drop, relative ranks and performance differences change. In particular, XGBoost's lead decreases when performance is compared on task-appropriate time-shifted test sets.
  • Figure 3: Relationship between distribution shift and performance on a subset of TabReD datasets. We use the standard deviation of predictions across an ensemble of MLPs as a proxy for the distribution shift. On the right side, we show errors (MAE for regression and error rate for binary classification).
  • Figure 4: Feature correlations and importance via mutual information with target. On the left are datasets from tabr, on the right are TabReD datasets. Datasets on the right are clearly more complex in terms of number of features and their correlation and importance patterns. The only comparably complex dataset on the left is Microsoft.
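The shift proxy described in the Figure 3 caption (the standard deviation across an ensemble's predictions) can be sketched as follows. This is an illustrative reading of the caption, not the authors' implementation: the function name, the use of the population standard deviation, and the toy predictions are assumptions.

```python
from statistics import pstdev


def ensemble_disagreement(predictions_per_model):
    """Return the per-sample standard deviation across ensemble members.

    predictions_per_model: list of lists, one inner list per model,
    each holding that model's prediction for every sample.
    """
    # zip(*...) regroups the predictions so that each tuple holds
    # all models' predictions for a single sample.
    return [pstdev(sample_preds) for sample_preds in zip(*predictions_per_model)]


# Hypothetical predictions from three models on three samples.
preds = [
    [0.10, 0.50, 0.90],
    [0.12, 0.40, 0.20],
    [0.11, 0.60, 0.55],
]
scores = ensemble_disagreement(preds)

# A larger score means the ensemble members disagree more on that sample,
# which the caption treats as a sign of stronger distribution shift.
most_uncertain = max(range(len(scores)), key=scores.__getitem__)
```

Under this reading, samples on which the ensemble members disagree most are the ones assumed to lie farthest from the training distribution.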