MultiTab: A Comprehensive Benchmark Suite for Multi-Dimensional Evaluation in Tabular Domains

Kyungeun Lee, Moonjung Eo, Hye-Seung Cho, Dongmin Kim, Ye Seul Sim, Seoyoon Kim, Min-Kook Suh, Woohyung Lim

TL;DR

MultiTab addresses the need for regime-aware evaluation in tabular learning by introducing a data-centric benchmark suite that evaluates 13 diverse models on 196 datasets. By organizing datasets along seven data-statistic axes and optimizing models under consistent cross-validation, it reveals that there is no universal winner and that model inductive biases must align with specific data regimes. The empirical findings show when inter-feature versus inter-sample biases help, and how dataset properties like sample size, feature correlations, and label imbalance modulate performance. The framework enables principled model design and practical guidance for data-aware model selection, with full benchmark artifacts publicly available to support reproducibility and future extensions.

Abstract

Despite the widespread use of tabular data in real-world applications, most benchmarks rely on average-case metrics, which fail to reveal how model behavior varies across diverse data regimes. To address this, we propose MultiTab, a benchmark suite and evaluation framework for multi-dimensional, data-aware analysis of tabular learning algorithms. Rather than comparing models only in aggregate, MultiTab categorizes 196 publicly available datasets along key data characteristics, including sample size, label imbalance, and feature interaction, and evaluates 13 representative models spanning a range of inductive biases. Our analysis shows that model performance is highly sensitive to such regimes: for example, models using sample-level similarity excel on datasets with large sample sizes or high inter-feature correlation, while models encoding inter-feature dependencies perform best with weakly correlated features. These findings reveal that inductive biases do not always behave as intended, and that regime-aware evaluation is essential for understanding and improving model behavior. MultiTab enables more principled model design and offers practical guidance for selecting models tailored to specific data characteristics. All datasets, code, and optimization logs are publicly available at https://huggingface.co/datasets/LGAI-DILab/Multitab.

Paper Structure

This paper contains 78 sections, 10 equations, 14 figures, 26 tables.

Figures (14)

  • Figure 1: Average normalized predictive error across 24 sub-categories for four model classes: GBDTs, NN-Simple, NN-Sample, and NN-Feature. Lower values indicate better performance. Error bars represent 95% confidence intervals. Unlike overall averages, model rankings vary substantially across different data regimes, highlighting the importance of conditional evaluation.
  • Figure 2: Deviation from overall average error across 24 sub-categories. Each cell shows the difference between a model’s average error in a given sub-category and its overall mean across all datasets. Because the quantity is an error, red (positive) indicates worse-than-average performance and blue (negative) indicates better-than-average. This highlights model-specific strengths and weaknesses relative to their overall behavior.
  • Figure 3: Spearman correlations between dataset statistics and model error. Only statistically significant correlations ($p < 0.05$) are shown; blank cells denote non-significance.
  • Figure 4: Distribution of datasets in our benchmark suite. Each point represents a dataset, plotted by sample size (x-axis) and feature dimensionality (y-axis), both on logarithmic scale. Colors indicate task types: binary classification, multiclass classification, and regression. The benchmark ensures broad coverage across different data scales and tasks.
  • Figure 5: Detailed histograms for each sub-category
  • ...and 9 more figures
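
The deviation shown in Figure 2 can be illustrated with a minimal sketch. It follows the caption's definition (a model's mean error within a sub-category minus its mean error over all datasets), but the function name, the data layout, and the toy numbers are assumptions made for illustration and are not taken from the MultiTab code or results. A negative value means the model's error in that sub-category is below its own overall mean.

```python
from collections import defaultdict
from statistics import mean


def deviation_from_overall(errors, subcategories_of):
    """Per-model deviation of sub-category mean error from the overall mean.

    errors: {model: {dataset: error}}            (hypothetical layout)
    subcategories_of: {dataset: [sub-category]}  (hypothetical layout)
    """
    result = {}
    for model, per_dataset in errors.items():
        # Overall mean error of this model across all datasets.
        overall = mean(per_dataset.values())
        # Collect this model's errors per sub-category.
        grouped = defaultdict(list)
        for dataset, err in per_dataset.items():
            for sub in subcategories_of[dataset]:
                grouped[sub].append(err)
        # Deviation = sub-category mean minus overall mean.
        result[model] = {sub: mean(v) - overall for sub, v in grouped.items()}
    return result


# Toy inputs, invented for illustration only.
errors = {
    "gbdt": {"d1": 0.2, "d2": 0.4, "d3": 0.6},
    "nn_sample": {"d1": 0.5, "d2": 0.3, "d3": 0.1},
}
subcategories_of = {"d1": ["small_n"], "d2": ["small_n"], "d3": ["large_n"]}

deviations = deviation_from_overall(errors, subcategories_of)
```

In this toy setup the two models have opposite signs in each sub-category, which is the kind of pattern an overall average hides.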