Benchmarking Distribution Shift in Tabular Data with TableShift

Josh Gardner, Zoran Popovic, Ludwig Schmidt

TL;DR

TableShift addresses the lack of benchmarks for distribution shift in tabular data by introducing 15 real-world binary classification tasks, each with an associated shift, and a public Python API for standardized benchmarking. A large-scale evaluation of 19 model families, including baselines, tabular neural networks, and robustness and domain-generalization methods, reveals a strong linear relationship between in-distribution (ID) and out-of-distribution (OOD) accuracy ($\rho=0.81$) and a clear link between shift gaps and changes in the label distribution ($\rho=0.71$). ID accuracy and label shift together explain more than 99% of the variance in OOD accuracy ($R^2=0.993$). The results show that no model consistently outperforms the baselines, that robustness methods can shrink shift gaps only at the cost of ID accuracy, and that label shift robustness does not reliably close gaps, all of which matter when deploying tabular models under distribution shift. The benchmark and API enable reproducible, large-scale studies of tabular shift robustness and point to two avenues for improvement: higher in-distribution accuracy and better handling of label shift.

Abstract

Robustness to distribution shift has become a growing concern for text and image models as they transition from research subjects to deployment in the real world. However, high-quality benchmarks for distribution shift in tabular machine learning tasks are still lacking despite the widespread real-world use of tabular data and differences in the models used for tabular data in comparison to text and images. As a consequence, the robustness of tabular models to distribution shift is poorly understood. To address this issue, we introduce TableShift, a distribution shift benchmark for tabular data. TableShift contains 15 binary classification tasks in total, each with an associated shift, and includes a diverse set of data sources, prediction targets, and distribution shifts. The benchmark covers domains including finance, education, public policy, healthcare, and civic participation, and is accessible using only a few lines of Python code via the TableShift API. We conduct a large-scale study comparing several state-of-the-art tabular data models alongside robust learning and domain generalization methods on the benchmark tasks. Our study demonstrates (1) a linear trend between in-distribution (ID) and out-of-distribution (OOD) accuracy; (2) domain robustness methods can reduce shift gaps but at the cost of reduced ID accuracy; (3) a strong relationship between shift gap (difference between ID and OOD performance) and shifts in the label distribution. The benchmark data, Python package, model implementations, and more information about TableShift are available at https://github.com/mlfoundations/tableshift and https://tableshift.org .
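The two quantities the abstract leans on, the shift gap and the linear trend between ID and OOD accuracy, can be illustrated with a minimal sketch. The sign convention (ID minus OOD) is an assumption based on the phrase "difference between ID and OOD performance", the model names and accuracy values are hypothetical placeholders rather than results from the paper, and the helper functions are not part of the TableShift API.

```python
import math


def shift_gap(id_acc: float, ood_acc: float) -> float:
    """Shift gap, assumed here to be ID accuracy minus OOD accuracy."""
    return id_acc - ood_acc


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (ID accuracy, OOD accuracy) pairs, one per model.
# These numbers are made up for illustration only.
results = {
    "model_a": (0.90, 0.84),
    "model_b": (0.86, 0.81),
    "model_c": (0.80, 0.72),
    "model_d": (0.75, 0.70),
}

gaps = {name: shift_gap(i, o) for name, (i, o) in results.items()}
id_accs = [i for i, _ in results.values()]
ood_accs = [o for _, o in results.values()]
rho = pearson(id_accs, ood_accs)

for name, gap in gaps.items():
    print(f"{name}: shift gap = {gap:.3f}")
print(f"Pearson rho between ID and OOD accuracy: {rho:.3f}")
```

A model sitting on the $y=x$ line of an ID-versus-OOD scatter plot has a shift gap of zero under this convention; points below the line have a positive gap.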

Paper Structure (53 sections, 4 equations, 11 figures, 20 tables)

This paper contains 53 sections, 4 equations, 11 figures, 20 tables.

Figures (11)

  • Figure 1: In-domain (ID) and out-of-domain (OOD) accuracy show a linear trend across 15 TableShift tasks and 19 model types ($\rho=0.81$). ID accuracy ($x$-axis values) and change in the label distribution $\Delta_y$ (color) together explain 99% of the variance in OOD accuracy ($R^2=0.993$). For exact results see Section \ref{sec:detailed-results-appendix}.
  • Figure 2: Results for baselines, robust learning, and domain generalization models across the 15 TableShift benchmark tasks. The $y=x$ line indicates a model with no shift gap, $\Delta_{\textrm{Acc}}=0$ (see Equation \ref{eqn:domain-gap}). Clopper-Pearson confidence intervals at $\alpha = 0.05$ shown for all points. Note that domain generalization models are only used on domain generalization tasks (cf. Table \ref{tab:tasks-summary}). Results for the remaining TableShift tasks are shown in Figure \ref{fig:main-results-2}. For exact results see Section \ref{sec:detailed-results-appendix}.
  • Figure 3: Additional results (cf. Figure \ref{fig:main-results}). For exact results see Section \ref{sec:detailed-results-appendix}.
  • Figure 4: (a): Percentage of Maximum OOD Accuracy (PMA-OOD) across tasks (see Table \ref{tab:pma-ood} for exact values). *: domain generalization models and Group DRO can only be trained on the subset of 10 tasks with multiple training subdomains (see "Domain Generalization" column in Table \ref{tab:tasks-summary}). (b): Average ID and OOD accuracy by model across domain generalization tasks only. We only show domain generalization tasks in order to compare all models on the same set of tasks. See Figure \ref{fig:thumbnail-scatter} for results on all tasks. Exact values in Table \ref{tab:dg-results}.
  • Figure 5: Label shift ($\Delta_y$, measured via Equation \ref{eqn:delta-y}) and absolute shift gap $\Delta_{\textrm{Acc}}$ show moderate correlation across datasets and models (Pearson correlation $\rho = 0.70$). Exact $\Delta_y$ values in Table \ref{tab:shift-metrics}.
  • ...and 6 more figures
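
The Figure 2 caption reports Clopper-Pearson confidence intervals at $\alpha = 0.05$ for each accuracy estimate. As a rough illustration of what such an interval is, the sketch below computes the exact binomial interval with the standard library only, by bisecting on the binomial CDF. The function names and the example counts are hypothetical, and this is not necessarily how the paper's authors computed their intervals.

```python
import math


def _binom_pmf(k: int, n: int, p: float) -> float:
    """Binomial probability mass P(X = k), computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if k == 0 else 0.0
    if p >= 1.0:
        return 1.0 if k == n else 0.0
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log1p(-p)
    )
    return math.exp(log_pmf)


def _binom_cdf(k: int, n: int, p: float) -> float:
    """Binomial cumulative probability P(X <= k)."""
    return min(1.0, sum(_binom_pmf(i, n, p) for i in range(k + 1)))


def _solve_decreasing(f, target: float, iters: int = 100) -> float:
    """Bisection on [0, 1] for f(p) = target, where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # f is still too large, so the root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for a binomial proportion k / n."""
    # Lower bound solves P(X >= k | p) = alpha / 2, i.e. CDF(k - 1) = 1 - alpha / 2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: _binom_cdf(k - 1, n, p), 1 - alpha / 2
    )
    # Upper bound solves P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: _binom_cdf(k, n, p), alpha / 2
    )
    return lower, upper


# Hypothetical example: 850 correct predictions out of 1000 test examples.
low, high = clopper_pearson(850, 1000)
print(f"accuracy = {850 / 1000:.3f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

Because the interval is exact rather than based on a normal approximation, it stays within $[0, 1]$ and remains valid when accuracy is close to 0 or 1 or the test set is small.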