Unreflected Use of Tabular Data Repositories Can Undermine Research Quality

Andrej Tschalzev, Lennart Purucker, Stefan Lüdtke, Frank Hutter, Christian Bartelt, Heiner Stuckenschmidt

TL;DR

The paper analyzes how unreflected use of tabular datasets from data repositories can degrade research quality, focusing on OpenML. By examining two influential benchmarks, TabZilla-hard and Grinsztajn, it shows that inappropriate per-dataset validation, missing objective baselines, and insufficient task-specific preprocessing can distort conclusions. Re-evaluating with stronger baselines, 5-fold cross-validation, and careful preprocessing demonstrates that prior results may underestimate achievable performance and lead to misleading conclusions. To address these issues, the authors propose repository-level remedies, including default evaluation tasks with explicit validation and preprocessing guidelines and a robust per-dataset baseline, to enhance reproducibility and the credibility of tabular data research.

Abstract

Data repositories have accumulated a large number of tabular datasets from various domains. Machine Learning researchers are actively using these datasets to evaluate novel approaches. Consequently, data repositories have an important standing in tabular data research. They not only host datasets but also provide information on how to use them in supervised learning tasks. In this paper, we argue that, despite great achievements in usability, the unreflected usage of datasets from data repositories may have led to reduced research quality and scientific rigor. We present examples from prominent recent studies that illustrate the problematic use of datasets from OpenML, a large data repository for tabular data. Our illustrations help users of data repositories avoid falling into the traps of (1) using suboptimal model selection strategies, (2) overlooking strong baselines, and (3) inappropriate preprocessing. In response, we discuss possible solutions for how data repositories can prevent the inappropriate use of datasets and become the cornerstones for improved overall quality of empirical research studies.
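
To make the first trap concrete, the following is a minimal sketch of the two model selection protocols the paper contrasts: picking a configuration from a single holdout split versus from the mean score over 5 cross-validation folds. It is not the authors' code. The helper names (kfold_indices, holdout_select, cv_select), the toy 1-D dataset, and the k-nearest-neighbour scorer are all assumptions made for illustration, and only the Python standard library is used.

```python
import random
from statistics import mean

def kfold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k disjoint, near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def holdout_select(configs, n, score, val_frac=0.2, seed=0):
    """Pick the config with the best score on one validation split."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(n * val_frac))
    val, train = idx[:n_val], idx[n_val:]
    return max(configs, key=lambda c: score(c, train, val))

def cv_select(configs, n, score, k=5, seed=0):
    """Pick the config with the best mean score across k folds."""
    folds = kfold_indices(n, k, seed)

    def cv_score(config):
        scores = []
        for i, val in enumerate(folds):
            # Train on every fold except the one held out for validation.
            train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
            scores.append(score(config, train, val))
        return mean(scores)

    return max(configs, key=cv_score)

# Hypothetical toy data: one noisy feature, binary label.
rng = random.Random(42)
X = [rng.gauss(0, 1) for _ in range(60)]
y = [int(x + rng.gauss(0, 1) > 0) for x in X]

def knn_accuracy(k, train, val):
    """Validation accuracy of a k-nearest-neighbour vote (higher is better)."""
    correct = 0
    for i in val:
        nearest = sorted(train, key=lambda j: abs(X[j] - X[i]))[:k]
        pred = int(2 * sum(y[j] for j in nearest) > k)
        correct += pred == y[i]
    return correct / len(val)

candidates = [1, 3, 5]
print("holdout picks k =", holdout_select(candidates, len(X), knn_accuracy))
print("5CV picks k =", cv_select(candidates, len(X), knn_accuracy))
```

The scorer is assumed to return a value where higher is better; a loss such as logloss would need to be negated before being passed in. Because the holdout choice depends on a single small validation set, it can change with the split seed, whereas the cross-validated choice averages over every sample being used for validation once.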

Paper Structure

This paper contains 16 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: 5-fold cross-validation (5CV) consistently improves performance over holdout validation. We show the performance gains obtained by switching from holdout validation to 5CV for the TabZilla benchmark. Each model is compared to itself after 100 trials in both model selection protocols. For each dataset, the zero line corresponds to the model's performance with holdout selection. Positive values mean better performance with the 5CV protocol. The performance gains are capped at 0.08. The datasets are sorted by sample size in ascending order.
  • Figure 2: Holdout validation often causes model selection to miss the highest achievable performance. We show the cumulative density functions of test logloss performance over 100 trials for MLPs and XGBoost. One fold is displayed for each dataset. Stars with vertical lines denote the model selected based on the best validation performance. A position further to the right on the x-axis means better absolute logloss performance. Two models at the same x-axis value have equal performance. Steeper ascents along the y-axis represent dense regions where many trials achieve similar performance.
  • Figure 3: Target leaks can alter performance comparisons. For each dataset, the performance of the models before and after resolving target leaks in the original benchmark is displayed.
  • Figure 4: Simple yet effective preprocessing decisions can entirely change model comparisons. AUC performance is displayed for classification tasks, and R² for regression (seattlecrime & nyc-taxi). 'Raw' denotes the (already preprocessed) dataset version provided by the benchmark. 'Transformed' denotes a dataset after simple preprocessing. This includes treating ordinal features as categorical (electricity & seattle), adding the difference between two dates as a feature (nyc-taxi), adding fractions of features as new features (road-safety), and feature selection (guillermo). A horizontal line marks the highest achieved performance.
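
Three of the transformations named in the Figure 4 caption can be sketched in a few lines. This is an illustrative sketch rather than the paper's preprocessing code: the function names, the column names (pickup, dropoff, casualties, vehicles, day), and the choice of one-hot encoding as the way to treat an ordinal feature as categorical are all assumptions. Feature selection is not shown.

```python
from datetime import datetime

def add_duration_feature(row, start_key, end_key, out_key="duration_s"):
    """Return a copy of row with the gap between two ISO timestamps in seconds."""
    start = datetime.fromisoformat(row[start_key])
    end = datetime.fromisoformat(row[end_key])
    return {**row, out_key: (end - start).total_seconds()}

def add_ratio_feature(row, num_key, den_key, out_key):
    """Return a copy of row with num/den added; NaN when the denominator is 0."""
    den = row[den_key]
    ratio = row[num_key] / den if den else float("nan")
    return {**row, out_key: ratio}

def ordinal_as_categorical(rows, key):
    """One-hot encode an integer-coded column so no ordering is implied."""
    levels = sorted({r[key] for r in rows})
    encoded = []
    for r in rows:
        rest = {k: v for k, v in r.items() if k != key}
        flags = {f"{key}={lvl}": int(r[key] == lvl) for lvl in levels}
        encoded.append({**rest, **flags})
    return encoded

# Hypothetical rows; the column names are made up for this example.
trip = {"pickup": "2016-01-01T10:00:00", "dropoff": "2016-01-01T10:12:30"}
print(add_duration_feature(trip, "pickup", "dropoff"))

accident = {"casualties": 2, "vehicles": 4}
print(add_ratio_feature(accident, "casualties", "vehicles", "casualties_per_vehicle"))

readings = [{"day": 1, "load": 0.4}, {"day": 3, "load": 0.7}]
print(ordinal_as_categorical(readings, "day"))
```

Each helper returns new dictionaries instead of mutating its input, so the 'Raw' and 'Transformed' versions of a dataset can be kept side by side and evaluated under the same protocol.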