Benchmark Data Repositories for Better Benchmarking

Rachel Longjohn, Markelle Kelly, Sameer Singh, Padhraic Smyth

TL;DR

This paper analyzes the landscape of benchmark data repositories and identifies a set of considerations surrounding their design and use, with a focus on improving benchmarking practices in machine learning.

Abstract

In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for -- and levies criticisms at -- data and benchmarking practices in machine learning, comparatively less attention has been paid to the data repositories where these datasets are stored, documented, and shared. In this paper, we analyze the landscape of these $\textit{benchmark data repositories}$ and the role they can play in improving benchmarking. This role includes addressing issues with both datasets themselves (e.g., representational harms, construct validity) and the manner in which evaluation is carried out using such datasets (e.g., overemphasis on a few datasets and metrics, lack of reproducibility). To this end, we identify and discuss a set of considerations surrounding the design and use of benchmark data repositories, with a focus on improving benchmarking practices in machine learning.

Paper Structure

This paper contains 22 sections, 6 figures.

Figures (6)

  • Figure 1: Examples of DOIs and citations in repositories.
  • Figure 2: Examples of connecting datasets to papers in repositories.
  • Figure 3: The Statlog (German Credit Data) dataset [statlog_german_credit_data_144], hosted by the UCI ML Repository, is a sample of customer records from a German bank, with the task of classifying each individual as a good or bad credit risk. In the repository documentation, 8 categorical variables have their levels mixed up or incorrectly described (e.g., see attribute 15, the type of housing the debtor lives in, above). Groemping [groemping2019south] tracked down papers that describe the dataset's origins [haubetaler1979empirische, haubetaler1981methoden, fahrmeir1981kategoriale, fahrmeir1984multivariate] to construct a proper code table. She donated the corrected dataset as the South German Credit dataset in 2019 [groemping2019south], but the original dataset from 1994 has nonetheless been widely used in ML research.
  • Figure 4: The Iris dataset from the UCI ML Repository [misc_iris_53] is widely used for evaluating clustering and classification algorithms. Each observation corresponds to an iris flower and includes sepal and petal measurements and its species (one of three classes). After years of use, it was discovered that there were multiple widely publicized versions of this dataset, with differing measurements for certain observations. Consequently, the reported performances of classification models on Iris (across a large number of published papers) are not necessarily comparable [bezdek1999will].
  • Figure 5: The Papers with Code dataset page for the deprecated Tiny Images dataset.
  • ...and 1 more figure