OpenML Benchmarking Suites

Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, Joaquin Vanschoren

TL;DR

The paper argues for standardized, reproducible algorithm benchmarking through curated OpenML benchmarking suites and introduces a practical OpenML-CC18 classification benchmark. It provides a framework and tooling to create, retrieve, and run suite-based benchmarks, and demonstrates the approach with CC18 while surveying AutoML and other OpenML suites. Key contributions include formalizing benchmark suites, implementing a workflow for suite curation, and showing how large-scale, comparable results can be shared and reused to drive methodological progress. The work aims to enable dynamic, community-driven benchmarking that tracks progress over time and supports diverse research directions beyond single-dataset evaluations.

Abstract

Machine learning research depends on objectively interpretable, comparable, and reproducible algorithm benchmarks. We advocate the use of curated, comprehensive suites of machine learning tasks to standardize the setup, execution, and reporting of benchmarks. We enable this through software tools that help to create and leverage these benchmarking suites. These are seamlessly integrated into the OpenML platform, and accessible through interfaces in Python, Java, and R. OpenML benchmarking suites (a) are easy to use through standardized data formats, APIs, and client libraries; (b) come with extensive meta-information on the included datasets; and (c) allow benchmarks to be shared and reused in future studies. We then present a first, carefully curated and practical benchmarking suite for classification: the OpenML Curated Classification benchmarking suite 2018 (OpenML-CC18). Finally, we discuss use cases and applications which demonstrate the usefulness of OpenML benchmarking suites and the OpenML-CC18 in particular.

Paper Structure

This paper contains 21 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: OpenML website showing a list of benchmark studies on the left, and interactive exploration of the results of the AutoML Benchmark (see the paper's section on the AutoML Benchmark) on the right. Can be viewed online at https://www.openml.org/s/226.
  • Figure 2: Complete code examples, in different programming languages, of how any benchmarking suite (here the 'OpenML-CC18' suite) can be downloaded and used to evaluate a given algorithm. The Python code also creates a new benchmark study and shares all results. Uploading requires a (free) API key.
  • Figure 3: Distribution of the scores (average area under ROC curve, weighted by class support) of 3.8 million experiments with thousands of machine learning pipelines, shared on the CC18 benchmark tasks. Some tasks prove harder than others, some have wide score ranges, and for all there exist models that perform poorly (0.5 AUC). Code to reproduce this figure (for any metric) is available on GitHub (linked in a footnote of the paper).
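
The metric named in the Figure 3 caption, area under the ROC curve averaged over classes and weighted by class support, can be illustrated with a small sketch. This is not the code the authors link to; it is a minimal pure-Python reading of the caption, assuming a one-vs-rest treatment of each class and assuming that "class support" means the number of instances of that class. The function names `binary_auc` and `weighted_auc` and the input layout (one dict of class scores per instance) are hypothetical choices made for this example.

```python
from collections import Counter


def binary_auc(labels, scores):
    """AUC for one binary problem via pairwise comparison (Mann-Whitney U).

    labels: iterable of booleans (True = positive class).
    scores: iterable of floats, higher means "more likely positive".
    Tied positive/negative pairs count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def weighted_auc(y_true, proba):
    """One-vs-rest AUC per class, averaged with class-support weights.

    y_true: list of class labels.
    proba:  list of dicts mapping each class label to its predicted score
            (hypothetical layout chosen for this sketch).
    """
    support = Counter(y_true)  # number of instances per class
    total = len(y_true)
    result = 0.0
    for cls, count in support.items():
        labels = [y == cls for y in y_true]
        scores = [p[cls] for p in proba]
        result += (count / total) * binary_auc(labels, scores)
    return result


if __name__ == "__main__":
    y = ["a", "a", "a", "b"]
    # A model that outputs the same scores for every instance is uninformative.
    constant = [{"a": 0.5, "b": 0.5} for _ in y]
    print(weighted_auc(y, constant))
```

Under these assumptions, a model that assigns identical scores to every instance ties on every positive/negative pair, so each per-class AUC, and therefore the weighted average, is 0.5. This corresponds to the poorly performing models the caption mentions.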