A Position Paper on the Automatic Generation of Machine Learning Leaderboards

Roelien C Timmer, Yufang Hou, Stephen Wan

TL;DR

The paper addresses fragmentation in Automatic Leaderboard Generation (ALG) by surveying diverse methods and evaluation practices. It introduces a unified ALG conceptual framework and benchmarking guidelines to standardize problem definitions, document representations, and evaluation metrics. It argues for comprehensive leaderboards that include all reported results and richer metadata, enabling flexible filtering and fair comparisons. The work aims to accelerate rigorous, scalable ALG research and practice, with a living GitHub reading list to support ongoing developments.

Abstract

An important task in machine learning (ML) research is comparing prior work, which is often performed via ML leaderboards: tabular overviews of experiments with comparable conditions (e.g., same task, dataset, and metric). However, the growing volume of literature makes these leaderboards difficult to create and maintain. To ease this burden, researchers have developed methods to extract leaderboard entries from research papers for automated leaderboard curation. Yet, prior work varies in problem framing, complicating comparisons and limiting real-world applicability. In this position paper, we present the first overview of Automatic Leaderboard Generation (ALG) research, identifying fundamental differences in assumptions, scope, and output formats. We propose an ALG unified conceptual framework to standardise how the ALG task is defined. We offer ALG benchmarking guidelines, including recommendations for datasets and metrics that promote fair, reproducible evaluation. Lastly, we outline challenges and new directions for ALG, such as advocating for broader coverage by including all reported results and richer metadata.

Paper Structure

This paper contains 54 sections, 6 equations, 2 figures, 15 tables.

Figures (2)

  • Figure 1: An example of extracting $\langle$task, dataset, metric, method, score$\rangle$ tuples from research papers to build a leaderboard.
  • Figure 2: ALG Unified Conceptual Framework.
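To make the tuple representation in Figure 1 concrete, the following is a minimal sketch of how already-extracted ⟨task, dataset, metric, method, score⟩ tuples could be grouped into leaderboards. It is not the paper's method: the `Result` type, the `build_leaderboards` function, the `higher_is_better` flag, and all entries are hypothetical, and the sketch assumes the extraction step has already happened and that task, dataset, and metric names are normalised.

```python
from collections import defaultdict
from typing import NamedTuple


class Result(NamedTuple):
    """One extracted tuple (hypothetical type for illustration)."""
    task: str
    dataset: str
    metric: str
    method: str
    score: float


def build_leaderboards(results, higher_is_better=True):
    """Group results that share (task, dataset, metric) and rank by score.

    Assumption: one ranking direction applies to every metric, which is a
    simplification; error-style metrics would need the opposite order.
    """
    boards = defaultdict(list)
    for r in results:
        boards[(r.task, r.dataset, r.metric)].append((r.method, r.score))
    for rows in boards.values():
        rows.sort(key=lambda row: row[1], reverse=higher_is_better)
    return dict(boards)


# Made-up placeholder entries, not results from any paper.
results = [
    Result("task-A", "dataset-X", "accuracy", "method-1", 0.81),
    Result("task-A", "dataset-X", "accuracy", "method-2", 0.86),
    Result("task-A", "dataset-Y", "accuracy", "method-1", 0.74),
]
leaderboards = build_leaderboards(results)
```

Each key of `leaderboards` identifies one set of comparable conditions, so results on `dataset-X` and `dataset-Y` land in separate leaderboards even though they share a task and metric.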