On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards

Zhimin Zhao, Abdul Ali Bangash, Filipe Roseiro Côgo, Bram Adams, Ahmed E. Hassan

TL;DR

The paper investigates Leaderboard Operations (LBOps) for Foundation Model (FM) leaderboards, collecting up to $1{,}045$ leaderboards from five sources to uncover five workflow patterns and a domain model that formalizes key concepts. It introduces eight leaderboard smells that impair reliability, many affecting the ranking dataframe and evaluation records, and shows that the most prevalent pattern is External Evaluation Integration ($P_1$). By combining qualitative methods (card sorting, negotiated agreement) with operator validation, the authors propose practical implications, including a Leaderboard Bill of Materials (LBOM) and community forums, to enhance transparency, accountability, and long-term sustainability of FM leaderboards. The work provides data, a replication package, and tooling to support standardized LBOps practices and more trustworthy model comparisons in software engineering practice. Overall, it establishes LBOps as a discipline bridging SE and ML evaluation, with concrete guidance for operators, developers, and researchers.

Abstract

Foundation models (FMs), such as large language models (LLMs), are large-scale machine learning (ML) models that have demonstrated remarkable adaptability across downstream software engineering (SE) tasks such as code completion, code understanding, and software development. As a result, FM leaderboards have become essential tools for SE teams to compare and select the best third-party FMs for their specific products and purposes. However, the lack of standardized guidelines for FM evaluation and comparison threatens the transparency of FM leaderboards and limits stakeholders' ability to perform effective FM selection. As a first step towards addressing this challenge, our research focuses on understanding how these FM leaderboards operate in real-world scenarios ("leaderboard operations") and identifying potential pitfalls and areas for improvement ("leaderboard smells"). To this end, we collect up to 1,045 FM leaderboards from five different sources: GitHub, Hugging Face Spaces, Papers With Code, spreadsheets, and independent platforms. We examine their documentation and communicate directly with leaderboard operators to understand their workflows. Through card sorting and negotiated agreement, we identify five distinct workflow patterns and develop a domain model that captures the key components and their interactions within these workflows. We then identify eight unique types of leaderboard smells in LBOps. By mitigating these smells, SE teams can improve transparency, accountability, and collaboration in current LBOps practices, fostering a more robust and responsible ecosystem for FM comparison and selection.

Paper Structure (28 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Three-phase study workflow: (1) Leaderboard Collection – Collect ML leaderboards from GitHub, HF Spaces, and PWC, to build a comprehensive dataset; (2) Leaderboard Filtering – Apply predefined inclusion/exclusion criteria to manually review and curate the collected leaderboards; (3) Leaderboard Analysis – Investigate leaderboard documentation, evaluation methodologies, and operational workflows, engaging with operators to derive actionable insights.
  • Figure 2: Distribution of FM leaderboards across different sources. The abbreviations used are: GH (GitHub), HF (Hugging Face Spaces), PWC (Papers With Code), IP (independent platform), and SP (spreadsheet platform). Comma-separated names indicate leaderboards hosted on multiple sources.
  • Figure 3: Schematic representation of workflow patterns in LBOps, ordered by the number of operations. The arrow indicates the execution sequence; the block represents an operation; the circle denotes an artifact or access to it; the color signifies a role (mixed colors indicate multiple possible roles); the loop symbol marks a continuous integration process.
  • Figure 4: The domain model of LBOps leveraged by the five identified workflow patterns. The level of adherence depends on the specific pattern and leaderboard.
  • Figure 5: Collage of screenshots showing leaderboard components, using elements from https://lmarena.ai/?leaderboard and https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard.