Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards

Furkan Şahinuç, Thy Thy Tran, Yulia Grishina, Yufang Hou, Bei Chen, Iryna Gurevych

TL;DR

This work tackles the challenge of automatically constructing and maintaining scientific leaderboards amid rapidly growing publication volume by introducing SciLead, a manually curated dataset with exhaustive Task-Dataset-Metric-Result (TDMR) annotations across 27 leaderboards from 43 NLP papers. It proposes a three-stage framework based on Retrieval-Augmented Generation that extracts TDMR tuples, normalizes them to a taxonomy or dynamically creates new leaderboards, and ranks papers to form leaderboards. The authors evaluate multiple LLMs under fully pre-defined, partially pre-defined, and cold-start normalization settings, using ETM and IIM as well as leaderboard-level metrics (Recall, Precision, F1, PC, RC, AO). Results show that while LLMs are strong at identifying TDM names and leaderboard structures, they struggle to extract exact result values. GPT-4 Turbo typically performs best in realistic scenarios and remains robust even in cold-start cases. The work demonstrates practical potential for scalable, automated benchmarking and provides insights into the remaining bottlenecks and into directions for broadening coverage and improving result extraction.

Abstract

Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods. Typically, a leaderboard is defined by a task, dataset, and evaluation metric (TDM) triple, allowing objective performance assessment and fostering innovation through benchmarking. However, the exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually. Automatic leaderboard construction has emerged as a solution to reduce manual labor. Existing datasets for this task are based on the community-contributed leaderboards without additional curation. Our analysis shows that a large portion of these leaderboards are incomplete, and some of them contain incorrect information. In this work, we present SciLead, a manually-curated Scientific Leaderboard dataset that overcomes the aforementioned problems. Building on this dataset, we propose three experimental settings that simulate real-world scenarios where TDM triples are fully defined, partially defined, or undefined during leaderboard construction. While previous research has only explored the first setting, the latter two are more representative of real-world applications. To address these diverse settings, we develop a comprehensive LLM-based framework for constructing leaderboards. Our experiments and analysis reveal that various LLMs often correctly identify TDM triples while struggling to extract result values from publications. We make our code and data publicly available.
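
The three settings described above differ in how much of the TDM taxonomy is known before normalization. The following minimal sketch illustrates that difference under stated assumptions: the `Taxonomy` class, the `allow_new` flag, and the string-matching helper `_key` are hypothetical names introduced here, and the crude string canonicalization stands in for the LLM-based normalization the paper uses. It is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


def _key(name: str) -> str:
    """Crude canonical form: lowercase, alphanumerics only.

    Assumption: this is a stand-in for an LLM deciding whether two
    names refer to the same task, dataset, or metric.
    """
    return "".join(ch for ch in name.lower() if ch.isalnum())


@dataclass
class Taxonomy:
    # canonical key -> display name of a known task, dataset, or metric
    known: Dict[str, str] = field(default_factory=dict)

    def normalize(self, name: str, allow_new: bool) -> Optional[str]:
        """Map `name` to a known entry; optionally register it as new."""
        k = _key(name)
        if k in self.known:
            return self.known[k]
        if allow_new:
            # Partially defined or cold-start setting: extend the taxonomy.
            self.known[k] = name
            return name
        # Fully defined setting: names outside the taxonomy are rejected.
        return None


# Fully defined: the taxonomy is complete and closed.
fully_defined = Taxonomy({"f1": "F1", "conll2003": "CoNLL-2003"})
closed_hit = fully_defined.normalize("CoNLL 2003", allow_new=False)
closed_miss = fully_defined.normalize("Accuracy", allow_new=False)

# Partially defined: known names are reused, unseen names are added.
partially_defined = Taxonomy({"f1": "F1"})
added = partially_defined.normalize("Accuracy", allow_new=True)

# Cold start: the taxonomy begins empty and is built from extracted names.
cold_start = Taxonomy()
first = cold_start.normalize("Exact Match", allow_new=True)
second = cold_start.normalize("exact-match", allow_new=True)
```

In this sketch the only difference between the settings is the initial content of `known` and whether new entries may be registered, which mirrors the abstract's distinction between fully defined, partially defined, and undefined TDM triples.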

Paper Structure

This paper contains 39 sections, 5 equations, 2 figures, 24 tables, and 1 algorithm.

Figures (2)

  • Figure 1: We first extract task, dataset, metric, and result (TDMR) tuples from scientific publications. Then, we update existing leaderboards of the same TDM (purple and blue). Unlike previous work, we also construct a new leaderboard on demand (green).
  • Figure 2: Our framework in three steps: (1) TDMR Extraction, (2) Normalization, (3) Leaderboard Construction.
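
The final step in Figure 2, leaderboard construction, can be sketched as grouping normalized TDMR tuples by their TDM triple and ranking papers by result. This is an illustrative sketch only: the `TDMR` record type, the `build_leaderboards` function, the `lower_is_better` parameter, and the example papers and scores are all hypothetical and do not come from the paper or from SciLead.

```python
from collections import defaultdict
from typing import NamedTuple


class TDMR(NamedTuple):
    paper: str
    task: str
    dataset: str
    metric: str
    result: float


def build_leaderboards(records, lower_is_better=frozenset()):
    """Group normalized TDMR records by (task, dataset, metric) and rank papers.

    Assumption: metrics are "higher is better" unless their name is
    listed in `lower_is_better`.
    """
    boards = defaultdict(list)
    for r in records:
        boards[(r.task, r.dataset, r.metric)].append((r.paper, r.result))

    ranked = {}
    for tdm, rows in boards.items():
        descending = tdm[2] not in lower_is_better
        ranked[tdm] = sorted(rows, key=lambda row: row[1], reverse=descending)
    return ranked


# Made-up records for illustration; the values are not real results.
records = [
    TDMR("paper-A", "NER", "CoNLL-2003", "F1", 91.0),
    TDMR("paper-B", "NER", "CoNLL-2003", "F1", 93.5),
    TDMR("paper-C", "Language Modeling", "PTB", "Perplexity", 60.0),
    TDMR("paper-D", "Language Modeling", "PTB", "Perplexity", 55.0),
]
leaderboards = build_leaderboards(records, lower_is_better={"Perplexity"})

for tdm, rows in leaderboards.items():
    print(tdm, rows)
```

A record whose TDM triple has not been seen before simply opens a new key in the dictionary, which corresponds to constructing a new leaderboard on demand as in Figure 1.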