Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards

Furkan Şahinuç; Thy Thy Tran; Yulia Grishina; Yufang Hou; Bei Chen; Iryna Gurevych

TL;DR

This work tackles the challenge of automatically constructing and maintaining scientific leaderboards amid exploding publication volume by introducing SciLead, a manually curated dataset with exhaustive Task-Dataset-Metric-Result (TDMR) annotations across 27 leaderboards from 43 NLP papers. It proposes a three-stage framework based on Retrieval-Augmented Generation that extracts TDMR tuples, normalizes them to a taxonomy or dynamically creates new leaderboards, and ranks papers to form leaderboards. The authors evaluate multiple LLMs under fully pre-defined, partially pre-defined, and cold-start normalization settings, using ETM and IIM as well as leaderboard-level metrics (Recall, Precision, F1, PC, RC, AO). Results show that while LLMs are strong at identifying TDM names and leaderboard structures, they struggle to extract exact result values; GPT-4 Turbo typically performs best in realistic scenarios and remains robust even in cold-start cases. The work demonstrates practical potential for scalable, automated benchmarking and provides insights into the remaining bottlenecks and into directions for broadening coverage and improving result extraction.

Abstract

Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods. Typically, a leaderboard is defined by a task, dataset, and evaluation metric (TDM) triple, allowing objective performance assessment and fostering innovation through benchmarking. However, the exponential increase in publications has made it infeasible to construct and maintain these leaderboards manually. Automatic leaderboard construction has emerged as a solution to reduce manual labor. Existing datasets for this task are based on community-contributed leaderboards without additional curation. Our analysis shows that a large portion of these leaderboards are incomplete, and some of them contain incorrect information. In this work, we present SciLead, a manually curated Scientific Leaderboard dataset that overcomes these problems. Building on this dataset, we propose three experimental settings that simulate real-world scenarios where TDM triples are fully defined, partially defined, or undefined during leaderboard construction. While previous research has only explored the first setting, the latter two are more representative of real-world applications. To address these diverse settings, we develop a comprehensive LLM-based framework for constructing leaderboards. Our experiments and analysis reveal that various LLMs often correctly identify TDM triples while struggling to extract result values from publications. We make our code and data publicly available.

Paper Structure (39 sections, 5 equations, 2 figures, 24 tables, 1 algorithm)

Introduction
Related Work
Scientific Leaderboard Construction.
Scientific Knowledge Graph Construction.
Our SciLead Dataset
Framework
TDMR Extraction
Normalization
Fully Pre-defined TDM Triples.
Partially Pre-defined TDM Triples.
Cold Start.
Leaderboard Construction
Experiments
Experimental Settings
Evaluation Settings
...and 24 more sections

Figures (2)

Figure 1: We first extract task, dataset, metric, and result (TDMR) tuples from scientific publications. Then, we update existing leaderboards of the same TDM (purple and blue). Different from previous work, we also construct a new leaderboard on demand (green).
Figure 2: Our framework in three steps: (1) TDMR Extraction, (2) Normalization, (3) Leaderboard Construction
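
The last two steps in Figure 2 (normalization and leaderboard construction) can be illustrated with a minimal sketch. This is not the authors' implementation: the `TDMR` record, the `normalize` and `build_leaderboards` helpers, the lookup-table taxonomy, the paper identifiers, and the result values are all hypothetical, and the LLM-based extraction step is assumed to have already produced the tuples. Keeping only the best result per paper is likewise an assumption made for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class TDMR(NamedTuple):
    """One extracted task-dataset-metric-result tuple (hypothetical record type)."""
    paper: str
    task: str
    dataset: str
    metric: str
    result: float


def normalize(name: str, taxonomy: dict[str, str]) -> str:
    """Map a surface form to its canonical name.

    Names missing from the taxonomy pass through unchanged, which loosely
    mimics creating a new leaderboard entry on demand.
    """
    return taxonomy.get(name.strip().lower(), name.strip())


def build_leaderboards(tuples, taxonomy, lower_is_better=frozenset()):
    """Group tuples by normalized TDM triple and rank papers by result."""
    boards = defaultdict(dict)
    for t in tuples:
        tdm = (
            normalize(t.task, taxonomy),
            normalize(t.dataset, taxonomy),
            normalize(t.metric, taxonomy),
        )
        lower = tdm[2] in lower_is_better
        best = boards[tdm].get(t.paper)
        # Assumption: keep only the best result each paper reports per TDM.
        if best is None or (t.result < best if lower else t.result > best):
            boards[tdm][t.paper] = t.result
    return {
        tdm: sorted(
            entries.items(),
            key=lambda kv: kv[1],
            reverse=tdm[2] not in lower_is_better,
        )
        for tdm, entries in boards.items()
    }


# Made-up example data; none of these values come from the paper.
taxonomy = {
    "ner": "Named Entity Recognition",
    "named entity recognition": "Named Entity Recognition",
    "conll-2003": "CoNLL-2003",
    "conll 2003": "CoNLL-2003",
    "f1": "F1",
    "f1 score": "F1",
}
extracted = [
    TDMR("paper-A", "NER", "CoNLL-2003", "F1", 92.1),
    TDMR("paper-B", "Named Entity Recognition", "conll 2003", "f1 score", 93.4),
    TDMR("paper-C", "Summarization", "XSum", "ROUGE-L", 37.2),
]
boards = build_leaderboards(extracted, taxonomy)
for tdm, ranking in boards.items():
    print(tdm, ranking)
```

In this toy setup the two NER tuples land on the same leaderboard despite different surface forms, while the summarization tuple, whose names are absent from the taxonomy, opens a new one. A real system would replace the dictionary lookup with the LLM-based normalization that the paper describes.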
