Exploring the Latest LLMs for Leaderboard Extraction

Salomon Kabongo; Jennifer D'Souza; Sören Auer

TL;DR

The paper tackles automated leaderboard extraction from AI research by comparing open-source (Mistral-7B, Llama-2-7B) and proprietary (GPT-4-Turbo, GPT-4.o) LLMs across three input contexts: DocTAET, DocREC, and DocFULL. It evaluates the models on extracting $(T, D, M, S)$ quadruples using ROUGE-based summaries and precise classification/NER metrics, in few-shot and zero-shot settings, on a large annotated corpus derived from the PwC SOTA dataset. Key findings show that open-source models can match or surpass proprietary ones in several contexts, especially with carefully designed input cues (DocTAET) and prompts, while DocREC excels where precision matters. The work demonstrates the practical viability of automated leaderboard construction and provides guidance for future context design, domain adaptation, and hybrid-context approaches in scholarly IE. Robust extraction of $(T, D, M, S)$ quadruples is central to maintaining up-to-date, reliable leaderboards across AI subfields.

Abstract

The rapid advancements in Large Language Models (LLMs) have opened new avenues for automating complex tasks in AI research. This paper investigates the efficacy of different LLMs (Mistral 7B, Llama-2, GPT-4-Turbo, and GPT-4.o) in extracting leaderboard information from empirical AI research articles. We explore three types of contextual inputs to the models: DocTAET (Document Title, Abstract, Experimental Setup, and Tabular Information), DocREC (Results, Experiments, and Conclusions), and DocFULL (entire document). Our comprehensive study evaluates the performance of these models in generating (Task, Dataset, Metric, Score) quadruples from research papers. The findings reveal significant insights into the strengths and limitations of each model and context type, providing valuable guidance for future AI research automation efforts.
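
The three context types can be illustrated with a minimal Python sketch. It assumes a paper has already been split into named sections. The section keys (`experimental_setup`, `tables`, and so on), the `Quadruple` class, and the `build_context` helper are hypothetical names chosen for illustration, not the authors' implementation. Only the grouping of sections per context type comes from the abstract.

```python
from dataclasses import dataclass

# Section groupings follow the abstract; the dictionary keys are
# hypothetical names for pre-parsed paper sections (an assumption).
CONTEXT_SECTIONS = {
    "DocTAET": ["title", "abstract", "experimental_setup", "tables"],
    "DocREC": ["results", "experiments", "conclusions"],
}


@dataclass(frozen=True)
class Quadruple:
    """One leaderboard entry: (Task, Dataset, Metric, Score)."""
    task: str
    dataset: str
    metric: str
    score: str


def build_context(paper: dict, context_type: str) -> str:
    """Join the sections that make up the requested context type."""
    if context_type == "DocFULL":
        keys = list(paper)  # entire document, in stored order
    elif context_type in CONTEXT_SECTIONS:
        keys = CONTEXT_SECTIONS[context_type]
    else:
        raise ValueError(f"unknown context type: {context_type}")
    # Skip sections that are missing or empty in this paper.
    return "\n\n".join(paper[k] for k in keys if paper.get(k))


# Toy paper with placeholder section text (not real data).
paper = {
    "title": "T", "abstract": "A", "experimental_setup": "E",
    "tables": "Tab", "results": "R", "experiments": "X",
    "conclusions": "C",
}
doctaet = build_context(paper, "DocTAET")
docrec = build_context(paper, "DocREC")
docfull = build_context(paper, "DocFULL")
```

Under these assumptions, the selected context string would be placed in the model prompt, and each extracted leaderboard entry would be parsed into a `Quadruple`. Keeping the section groupings in a single mapping makes it easy to experiment with other context designs.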

Paper Structure (14 sections, 3 figures, 8 tables)

Introduction
Related Work
Methodology
Data Collection and Preprocessing
DocTAET
DocREC
DocFULL
Models
Evaluations
Results and Discussion
Conclusions
Instructions: Qualitative Examples
ROUGE Evaluation Metrics
Additional Data statistics and Hyperparameters

Figures (3)

Figure 1: Main process overview
Figure 2: DocTAET representation of the paper titled "Deformable Convolutions and LSTM-based Flexible Event Frame Fusion Network for Motion Deblurring". Dashed lines mark the Task, Dataset, Metric, and Score present in the paper but not captured by Papers with Code.
Figure 3: DocREC representation of the paper titled "Deformable Convolutions and LSTM-based Flexible Event Frame Fusion Network for Motion Deblurring". Dashed lines mark the Task, Dataset, Metric, and Score present in the paper but not captured by Papers with Code.
