Exploring the Latest LLMs for Leaderboard Extraction
Salomon Kabongo, Jennifer D'Souza, Sören Auer
TL;DR
The paper tackles automated leaderboard extraction from AI research articles by comparing open-source (Mistral-7B, Llama-2-7B) and proprietary (GPT-4-Turbo, GPT-4.o) LLMs across three input contexts: DocTAET, DocREC, and DocFULL. The models are evaluated on extracting $(T, D, M, S)$ quadruples, that is (Task, Dataset, Metric, Score), using ROUGE-based summarization metrics together with classification/NER-style metrics, in few-shot and zero-shot settings, on a large annotated corpus derived from the Papers with Code (PwC) SOTA dataset. The key findings are that open-source models can match or surpass proprietary ones in several contexts, especially with carefully designed input cues (DocTAET) and prompts, while DocREC excels where precision matters. The work demonstrates the practical viability of automated leaderboard construction and provides guidance for future work on context design, domain adaptation, and hybrid-context approaches in scholarly information extraction (IE). Robust extraction of $(T, D, M, S)$ quadruples is central to maintaining up-to-date, reliable leaderboards across AI subfields.
Abstract
The rapid advancements in Large Language Models (LLMs) have opened new avenues for automating complex tasks in AI research. This paper investigates the efficacy of different LLMs (Mistral 7B, Llama-2, GPT-4-Turbo, and GPT-4.o) in extracting leaderboard information from empirical AI research articles. We explore three types of contextual inputs to the models: DocTAET (Document Title, Abstract, Experimental Setup, and Tabular Information), DocREC (Results, Experiments, and Conclusions), and DocFULL (entire document). Our comprehensive study evaluates the performance of these models in generating (Task, Dataset, Metric, Score) quadruples from research papers. The findings reveal significant insights into the strengths and limitations of each model and context type, providing valuable guidance for future AI research automation efforts.
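To make the extraction target concrete, the following is a minimal sketch of how a (Task, Dataset, Metric, Score) quadruple might be represented and how predicted quadruples could be compared against gold annotations. The class name `LeaderboardEntry`, the helper `exact_match_prf`, and the example values are hypothetical and are not taken from the paper; the exact-match set overlap shown here is one simple scoring choice assumed for illustration, not necessarily the evaluation protocol the authors use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaderboardEntry:
    """One (Task, Dataset, Metric, Score) quadruple. Field names are illustrative."""
    task: str
    dataset: str
    metric: str
    score: str  # kept as text, since reported scores may carry units or symbols


def exact_match_prf(predicted, gold):
    """Return (precision, recall, F1) over exactly matching quadruples.

    Assumption: a prediction counts only if all four fields match a gold entry.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented placeholder values, used only to exercise the functions.
gold = [
    LeaderboardEntry("Question Answering", "SQuAD", "F1", "90.0"),
    LeaderboardEntry("Question Answering", "SQuAD", "EM", "85.0"),
]
predicted = [
    LeaderboardEntry("Question Answering", "SQuAD", "F1", "90.0"),
    LeaderboardEntry("Question Answering", "SQuAD", "EM", "84.0"),  # wrong score
]

precision, recall, f1 = exact_match_prf(predicted, gold)
```

Storing the score as a string avoids silently changing values such as percentages or scores with units during parsing. A stricter or more lenient matcher (for example, one that scores each of the four fields separately) could replace `exact_match_prf` without changing the data structure.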
