Effective Context Selection in LLM-based Leaderboard Generation: An Empirical Study

Salomon Kabongo; Jennifer D'Souza; Sören Auer

TL;DR

This work tackles the extraction of (Task, Dataset, Metric, Score) quadruples, abbreviated $(T, D, M, S)$, from AI articles to automate leaderboard generation. It reframes state-of-the-art (SOTA) extraction as a text-generation task and applies instruction finetuning with the FLAN-T5 collection, comparing three context types (DocTAET, DocREC, and DocFULL) to mitigate hallucinations while maximizing extraction accuracy. The study shows that DocTAET context excels for structured summaries and leaderboard classification in few-shot and zero-shot settings, whereas DocREC best supports precise element extraction; DocFULL is less effective in zero-shot scenarios. The results offer practical guidance for context-aware LLM information extraction, and the code is publicly released to enable replication and broader application.

Abstract

This paper explores the impact of context selection on the efficiency of Large Language Models (LLMs) in generating Artificial Intelligence (AI) research leaderboards, a task defined as the extraction of (Task, Dataset, Metric, Score) quadruples from scholarly articles. By framing this challenge as a text generation objective and employing instruction finetuning with the FLAN-T5 collection, we introduce a novel method that surpasses traditional Natural Language Inference (NLI) approaches in adapting to new developments without a predefined taxonomy. Through experimentation with three distinct context types of varying selectivity and length, our study demonstrates the importance of effective context selection in enhancing LLM accuracy and reducing hallucinations, providing a new pathway for the reliable and efficient generation of AI leaderboards. This contribution not only advances the state of the art in leaderboard generation but also sheds light on strategies to mitigate common challenges in LLM-based information extraction.
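To make the task definition concrete, the following is a minimal sketch of how a (Task, Dataset, Metric, Score) quadruple could be represented and recovered from generated text. The JSON output format, the `Quadruple` class, the `parse_quadruples` helper, and the sample values are all illustrative assumptions, not the paper's actual output schema or code.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Quadruple:
    """One leaderboard entry; a hypothetical container for illustration."""
    task: str
    dataset: str
    metric: str
    score: str  # kept as text, since papers report scores in varied formats


def parse_quadruples(generated: str) -> List[Quadruple]:
    """Parse model output into quadruples.

    Assumes (for this sketch only) that the model emits a JSON list of
    objects with the keys Task, Dataset, Metric, and Score. Anything that
    does not match that shape yields an empty list or is skipped.
    """
    try:
        items = json.loads(generated)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    keys = ("Task", "Dataset", "Metric", "Score")
    result = []
    for item in items:
        if isinstance(item, dict) and all(k in item for k in keys):
            result.append(Quadruple(*(str(item[k]) for k in keys)))
    return result


# Hypothetical model output, invented for this example.
sample = '[{"Task": "Question Answering", "Dataset": "SQuAD", "Metric": "F1", "Score": "90.1"}]'
quads = parse_quadruples(sample)
```

Treating malformed output as "no quadruples" rather than raising keeps the parser usable when a model produces free text instead of structured output; whether that is the right policy depends on how the downstream evaluation counts such cases.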

Paper Structure (10 sections, 2 figures, 6 tables)

This paper contains 10 sections, 2 figures, 6 tables.

Introduction
Corpus
Approach
Models
Evaluations
Results and Discussion
Conclusions
Instructions: Qualitative Examples
ROUGE Evaluation Metrics
Additional Data statistics and Hyperparameters

Figures (2)

Figure 1: Two sets of (Task, Dataset, Metric, Score) tuples reported in an AI paper.
Figure 2: Main process overview
