Instruction Finetuning for Leaderboard Generation from Empirical AI Research

Salomon Kabongo; Jennifer D'Souza

TL;DR

This work tackles automated leaderboard generation from empirical AI research by recasting (Task, Dataset, Metric, Score) extraction as an instruction-following generation problem. The authors finetune the FLAN-T5 Large model with 15 instruction templates drawn from SQuAD v2 and DROP, conditioning on each article's DocTAET context to produce structured state-of-the-art (SOTA) outputs. The resulting SOTA-Flan-T5 model improves ROUGE scores on the structured-summarization task and identifies papers that report leaderboards with high accuracy, outperforming prior NLI-based approaches and supporting open-world extraction that is not tied to a fixed taxonomy. The approach promises scalable, domain-adaptive leaderboard construction, with practical implications for dissemination, indexing, and benchmarking across AI research communities; the authors acknowledge data-processing limitations and the need for human validation in production deployments. Proposed extensions include broader domain generalization, better Score extraction, and larger or multi-task instruction-tuned architectures to close the remaining gap with ground-truth leaderboards. The $(T, D, M, S)$ quadruples (Task, Dataset, Metric, Score) are the core structured outputs, enabling richer, machine-actionable summaries of empirical AI progress.

Abstract

This study demonstrates the application of instruction finetuning of pretrained Large Language Models (LLMs) to automate the generation of AI research leaderboards, extracting (Task, Dataset, Metric, Score) quadruples from articles. It aims to streamline the dissemination of advancements in AI research by moving from manual community curation, or from taxonomy-constrained natural language inference (NLI) models, to an automated, generative LLM-based approach. Using the FLAN-T5 model, this research enhances LLMs' adaptability and reliability in information extraction, offering a novel method for structured knowledge representation.
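
To make the framing concrete, the following is a minimal sketch of how quadruple extraction could be posed as an instruction-following task: a condensed article context is wrapped in a question-style prompt, and the model's text response is parsed back into structured entries. It is not the authors' implementation. The template wording, the JSON response format, the "unanswerable" marker, and the names `TEMPLATE`, `build_prompt`, `parse_response`, and `LeaderboardEntry` are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LeaderboardEntry:
    """One (Task, Dataset, Metric, Score) quadruple."""
    task: str
    dataset: str
    metric: str
    score: str


# Hypothetical instruction template; the actual templates used in the
# paper are not reproduced here.
TEMPLATE = (
    "Context: {context}\n\n"
    "Question: Which (Task, Dataset, Metric, Score) results does this "
    "article report? Answer with a JSON list of objects, or with "
    "'unanswerable' if the article reports no leaderboard."
)


def build_prompt(context: str) -> str:
    """Wrap a condensed article context in the instruction template."""
    return TEMPLATE.format(context=context)


def parse_response(text: str) -> List[LeaderboardEntry]:
    """Parse a model response into quadruples (assumed JSON format).

    Returns an empty list for 'unanswerable' or malformed responses.
    """
    text = text.strip()
    if text.lower() == "unanswerable":
        return []
    try:
        items = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    entries = []
    for item in items:
        # Skip anything that is not an object carrying all four fields.
        if not isinstance(item, dict):
            continue
        if not all(k in item for k in ("Task", "Dataset", "Metric", "Score")):
            continue
        entries.append(
            LeaderboardEntry(
                task=str(item["Task"]),
                dataset=str(item["Dataset"]),
                metric=str(item["Metric"]),
                score=str(item["Score"]),
            )
        )
    return entries
```

In this sketch the prompt would be passed to a finetuned sequence-to-sequence model and the decoded text handed to `parse_response`. Scores are kept as strings because reported values may carry units or formatting that a numeric cast would lose.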
