Who Gets Cited Most? Benchmarking Long-Context Reasoning on Scientific Articles

Miao Li; Alexander Gurung; Irina Saparina; Mirella Lapata

TL;DR

SciTrek is a scalable, SQL-grounded benchmark for evaluating long-context reasoning over full-text scientific articles. It assembles article collections of 64K–1M tokens and derives ground-truth answers from SQL queries over structured article metadata, so natural-language questions and verifiable answers can be generated automatically and each question comes with an explicit reasoning trace. Empirical results show that frontier open-weight and proprietary LLMs struggle with long contexts, with only modest gains from supervised fine-tuning and reinforcement learning. The benchmark supports fine-grained error analysis, revealing systematic difficulties with counting, numeric operations, and negation, and its construction generalizes to other domains with explicit entities and relations.

Abstract

We introduce SciTrek, a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often focus on simple information retrieval tasks or employ artificial contexts. SciTrek addresses these limitations with benchmark questions that require information aggregation and synthesis across multiple full-text scientific articles. The questions and their ground-truth answers are generated automatically by formulating them as SQL queries over a database constructed from article metadata (i.e., titles, authors, and references). These SQL queries provide explicit, verifiable reasoning processes that enable fine-grained error analysis of model answers, and the data construction scales to contexts of up to 1M tokens with minimal supervision. Experiments on open-weight and proprietary LLMs show that SciTrek poses significant challenges as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings in frontier LLMs' ability to perform numerical operations and accurately locate information in long contexts.
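To make the idea of deriving answers from SQL queries over article metadata concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The schema (`articles`, `authors`, `refs`), the helper names `build_db` and `most_cited`, and the toy records are all assumptions made for illustration; they are not the schema, queries, or data used in SciTrek.

```python
import sqlite3


def build_db(articles):
    """Load toy article metadata into an in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE authors  (article_id INTEGER, name TEXT);
        CREATE TABLE refs     (article_id INTEGER, cited_title TEXT);
        """
    )
    for i, art in enumerate(articles):
        conn.execute("INSERT INTO articles VALUES (?, ?)", (i, art["title"]))
        conn.executemany(
            "INSERT INTO authors VALUES (?, ?)", [(i, a) for a in art["authors"]]
        )
        conn.executemany(
            "INSERT INTO refs VALUES (?, ?)", [(i, r) for r in art["references"]]
        )
    return conn


def most_cited(conn):
    """Return (cited_title, n_citing_articles) for the most-cited reference."""
    query = """
        SELECT cited_title, COUNT(DISTINCT article_id) AS n
        FROM refs
        GROUP BY cited_title
        ORDER BY n DESC, cited_title ASC  -- alphabetical tie-break, an assumption
        LIMIT 1
    """
    return conn.execute(query).fetchone()


# Invented toy collection; titles and names are placeholders, not real papers.
TOY_ARTICLES = [
    {"title": "Paper A", "authors": ["Ada", "Bo"], "references": ["Ref 1", "Ref 2"]},
    {"title": "Paper B", "authors": ["Cy"], "references": ["Ref 1", "Ref 3"]},
    {"title": "Paper C", "authors": ["Ada", "Di", "Ed"], "references": ["Ref 1", "Ref 2"]},
]

if __name__ == "__main__":
    conn = build_db(TOY_ARTICLES)
    print(most_cited(conn))
```

In this sketch the query doubles as the reasoning trace: a question such as "Which reference is cited by the most articles in the collection?" decomposes into a grouping step, a counting step, and an ordering step, and a model's answer can be checked against each of them.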
