Who Gets Cited Most? Benchmarking Long-Context Reasoning on Scientific Articles
Miao Li, Alexander Gurung, Irina Saparina, Mirella Lapata
TL;DR
SciTrek is a scalable, SQL-grounded benchmark for evaluating long-context reasoning over full-text scientific articles. It assembles article collections of 64K–1M tokens and derives ground-truth answers from structured article metadata, so natural-language questions and verifiable answers can be generated automatically, with each SQL query serving as an explicit reasoning trace. Empirical results show that frontier open-weight and proprietary LLMs struggle with long contexts, with only modest gains from supervised fine-tuning and reinforcement learning. The benchmark supports fine-grained error analysis, revealing systematic difficulties with counting, numeric operations, and negation, and it generalizes to other domains with explicit entities and relations.
Abstract
We introduce SciTrek, a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often focus on simple information retrieval tasks or employ artificial contexts. SciTrek addresses these limitations with benchmark questions that require aggregating and synthesizing information across multiple full-text scientific articles. The questions and their ground-truth answers are automatically generated by formulating them as SQL queries over a database constructed from article metadata (i.e., titles, authors, and references). These SQL queries provide explicit, verifiable reasoning processes that enable fine-grained error analysis of model answers, and the data construction scales to contexts of up to 1M tokens with minimal supervision. Experiments on open-weight and proprietary LLMs show that SciTrek poses significant challenges as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings in frontier LLMs' ability to perform numerical operations and to locate information accurately in long contexts.
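
The SQL-grounded construction described above can be illustrated with a minimal sketch. It assumes a hypothetical three-table schema (`articles`, `authors`, `refs`), a hand-written question/query pair, and two toy articles; none of these names or values come from SciTrek itself, and the benchmark's actual schema and question templates may differ. The point is only that the answer is computed by executing the query over metadata, so it is verifiable without human annotation.

```python
import sqlite3

def build_db(articles):
    """Load article metadata into an in-memory SQLite database.

    The schema (articles / authors / refs) is an illustrative assumption,
    not the schema used by the benchmark.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE authors  (article_id INTEGER, name TEXT);
        CREATE TABLE refs     (article_id INTEGER, cited_title TEXT);
        """
    )
    for i, art in enumerate(articles):
        conn.execute("INSERT INTO articles VALUES (?, ?)", (i, art["title"]))
        conn.executemany(
            "INSERT INTO authors VALUES (?, ?)",
            [(i, name) for name in art["authors"]],
        )
        conn.executemany(
            "INSERT INTO refs VALUES (?, ?)",
            [(i, ref) for ref in art["references"]],
        )
    return conn

# A hypothetical question paired with the SQL query that defines its answer.
QUESTION = "How many authors does the article with the most references have?"
SQL = """
SELECT COUNT(*) FROM authors
WHERE article_id = (
    SELECT article_id FROM refs
    GROUP BY article_id
    ORDER BY COUNT(*) DESC, article_id
    LIMIT 1
)
"""

def ground_truth(conn, sql):
    """Execute the query; its single scalar result is the reference answer."""
    return conn.execute(sql).fetchone()[0]

# Toy metadata standing in for a collection of full-text articles.
toy = [
    {"title": "Paper A", "authors": ["Ann", "Bo"], "references": ["X", "Y", "Z"]},
    {"title": "Paper B", "authors": ["Cy"], "references": ["X"]},
]

conn = build_db(toy)
answer = ground_truth(conn, SQL)
print(QUESTION, "->", answer)
```

In this sketch the model under evaluation would see only the article texts and the natural-language question, while the query stays hidden and serves as the reference reasoning trace: its subquery (find the article with the most references) and outer aggregate (count its authors) correspond to the steps a correct answer must carry out.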
