Who Gets Cited Most? Benchmarking Long-Context Reasoning on Scientific Articles

Miao Li, Alexander Gurung, Irina Saparina, Mirella Lapata

TL;DR

SciTrek is a scalable, SQL-grounded benchmark for evaluating long-context reasoning over full-text scientific articles. It assembles article collections of 64K to 1M tokens and derives ground-truth answers from structured article metadata, so natural-language questions and verifiable answers can be generated automatically, with each database query serving as an explicit reasoning trace. Empirical results show that frontier open-weight and proprietary LLMs struggle with long contexts, with only modest gains from supervised fine-tuning and reinforcement learning. The benchmark supports fine-grained error analysis, which reveals systematic difficulties with counting, numeric operations, and negation, and it generalizes to other domains with explicit entities and relations.

Abstract

We introduce SciTrek, a novel question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. Current long-context benchmarks often focus on simple information retrieval tasks or employ artificial contexts. SciTrek addresses these limitations by creating benchmark questions that require information aggregation and synthesis across multiple full-text scientific articles. The questions and their ground-truth answers are automatically generated by formulating them as SQL queries over a database constructed from article metadata (i.e., titles, authors, and references). These SQL queries provide explicit, verifiable reasoning processes that enable fine-grained error analysis of model answers, and the data construction scales to contexts of up to 1M tokens with minimal supervision. Experiments on open-weight and proprietary LLMs show that SciTrek poses significant challenges as the context length increases, with supervised fine-tuning and reinforcement learning offering only limited gains. Our analysis reveals systematic shortcomings in frontier LLMs' ability to perform numerical operations and accurately locate information in long contexts.
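
To make the SQL-grounded idea concrete, here is a minimal sketch of how a ground-truth answer could be derived from article metadata. It is an illustration under stated assumptions, not the authors' pipeline: the schema (`articles`, `authors`, `citations`), the toy rows, and the helper names `build_toy_db` and `ground_truth` are all hypothetical. The query corresponds to a natural-language question such as "Which article in the collection is cited most often by the other articles, and how many times?"

```python
import sqlite3


def build_toy_db() -> sqlite3.Connection:
    """Create an in-memory metadata database (hypothetical schema and rows)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE authors (article_id INTEGER, name TEXT);
        CREATE TABLE citations (citing_id INTEGER, cited_id INTEGER);
        """
    )
    conn.executemany(
        "INSERT INTO articles VALUES (?, ?)",
        [(1, "Paper A"), (2, "Paper B"), (3, "Paper C")],
    )
    conn.executemany(
        "INSERT INTO authors VALUES (?, ?)",
        [(1, "Ada"), (1, "Bo"), (2, "Ada"), (3, "Cy")],
    )
    # Paper B cites Paper A; Paper C cites both Paper A and Paper B.
    conn.executemany(
        "INSERT INTO citations VALUES (?, ?)",
        [(2, 1), (3, 1), (3, 2)],
    )
    return conn


# The SQL query doubles as the explicit reasoning trace:
# join citations to articles, count per cited article, sort, keep the top row.
MOST_CITED_SQL = """
    SELECT a.title, COUNT(*) AS n_citations
    FROM citations AS c
    JOIN articles AS a ON a.id = c.cited_id
    GROUP BY a.id, a.title
    ORDER BY n_citations DESC, a.title ASC
    LIMIT 1
"""


def ground_truth(conn: sqlite3.Connection, sql: str) -> list:
    """Execute a query and return its rows as the verifiable answer."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(ground_truth(build_toy_db(), MOST_CITED_SQL))
```

Because the answer comes from executing the query rather than from reading the article text, it can be checked mechanically, and each clause of the query (join, count, sort) marks a step where a model's answer could be inspected for errors.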

Paper Structure

This paper contains 20 sections, 7 figures, and 17 tables.

Figures (7)

  • Figure 1: Overview of SciTrek construction process: we gather article collections of varying scales; we then obtain SQL queries and their answers based on databases (which store collection-specific metadata); and finally we convert SQL queries to natural language questions. The final dataset consists of input contexts (full-text scientific articles), natural language questions, and their answers, all highlighted in blue.
  • Figure 2: Fine-grained analysis of the best-performing models in terms of exact match (EM). We examine how model performance varies across question topics (top), information processing skills (middle), and subjects (bottom) when using full-text scientific articles as context with an input length of 128K.
  • Figure 3: Prompt template for converting SQL queries to natural language questions.
  • Figure 4: Prompt template for converting natural language questions to SQL queries.
  • Figure 5: Prompt template using full-text articles as context.
  • ...and 2 more figures
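
The Figure 2 caption reports results in terms of exact match (EM). As a rough sketch of how such a metric is commonly computed, the snippet below compares normalized strings. The normalization steps (lowercasing, stripping punctuation, collapsing whitespace) and the function names are assumptions made for illustration; this page does not specify the paper's exact scoring rules.

```python
import re
import string


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed rules)."""
    text = answer.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


def em_score(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(references)
```

Under these assumed rules, `exact_match(" Paper A. ", "paper a")` holds, while a prediction of "3" against a reference of "2" counts as a miss with no partial credit.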