100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?

Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han

TL;DR

The paper tackles the challenge of fairly evaluating long-context capabilities in LLMs by exposing a weakness of fixed-length benchmarks: they conflate base ability with long-context reasoning. It introduces 100-LongBench, a length-controllable benchmark built from Real Context Sources and Noisy Context Sources across diverse tasks, paired with QA-filtering to mitigate prior-knowledge leakage. To disentangle long-context capability from baseline performance, it proposes LongScore, defined by $LC_l = \frac{S_l - \text{Base Ability}}{\text{Base Ability}}$ with $\text{Base Ability} = \frac{S_{2k} + S_{4k} + S_{6k}}{3}$, where $S_l$ is the model's score at context length $l$. Because the score is relative to each model's own short-context performance, it supports less biased cross-model comparisons over large contexts (up to $128k$ tokens and beyond). Empirical results on frontier open-source LLMs and domain-specific tasks indicate that LongScore ranks models by true long-context capability more accurately than traditional metrics, guiding future benchmark design and model development.
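The LongScore formula can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the choice of 2,000/4,000/6,000 tokens for "2k/4k/6k", and the two score tables are assumptions made up for illustration, and any aggregation of $LC_l$ across lengths is left out.

```python
from statistics import mean

# Context lengths (in tokens) whose scores define the base ability.
# Assumption: "2k/4k/6k" is taken as 2,000/4,000/6,000 tokens here.
BASE_LENGTHS = (2_000, 4_000, 6_000)


def base_ability(scores):
    """Average the scores at the short base lengths."""
    return mean(scores[length] for length in BASE_LENGTHS)


def long_score(scores):
    """Return LC_l = (S_l - base) / base for every non-base length l."""
    base = base_ability(scores)
    if base == 0:
        raise ValueError("base ability is zero; LC_l is undefined")
    return {
        length: (score - base) / base
        for length, score in scores.items()
        if length not in BASE_LENGTHS
    }


# Hypothetical scores, invented for illustration only (not from the paper).
model_a = {2_000: 80.0, 4_000: 80.0, 6_000: 80.0, 32_000: 60.0, 128_000: 40.0}
model_b = {2_000: 50.0, 4_000: 50.0, 6_000: 50.0, 32_000: 45.0, 128_000: 40.0}

print(long_score(model_a))
print(long_score(model_b))
```

In this made-up example both models score 40 at 128k tokens, so a raw-score comparison would call them equal. Relative to their own base ability, however, the first model has lost half of its performance and the second only a fifth, which is the distinction the metric is meant to surface.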

Abstract

Long-context capability is considered one of the most important abilities of LLMs, as a truly long context-capable LLM enables users to effortlessly process many originally exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.

Paper Structure

This paper contains 20 sections, 2 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Performance of LM-Infinite (han2024lm), a long-context enhancement method, on three LongBench tasks. The colored dashed lines represent the average score of each model on the corresponding task. The size of each marker corresponds to the proportion of that text length within the entire dataset: the larger the marker, the higher the proportion. The results vary significantly across text lengths within the same dataset. More results for other methods are in the section "Results of models’ long-text enhancement methods on Longbench".
  • Figure 2: Comparison between LLaMA 3.1-8B-Instruct and Qwen 2.5-7B-Instruct on the Counting Star task across varying text lengths. The dashed line represents the average score across all context lengths. LLaMA 3.1-8B-Instruct performs worse than Qwen 2.5-7B-Instruct on short texts but excels on extremely long texts, indicating its superior long-context extension capability.
  • Figure 3: Illustration of the Data Generation Process for the Single-Doc QA Task
  • Figure 4: One sample in Question Answering where models provide accurate answers regardless of context
  • Figure 5: Verification of the reliability of 100-LongBench: results of two models of different sizes from the same LM family, showing their average scores on different tasks. These findings confirm a well-established trend: within the same series, larger models generally outperform smaller ones, reinforcing the reliability of 100-LongBench.
  • ...and 5 more figures