100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
Wang Yang, Hongye Jin, Shaochen Zhong, Song Jiang, Qifan Wang, Vipin Chaudhary, Xiaotian Han
TL;DR
The paper tackles the challenge of fairly evaluating long-context capabilities in LLMs by exposing a weakness of fixed-length benchmarks: they conflate base ability with long-context reasoning. It introduces 100-LongBench, a length-controllable benchmark built from Real Context Sources and Noisy Context Sources across diverse tasks, paired with QA-filtering to mitigate prior-knowledge leakage. To disentangle long-context capability from baseline performance, it proposes LongScore, defined by $LC_l = \frac{S_l - \text{Base Ability}}{\text{Base Ability}}$ with $\text{Base Ability} = \frac{S_{2k} + S_{4k} + S_{6k}}{3}$, where $S_l$ is the model's score at context length $l$. This enables fairer cross-model comparisons over large contexts (up to $128k$ tokens and beyond). Empirical results on frontier open-source LLMs and domain-specific tasks demonstrate that LongScore reflects true long-context capability more accurately than traditional metrics, guiding future benchmark design and model development.
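As a worked illustration of the two formulas above, the following minimal Python sketch computes Base Ability from the 2k, 4k, and 6k scores and then the relative change $LC_l$ at each longer length. The function names, the dictionary layout, and the example scores are hypothetical and chosen only for illustration; they are not results or code from the paper.

```python
from statistics import mean

# Context lengths that define Base Ability, per the formula in the text.
BASE_LENGTHS = ("2k", "4k", "6k")


def base_ability(scores):
    """Average of the scores at the short (2k, 4k, 6k) context lengths."""
    return mean(scores[length] for length in BASE_LENGTHS)


def long_context_scores(scores):
    """Return LC_l = (S_l - Base Ability) / Base Ability for each longer length."""
    base = base_ability(scores)
    if base == 0:
        raise ValueError("Base Ability is zero; LC_l is undefined.")
    return {
        length: (score - base) / base
        for length, score in scores.items()
        if length not in BASE_LENGTHS
    }


# Hypothetical per-length task scores for one model (illustrative values only).
example_scores = {"2k": 0.60, "4k": 0.58, "6k": 0.62, "32k": 0.54, "128k": 0.45}

for length, lc in long_context_scores(example_scores).items():
    print(f"LC_{length} = {lc:+.2f}")
```

With these made-up scores, Base Ability is 0.60, so a score of 0.54 at 32k gives $LC_{32k} = -0.10$ and a score of 0.45 at 128k gives $LC_{128k} = -0.25$. Because each value is normalized by the model's own short-context score, a model with strong base ability does not automatically look strong on long contexts.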
Abstract
Long-context capability is considered one of the most important abilities of LLMs, as a truly long-context-capable LLM enables users to effortlessly handle many otherwise exhausting tasks -- e.g., digesting a long-form document to find answers vs. directly asking an LLM about it. However, existing real-task-based long-context evaluation benchmarks have two major shortcomings. First, benchmarks like LongBench often do not provide proper metrics to separate long-context performance from the model's baseline ability, making cross-model comparison unclear. Second, such benchmarks are usually constructed with fixed input lengths, which limits their applicability across different models and fails to reveal when a model begins to break down. To address these issues, we introduce a length-controllable long-context benchmark and a novel metric that disentangles baseline knowledge from true long-context capabilities. Experiments demonstrate the superiority of our approach in effectively evaluating LLMs.
