Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP

Omer Goldman, Alon Jacovi, Aviv Slobodkin, Aviya Maimon, Ido Dagan, Reut Tsarfaty

TL;DR

Long-context evaluation has been dominated by input length, which conflates diverse tasks. The authors propose a two-axis taxonomy (Dispersion and Scope) to distinguish difficulty beyond mere length. They survey existing benchmarks and argue that genuinely hard tasks, those with both high dispersion and high scope, are under-explored. They also analyze natural versus synthetic task construction and advocate principled benchmark design that targets both axes, including domain-specific long texts and structured data. The framework aims to guide the creation of more reliable, informative long-context benchmarks that reveal genuine growth in model capability.

Abstract

Improvements in language models' capabilities have pushed their applications towards longer contexts, making long-context evaluation and development an active research area. However, many disparate use-cases are grouped together under the umbrella term of "long-context", defined simply by the total length of the model's input, including, for example, Needle-in-a-Haystack tasks, book summarization, and information aggregation. Given their varied difficulty, in this position paper we argue that conflating different tasks by their context length is unproductive. As a community, we require a more precise vocabulary to understand what makes long-context tasks similar or different. We propose to unpack the taxonomy of long-context tasks based on the properties that make them more difficult with longer contexts. We propose two orthogonal axes of difficulty: (I) Diffusion: How hard is it to find the necessary information in the context? (II) Scope: How much necessary information is there to find? We survey the literature on long-context, provide justification for this taxonomy as an informative descriptor, and situate the literature with respect to it. We conclude that the most difficult and interesting settings, whose necessary information is very long and highly diffused within the input, are severely under-explored. By using a descriptive vocabulary and discussing the relevant properties of difficulty in long-context, we can implement more informed research in this area. We call for a careful design of tasks and benchmarks with distinctly long context, taking into account the characteristics that make it qualitatively different from shorter context.

Paper Structure (22 sections, 2 figures, 1 table)

This paper contains 22 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: A taxonomy of long context tasks based on the distribution of the needed information in the text. Tasks with larger scope and higher dispersion are more difficult (indicated by shade) and more indicative of the long context capabilities of large language models.
  • Figure 2: Our subjective judgment of the distribution of long-context benchmarks for each task, categorized by their scope and dispersion characteristics, with the four quadrants marked by dashed lines. Difficulty is expressed by shade, where red is more difficult and green is easier. Notably, some tasks, like question answering (QA), appear in multiple quadrants, as different benchmarks demand varying levels of scope and dispersion (e.g., a single fact versus multiple facts spread across a document). For a detailed breakdown of benchmarks and their task associations, refer to the paper's section classifying benchmarks by scope and dispersion.
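
The quadrant layout described in the figure captions can be made concrete with a small sketch. Everything below is an illustrative assumption rather than the authors' method: the `TaskProfile` class, the 0-to-1 scores, the 0.5 threshold, and the example placements are all hypothetical. The paper itself describes its placements as subjective judgments and does not assign numeric scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical profile of a long-context task on the two axes."""
    name: str
    scope: float       # assumed 0..1 score: how much necessary information there is
    dispersion: float  # assumed 0..1 score: how hard that information is to find


def quadrant(task: TaskProfile, threshold: float = 0.5) -> str:
    """Map a task to one of four quadrants; the threshold is an arbitrary cut-off."""
    scope = "high" if task.scope >= threshold else "low"
    dispersion = "high" if task.dispersion >= threshold else "low"
    return f"{scope} scope / {dispersion} dispersion"


# Made-up scores for illustration only; they are not taken from the paper.
examples = [
    TaskProfile("QA over a single fact", scope=0.1, dispersion=0.1),
    TaskProfile("QA over multiple facts spread across a document", scope=0.7, dispersion=0.8),
]

for task in examples:
    print(f"{task.name}: {quadrant(task)}")
```

Under this sketch the same task family (QA) lands in different quadrants depending on what a given benchmark demands, which is the point the Figure 2 caption makes.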