NoLiMa: Long-Context Evaluation Beyond Literal Matching

Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze

TL;DR

NoLiMa exposes the limitations of long-context evaluation methods that rely on literal overlaps by designing a needle-in-a-haystack benchmark with minimal lexical overlap between questions and needles. It assesses latent association reasoning across 13 LLMs, revealing substantial performance degradation as context length increases and showing that even Chain-of-Thought prompting offers only partial gains for multi-hop scenarios. Through systematic filtering, large-scale haystack construction, and detailed ablations, the study demonstrates that current models struggle to locate latent cues in long contexts, calling for benchmarks and architectures that prioritize associative reasoning and robust attention. The work provides a publicly available dataset and evaluation code to drive future improvements in long-context understanding and practical deployments in search and retrieval-augmented systems.
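To make "minimal lexical overlap" concrete, here is a minimal sketch that scores word overlap between a question and a needle. The stopword list, the Jaccard measure, and the example sentences are illustrative assumptions, not the benchmark's actual filtering procedure or data.

```python
import re

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "to", "of", "in", "on",
             "has", "been", "is", "which", "who", "what"}


def content_words(text: str) -> set[str]:
    """Lowercase the text, split it into alphanumeric tokens, drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def lexical_overlap(question: str, needle: str) -> float:
    """Jaccard overlap between the content words of a question and a needle."""
    q, n = content_words(question), content_words(needle)
    if not q or not n:
        return 0.0
    return len(q & n) / len(q | n)


# Illustrative sentences (assumed for this sketch, not drawn from the text above).
needle = "Actually, Yuki lives next to the Semper Opera House."
literal_q = "Which character lives next to the Semper Opera House?"
latent_q = "Which character has been to Dresden?"

literal_overlap = lexical_overlap(literal_q, needle)  # shares most content words
latent_overlap = lexical_overlap(latent_q, needle)    # shares no content words
```

A question like `latent_q` can only be answered by knowing that the Semper Opera House is in Dresden, which is the kind of latent association the benchmark targets. A question like `literal_q` can be answered by string matching alone.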

Abstract

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 13 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 11 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. Even models enhanced with reasoning capabilities or CoT prompting struggle to maintain performance in long contexts. We publicly release the dataset and evaluation code at https://github.com/adobe-research/NoLiMa.
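The GPT-4o figures quoted above can be turned into a normalized score with a few lines of arithmetic. The sketch below assumes that "normalized performance" means the long-context score divided by the short-context baseline, and it reuses the 0.85 threshold named in the figure captions. The function name is hypothetical.

```python
def normalized_score(score: float, baseline: float) -> float:
    """Return a long-context score as a fraction of the short-context baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return score / baseline


# Numbers taken from the abstract: GPT-4o baseline and its score at 32K.
gpt4o_baseline = 99.3
gpt4o_at_32k = 69.7

ratio = normalized_score(gpt4o_at_32k, gpt4o_baseline)

# GPT-4o stays above half of its baseline at 32K (one of the exceptions)...
below_half = ratio < 0.5
# ...but would not clear a 0.85 threshold under this assumed definition.
meets_threshold = ratio >= 0.85
```

Under this reading, GPT-4o retains roughly 70% of its baseline at 32K, which is consistent with it being named as an exception to the "below 50%" group.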

Paper Structure

This paper contains 25 sections, 6 figures, 12 tables.

Figures (6)

  • Figure 1: Haystack filtering pipeline for undesired or misleading content
  • Figure 2: Impact of (a) number of hops and (b) inversion of order ("[CHAR] … $W_n$" vs. "$W_n$ … [CHAR]") on normalized performance across GPT-4o and Llama 3.3 70B models. The red dotted line indicates the 0.85 effective threshold.
  • Figure 3: The full sweep plots (a & b) illustrate performance across the entire context window, where 0% corresponds to the beginning of the haystack and 100% to the end. The plots for the last 2K tokens (c & d) depict performance when needle placements are aligned within that range for various context lengths; 0 marks the end of the context, and larger values indicate positions farther from the end (up to 2K tokens inward). The color shading of each plot line represents the tested context length. To minimize noise and highlight trends more clearly, we increased the number of placements from 26 to 51 and applied a moving average with a window size of 12.
  • Figure 4: Needle placements in full sweep (top) vs. last 2K tokens sweep (bottom): In the last 2K setup, placement positions are aligned in different context lengths, unlike the proportion-based positioning in full sweep.
  • Figure 5: Normalized performance comparison across GPT-4o and Llama 3.3 70B models, with and without distractors. The red dotted line marks the 0.85 effective threshold.
  • ...and 1 more figure
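The two placement schemes in the captions for Figures 3 and 4, and the smoothing step from Figure 3, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, "2K" is taken to mean 2000 tokens, placements are spaced evenly, and the moving average is a plain sliding-window mean that yields one value per full window. The captions do not specify these details.

```python
def full_sweep_positions(context_len: int, n_placements: int) -> list[int]:
    """Proportion-based placement: offsets from the start, spread from 0% to 100%."""
    step_count = n_placements - 1
    return [i * context_len // step_count for i in range(n_placements)]


def last_2k_positions(context_len: int, n_placements: int,
                      span: int = 2000) -> list[int]:
    """Aligned placement: fixed distances from the END of the context.

    The distances do not depend on context_len, so positions line up
    across different context lengths (span=2000 is an assumption).
    """
    step_count = n_placements - 1
    distances = [i * span // step_count for i in range(n_placements)]
    return [context_len - d for d in distances]


def moving_average(values: list[float], window: int) -> list[float]:
    """Sliding-window mean; returns len(values) - window + 1 smoothed points."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# 51 placements and a window of 12, as stated in the Figure 3 caption.
positions_8k = full_sweep_positions(8000, 51)
tail_8k = last_2k_positions(8000, 51)
tail_32k = last_2k_positions(32000, 51)
smoothed = moving_average([1.0] * 51, 12)
```

In the full sweep, the same placement index maps to different absolute distances from the end as the context grows. In the last-2K sweep, `8000 - p` for each position `p` in `tail_8k` matches `32000 - p` in `tail_32k`, which is what makes the curves comparable across context lengths.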