DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities

Hui Dai, Dan Pechi, Xinyi Yang, Garvit Banga, Raghav Mantri

TL;DR

DENIAHL introduces a data-centric framework to dissect Needle-in-a-haystack recall in long-context LLMs by systematically varying data size, patterns, and data type. The study compares GPT-3.5 and LLaMA-2 7B, showing that recall is highly data-dependent and influenced by primacy biases, with numbers versus letters yielding different loss patterns (lost-in-the-middle vs lost-at-the-end). It demonstrates that global pattern reliance is limited in some models and that longer, mixed-type content can significantly degrade recall, challenging assumptions that context length alone governs NIAH performance. The findings have practical implications for real-world long-context use and suggest strategies like reranking or RAG to bolster recall in production systems.

Abstract

The Needle-in-a-haystack (NIAH) test is a general task used to assess language models' (LMs') abilities to recall particular information from long input context. This framework, however, does not provide a means of analyzing what factors, beyond context length, contribute to LMs' abilities or inabilities to separate and recall needles from their haystacks. To provide a systematic means of assessing what features contribute to LMs' NIAH capabilities, we developed a synthetic benchmark called DENIAHL (Data-oriented Evaluation of NIAH for LLM's). Our work expands on previous NIAH studies by ablating NIAH features beyond typical context length, including data type, size, and patterns. We find stark differences between the performance of GPT-3.5 and LLaMA-2 7B on DENIAHL, as well as drops in recall performance when features like item size are increased and, to some degree, when data type is changed from numbers to letters. This has implications for increasingly large context models, demonstrating that factors beyond item number impact NIAH capabilities.
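
To make the ablated features concrete, the following is a minimal sketch of how a key-value NIAH prompt could be generated while varying data type, item size, number of pairs, and needle position. It is not the authors' implementation: the function names, the `key: value` layout, and the question wording are all assumptions chosen for illustration.

```python
import random
import string


def make_item(rng, data_type, item_len):
    """Return a random string of digits or lowercase letters (hypothetical helper)."""
    alphabet = string.digits if data_type == "numbers" else string.ascii_lowercase
    return "".join(rng.choice(alphabet) for _ in range(item_len))


def build_haystack(num_pairs, item_len, data_type, needle_pos, seed=0):
    """Build a key-value context and a question about the pair at needle_pos.

    The prompt format is an assumption, not the benchmark's actual template.
    """
    alphabet_size = 10 if data_type == "numbers" else 26
    if num_pairs > alphabet_size ** item_len:
        raise ValueError("not enough distinct keys for this item_len")
    if not 0 <= needle_pos < num_pairs:
        raise ValueError("needle_pos must index an existing pair")

    rng = random.Random(seed)
    pairs = {}
    # Keep drawing until we have num_pairs distinct keys.
    while len(pairs) < num_pairs:
        key = make_item(rng, data_type, item_len)
        if key not in pairs:
            pairs[key] = make_item(rng, data_type, item_len)

    items = list(pairs.items())  # insertion order is preserved
    needle_key, needle_value = items[needle_pos]
    context = "\n".join(f"{k}: {v}" for k, v in items)
    prompt = f"{context}\n\nWhat is the value for key {needle_key}?"
    return prompt, needle_key, needle_value


def is_recalled(model_output, needle_value):
    """Simple recall check: does the output contain the expected value?"""
    return needle_value in model_output


if __name__ == "__main__":
    prompt, key, value = build_haystack(
        num_pairs=20, item_len=4, data_type="letters", needle_pos=10, seed=1
    )
    print(prompt)
```

Sweeping `needle_pos` from 0 to `num_pairs - 1` while holding the other arguments fixed is one way to probe positional effects such as the lost-in-the-middle pattern; switching `data_type` or increasing `item_len` corresponds to the data-type and item-size ablations described above.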

Paper Structure

This paper contains 28 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: In Data-oriented Evaluation of Needle in A Haystack for LLM's (DENIAHL), we demonstrate that properties of LLMs' context data strongly influence performance on Needle-in-a-haystack tasks. In addition to showing more robust performance in GPT-3.5 than in LLaMA-2 7B, we find that context containing numbers presents the typical "lost-in-the-middle" phenomenon, whereas letter data is also poorly recalled at the end of the LLM's input context.
  • Figure 2: LLaMA-2 7B exhibits the "lost-in-the-middle" effect, where accuracy is higher when the target key-value pairs are at the beginning or end of the input context versus in the middle of the context.
  • Figure 3: ROUGE scores are generally strong for both models on the Needle-in-a-haystack benchmark.
  • Figure 4: Varying data size by changing the total number of key-value pairs and the number of items influences LLaMA-2 7B performance.
  • Figure 5: Key-value pairs following numerical or letter patterns demonstrate models' preferences for local vs global information
  • ...and 1 more figures