DENIAHL: In-Context Features Influence LLM Needle-In-A-Haystack Abilities
Hui Dai, Dan Pechi, Xinyi Yang, Garvit Banga, Raghav Mantri
TL;DR
DENIAHL introduces a data-centric framework for dissecting Needle-in-a-haystack (NIAH) recall in long-context LLMs by systematically varying data size, data patterns, and data type. The study compares GPT-3.5 and LLaMA-2 7B, showing that recall is highly data-dependent and influenced by primacy biases, and that numbers and letters yield different loss patterns (lost-in-the-middle versus lost-at-the-end). It demonstrates that reliance on global patterns is limited in some models and that longer, mixed-type content can significantly degrade recall, challenging the assumption that context length alone governs NIAH performance. The findings have practical implications for real-world long-context use and suggest strategies such as reranking or RAG to bolster recall in production systems.
Abstract
The Needle-in-a-haystack (NIAH) test is a general task used to assess language models' (LMs') ability to recall particular information from a long input context. This framework, however, does not provide a means of analyzing which factors, beyond context length, contribute to LMs' ability or inability to separate and recall needles from their haystacks. To provide a systematic means of assessing which features contribute to LMs' NIAH capabilities, we developed a synthetic benchmark called DENIAHL (Data-oriented Evaluation of NIAH for LLMs). Our work expands on previous NIAH studies by ablating NIAH features beyond the typical context length, including data type, size, and patterns. We find stark differences between the performance of GPT-3.5 and LLaMA-2 7B on DENIAHL. We also find drops in recall performance when features like item size are increased and, to some degree, when the data type is changed from numbers to letters. This has implications for increasingly large-context models, demonstrating that factors beyond the number of items affect NIAH capabilities.
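
To make the ablated features concrete, the following is a minimal sketch of how a synthetic NIAH prompt could be generated while varying the number of items, the item size, and the data type. It is an illustration under stated assumptions, not the benchmark's actual implementation: the key-value layout, the question wording, and the names `build_niah_prompt`, `random_item`, and `ALPHABETS` are all hypothetical.

```python
import random
import string

# Assumed character sets for the two data types named in the abstract.
ALPHABETS = {"numbers": string.digits, "letters": string.ascii_lowercase}


def random_item(rng, alphabet, size):
    """Return a random string of `size` characters drawn from `alphabet`."""
    return "".join(rng.choice(alphabet) for _ in range(size))


def build_niah_prompt(num_items, item_size, data_type, needle_index, seed=0):
    """Build a key-value haystack and a question about one key (the needle).

    Hypothetical helper: the layout and wording are assumptions made for
    illustration only. Returns (prompt, expected_value).
    """
    alphabet = ALPHABETS[data_type]
    if num_items > len(alphabet) ** item_size:
        raise ValueError("not enough distinct keys for this item size")
    if not 0 <= needle_index < num_items:
        raise ValueError("needle_index must fall inside the haystack")

    rng = random.Random(seed)  # seeded so a given setting is reproducible

    # Draw distinct keys so the needle is unambiguous.
    keys, seen = [], set()
    while len(keys) < num_items:
        key = random_item(rng, alphabet, item_size)
        if key not in seen:
            seen.add(key)
            keys.append(key)

    pairs = [(key, random_item(rng, alphabet, item_size)) for key in keys]
    needle_key, needle_value = pairs[needle_index]

    haystack = "\n".join(f"{key}: {value}" for key, value in pairs)
    prompt = f"{haystack}\n\nWhat is the value for key {needle_key}?"
    return prompt, needle_value


if __name__ == "__main__":
    prompt, expected = build_niah_prompt(
        num_items=5, item_size=4, data_type="numbers", needle_index=2
    )
    print(prompt)
    print("expected:", expected)
```

Sweeping `needle_index` across the haystack while holding the other arguments fixed would give a position-wise recall curve, which is one way to observe effects such as lost-in-the-middle; scoring a model's answer against `expected` is left out of this sketch.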
