Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?
Jonathan Roberts, Kai Han, Samuel Albanie
TL;DR
This work probes how effectively state-of-the-art LLMs use very long contexts, using needle-threading retrieval tasks that require following information scattered across the context. Using synthetic haystacks and 17 frontier models, the study finds that models can follow threads across long contexts, but that their effective context length is typically well below the advertised limit, with accuracy deteriorating as the context grows. It also finds that threading direction matters, that many models can follow multiple threads simultaneously without significant loss in performance (thread safety), and that the same text yields substantially different token counts under different tokenizers, which motivates a model-agnostic effective context length metric. The findings offer practical guidance for long-context deployment, and the released code and data let the community benchmark and improve how LLMs use context.
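To make the idea of a thread concrete, the following is a minimal sketch and not the authors' implementation. It assumes a haystack of random key-value pairs in which each value on the thread is the key of the next pair; that format, and the names `build_haystack` and `follow_thread`, are hypothetical and chosen only for illustration.

```python
import random
import uuid


def build_haystack(num_pairs: int, thread_length: int, seed: int = 0):
    """Build a shuffled key-value haystack that hides one thread.

    Assumed format (hypothetical): every key and value is a random UUID
    string, and each value on the thread is the key of the next pair.
    Returns (haystack, first_key, final_value).
    """
    rng = random.Random(seed)

    def new_id() -> str:
        return str(uuid.UUID(int=rng.getrandbits(128)))

    # A thread of `thread_length` hops needs `thread_length + 1` ids.
    chain = [new_id() for _ in range(thread_length + 1)]
    pairs = list(zip(chain[:-1], chain[1:]))

    # Pad with unrelated distractor pairs, then shuffle so the thread
    # is scattered through the haystack.
    while len(pairs) < num_pairs:
        pairs.append((new_id(), new_id()))
    rng.shuffle(pairs)

    return dict(pairs), chain[0], chain[-1]


def follow_thread(haystack: dict, start: str, hops: int) -> str:
    """Reference answer: look up the current value as the next key, `hops` times."""
    current = start
    for _ in range(hops):
        current = haystack[current]
    return current


haystack, start, answer = build_haystack(num_pairs=1000, thread_length=5)
```

In an evaluation along these lines, the haystack would be serialized as text in the prompt and the model asked for the final value of the thread starting at `start`; `follow_thread` only provides the ground truth to score against.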
Abstract
As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, including their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably thread-safe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared, since they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.
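The point about tokenizers can be illustrated with a small sketch. The two tokenizers below are toy stand-ins (hypothetical, not any model's real tokenizer): one splits on whitespace, the other cuts the text into fixed-width character chunks. The same text gives very different token counts, while the character count stays the same regardless of tokenizer.

```python
def whitespace_tokens(text: str) -> list[str]:
    """Toy coarse tokenizer (hypothetical): one token per whitespace-separated word."""
    return text.split()


def chunk_tokens(text: str, width: int = 3) -> list[str]:
    """Toy fine tokenizer (hypothetical): fixed-width character chunks."""
    return [text[i:i + width] for i in range(0, len(text), width)]


def describe(text: str) -> dict:
    """Report the tokenizer-independent length alongside both token counts."""
    n_chars = len(text)
    n_coarse = len(whitespace_tokens(text))
    n_fine = len(chunk_tokens(text))
    return {
        "characters": n_chars,
        "coarse_tokens": n_coarse,
        "fine_tokens": n_fine,
        "chars_per_coarse_token": n_chars / n_coarse,
        "chars_per_fine_token": n_chars / n_fine,
    }


context = "The quick brown fox jumps over the lazy dog. " * 100
stats = describe(context)
```

Here a nominal "1,000-token" budget would cover a different amount of text under each toy tokenizer, which is why a length measured in characters is the more comparable quantity across models.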
