Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li

TL;DR

This work addresses the gap between synthetic needle-in-a-haystack (NIAH) tests and real-world long-context usage by exposing how retrieval heterogeneity and agentic workflows shape LLM performance. It introduces HaystackCraft, a NIAH benchmark built on the Wikipedia hyperlink network with multi-hop QA and a dynamic, LLM-dependent evaluation that simulates agentic query refinement and self-reflection, with context sizes up to $128K$ tokens. Key findings: graph-based reranking via Personalized PageRank improves retrieval and mitigates harmful distractors, with gains of up to about $44\%$, while stronger dense retrievers can introduce more challenging distractors. Dynamic tests reveal cascading self-distraction and poor early stopping even in advanced models. HaystackCraft thus offers a realistic testbed for measuring and advancing robust agentic long-context reasoning and for guiding future evaluation design.

Abstract

Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
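
To make the graph-based reranking step concrete, the following is a minimal sketch of Personalized PageRank (PPR) reranking over a hyperlink graph. It is an illustration under stated assumptions, not the benchmark's implementation: the function names `personalized_pagerank` and `rerank`, the damping factor `alpha=0.85`, the iteration count, and the choice to seed the walk with the retriever's top hits are all hypothetical.

```python
def personalized_pagerank(links, seeds, alpha=0.85, iters=50):
    """Power iteration for PPR on a directed graph.

    links: dict mapping a page to the list of pages it links to.
    seeds: dict mapping a page to a positive restart weight
           (assumed here to come from the initial retriever scores).
    """
    nodes = set(links) | set(seeds)
    for targets in links.values():
        nodes.update(targets)

    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)

    for _ in range(iters):
        # With probability (1 - alpha) the walk jumps back to a seed page.
        nxt = {n: (1.0 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = links.get(n, [])
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:
                # Dangling page: send its mass back to the seed distribution.
                for m in nodes:
                    nxt[m] += alpha * rank[n] * restart[m]
        rank = nxt
    return rank


def rerank(candidates, links, seeds):
    """Order retrieved candidates by their PPR score, highest first."""
    scores = personalized_pagerank(links, seeds)
    return sorted(candidates, key=lambda d: scores.get(d, 0.0), reverse=True)


# Toy hyperlink graph (hypothetical): A -> B -> C -> A, and D -> A.
links = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
seeds = {"A": 1.0}
scores = personalized_pagerank(links, seeds)
ordered = rerank(["D", "C", "B", "A"], links, seeds)
```

In this toy graph, page D links into the seeded neighborhood but receives no links from it, so it scores zero and falls to the bottom of the ranking. This is one way a graph signal could demote a distractor that a text-only retriever ranks highly.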

Paper Structure

This paper contains 27 sections, 6 figures, 18 tables.

Figures (6)

  • Figure 1: Overview of the core challenges that HaystackCraft addresses. (a) Retrieval-Dependent Haystacks. The composition and ordering of the noisy long context ("haystack") are shaped by the retrieval strategy (e.g., sparse, dense, hybrid, and graph-based). (b) Agentic Error Propagation. In dynamic agentic workflows, early errors—such as misidentifying John Dury's death place—can propagate through query refinements. This leads to cascading failures where the agent deviates from the original query's intent and inflates distractor rankings.
  • Figure 2: (1) Retrieval performance improves as the number of retrieved documents ($N$) increases. (2) Multi-hop questions pose larger retrieval challenges. (3) Reranking with PPR consistently boosts performance, especially for multi-hop questions. See Appendix \ref{sec:detailed_retrieval_eval} for the raw numbers.
  • Figure 3: Impact of retrieval strategy on NIAH performance as context size increases. $0$ stands for the case without distractors. All models experience a performance drop as context size increases. Graph-based reranking (dashed lines) consistently improves performance for larger context sizes. See Appendix \ref{appendix:static_NIAH_rank_order} for the raw numbers.
  • Figure 4: F1 score difference between retriever-ranked and random haystack orderings. The ordering impact is highly model-dependent. The Gemma-3 and Qwen2.5-1M families derive a significant and growing benefit from retriever-ranked ordering as context size expands. See Appendix \ref{appendix:static_NIAH_random} for the raw NIAH performance numbers with random haystack orderings.
  • Figure 5: Dynamic NIAH performance. $0$ stands for the case without distractors. (1) Enforced multi-round reasoning leads to a performance drop. (2) Models are generally more robust to wider contexts than to deeper reasoning. (3) Models fail to stop early when allowed a variable number of rounds. For raw experiment numbers, see Appendix \ref{appendix:dynamic_NIAH_raw}.
  • ...and 1 more figure
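
The Figure 4 caption compares F1 scores under two haystack orderings. As a rough sketch of how such a difference could be computed, the code below uses token-overlap F1 in the style commonly used for extractive QA. This is an assumption: the summary does not specify the exact F1 variant or answer normalization, and the names `token_f1` and `ordering_gain` are hypothetical.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer.

    Assumes lowercased whitespace tokenization; a real evaluation
    would likely also strip punctuation and articles.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def ordering_gain(ranked_f1, random_f1):
    """Positive when retriever-ranked ordering helps over random ordering."""
    return ranked_f1 - random_f1


exact = token_f1("London", "London")
verbose = token_f1("the city of London", "London")
gain = ordering_gain(exact, verbose)
```

Here a verbose prediction is penalized through precision even though it contains the reference answer, so the per-example gain reflects both correctness and answer concision.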