U-NIAH: Unified RAG and LLM Evaluation for Long Context Needle-In-A-Haystack
Yunfan Gao, Yun Xiong, Wenlong Wu, Zijing Huang, Bohan Li, Haofen Wang
TL;DR
The paper addresses the lack of a unified evaluation for Retrieval-Augmented Generation (RAG) and large language models (LLMs) in long-context settings by introducing U-NIAH, a framework that brings Needle-in-a-Haystack (NIAH) style evaluation of RAG and long-context LLM approaches under a single setup. It extends the original NIAH with multi-needle, long-needle, and needle-in-needle configurations, varied retrieval settings, and a synthetic Starlight Academy dataset that removes bias from pretrained knowledge. Systematic experiments show that RAG significantly benefits smaller LLMs and reduces the lost-in-the-middle phenomenon, but is sensitive to retrieval noise and chunk ordering, while advanced reasoning LLMs can be less compatible with RAG because they are sensitive to semantic distractors. The work offers actionable guidance for deploying RAG with long contexts, including when to prefer retrieval, how to order retrieved chunks, and how to balance noise against context length. The code is openly available for extension and replication.
Abstract
Recent advancements in Large Language Models (LLMs) have expanded their context windows to unprecedented lengths, sparking debates about the necessity of Retrieval-Augmented Generation (RAG). To address the fragmented evaluation paradigms and limited test cases of existing Needle-in-a-Haystack (NIAH) evaluations, this paper introduces U-NIAH, a unified framework that systematically compares LLMs and RAG methods in controlled long-context settings. Our framework extends traditional NIAH by incorporating multi-needle, long-needle, and needle-in-needle configurations, along with different retrieval settings, while leveraging the synthetic Starlight Academy dataset (a fictional magical universe) to eliminate biases from pre-trained knowledge. Through extensive experiments, we investigate three research questions: (1) performance trade-offs between LLMs and RAG, (2) error patterns in RAG, and (3) RAG's limitations in complex settings. Our findings show that RAG significantly enhances smaller LLMs by mitigating the "lost-in-the-middle" effect and improving robustness, achieving an 82.58% win-rate over LLMs. However, we observe that retrieval noise and reverse chunk ordering degrade performance, and, surprisingly, that advanced reasoning LLMs exhibit reduced RAG compatibility due to their sensitivity to semantic distractors. We identify typical error patterns, including omission due to noise, hallucination under high-noise critical conditions, and self-doubt behaviors. Our work not only highlights the complementary roles of RAG and LLMs, but also provides actionable insights for optimizing deployments. Code: https://github.com/Tongji-KGLLM/U-NIAH.
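To make the NIAH setup concrete, the following is a minimal sketch of how one or more needles might be placed into a haystack at chosen relative depths, which is the basic construction behind single-needle and multi-needle tests. It is an illustration only and is not taken from the U-NIAH codebase: the function name `insert_needles`, the sentence-level granularity, and the placeholder sentences are all assumptions.

```python
from typing import List, Sequence


def insert_needles(
    haystack: Sequence[str],
    needles: Sequence[str],
    depths: Sequence[float],
) -> List[str]:
    """Insert each needle into the haystack at a relative depth.

    Hypothetical helper (not the authors' implementation).
    A depth of 0.0 places the needle before the first haystack sentence,
    and 1.0 places it after the last one.
    """
    if len(needles) != len(depths):
        raise ValueError("needles and depths must have the same length")
    if any(d < 0.0 or d > 1.0 for d in depths):
        raise ValueError("each depth must lie in [0.0, 1.0]")

    n = len(haystack)
    # Map each depth to an index in the ORIGINAL haystack, then process the
    # needles in ascending index order so earlier insertions do not shift
    # later ones. The original position breaks ties deterministically.
    placements = sorted(
        (round(depth * n), order, needle)
        for order, (needle, depth) in enumerate(zip(needles, depths))
    )

    result: List[str] = []
    cursor = 0
    for index, _, needle in placements:
        result.extend(haystack[cursor:index])
        result.append(needle)
        cursor = index
    result.extend(haystack[cursor:])
    return result


if __name__ == "__main__":
    # Placeholder text, made up for this example.
    filler = [f"Filler sentence {i}." for i in range(4)]
    needles = ["Needle A.", "Needle B."]
    print(insert_needles(filler, needles, depths=[0.5, 1.0]))
```

Sweeping `depths` across the range from 0.0 to 1.0 while growing the haystack is the usual way a NIAH grid is built. In a RAG condition, the same document would be chunked and retrieved rather than passed to the model in full.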
