Long Context RAG Performance of Large Language Models

Quinn Leng; Jacob Portes; Sam Havens; Matei Zaharia; Michael Carbin

TL;DR

The paper examines whether longer context windows in LLMs improve Retrieval Augmented Generation (RAG) performance. It benchmarks 20 models on three datasets at context lengths from 2k to 128k tokens (up to 2M for Gemini) and analyzes how retrieval recall translates into answer quality, as scored by an LLM-based judge. The benefits are not uniform: several recent state-of-the-art models sustain or improve accuracy at long contexts, while many open-source models degrade beyond tens of thousands of tokens. Failure modes also vary by model and include refusals and triggered safety filters. The work draws out practical implications for system design, including when feeding the full context might be feasible versus relying on retrieval, and outlines directions for addressing alignment and safety challenges in long-context RAG settings.

Abstract

Retrieval Augmented Generation (RAG) has emerged as a crucial technique for enhancing the accuracy of Large Language Models (LLMs) by incorporating external information. With the advent of LLMs that support increasingly longer context lengths, there is growing interest in understanding how these models perform in RAG scenarios. Can these new long-context models improve RAG performance? This paper presents a comprehensive study of the impact of increased context length on RAG performance across 20 popular open-source and commercial LLMs. We ran RAG workflows while varying the total context length from 2,000 to 128,000 tokens (and 2 million tokens when possible) on three domain-specific datasets, and report key insights on the benefits and limitations of long context in RAG applications. Our findings reveal that while retrieving more documents can improve performance, only a handful of the most recent state-of-the-art LLMs can maintain consistent accuracy at context lengths above 64k tokens. We also identify distinct failure modes in long-context scenarios, suggesting areas for future research.
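
To make "varying the total context length" concrete, the sketch below fills a fixed token budget with retrieved chunks in rank order. It is an illustration under stated assumptions, not the authors' pipeline: `pack_context` and `count_tokens` are hypothetical names, the whitespace word count stands in for a real tokenizer, and the intermediate budgets between the 2,000 and 128,000 endpoints are assumed doublings.

```python
from typing import List

# Budgets in tokens. Only the 2,000 and 128,000 endpoints come from the
# abstract; the intermediate doublings are an assumption for illustration.
CONTEXT_BUDGETS = [2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000]


def count_tokens(text: str) -> int:
    """Hypothetical stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def pack_context(ranked_chunks: List[str], budget: int) -> List[str]:
    """Keep chunks in retrieval-rank order until the next one would exceed the budget."""
    packed: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        packed.append(chunk)
        used += cost
    return packed
```

Calling `pack_context` once per entry in `CONTEXT_BUDGETS` with the same ranked chunk list yields progressively larger prompts, so any change in answer quality can be attributed to the amount of retrieved text rather than to a change in retrieval ranking.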
