Long Context RAG Performance of Large Language Models

Quinn Leng, Jacob Portes, Sam Havens, Matei Zaharia, Michael Carbin

TL;DR

The paper examines whether extending context length in LLMs enhances Retrieval Augmented Generation (RAG) performance. It benchmarks 20 models across three datasets with context lengths from 2k to 128k tokens (up to 2M for Gemini) and analyzes how recall from retrieval translates into answer quality via an LLM-based judge. Key findings show non-uniform benefits: several recent SOTA models sustain or improve accuracy at long contexts, while many open-source models degrade beyond tens of thousands of tokens, accompanied by varied failure modes such as refusals and safety filters. The work highlights practical implications for system design, including when full-context feeding might be feasible vs. relying on retrieval, and outlines directions for addressing alignment and safety challenges in long-context RAG settings.

Abstract

Retrieval Augmented Generation (RAG) has emerged as a crucial technique for enhancing the accuracy of Large Language Models (LLMs) by incorporating external information. With the advent of LLMs that support increasingly longer context lengths, there is a growing interest in understanding how these models perform in RAG scenarios. Can these new long context models improve RAG performance? This paper presents a comprehensive study of the impact of increased context length on RAG performance across 20 popular open source and commercial LLMs. We ran RAG workflows while varying the total context length from 2,000 to 128,000 tokens (and 2 million tokens when possible) on three domain-specific datasets, and report key insights on the benefits and limitations of long context in RAG applications. Our findings reveal that while retrieving more documents can improve performance, only a handful of the most recent state of the art LLMs can maintain consistent accuracy at long context above 64k tokens. We also identify distinct failure modes in long context scenarios, suggesting areas for future research.
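The experimental setup described in the abstract, filling a fixed total context length with retrieved documents and then sweeping that length, can be illustrated with a minimal sketch. This is not the authors' code. The names `count_tokens`, `pack_context`, and `CONTEXT_BUDGETS` are hypothetical, the whitespace word count is a stand-in for a real tokenizer, and the doubling steps between the stated 2,000 and 128,000 endpoints are an assumption.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def pack_context(
    ranked_chunks: List[str],
    budget: int,
    count: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Keep top-ranked chunks, in rank order, until the next one would exceed the budget."""
    packed: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        packed.append(chunk)
        used += cost
    return packed


# Hypothetical sweep: 2,000 doubling up to 128,000 tokens.
# Only the two endpoints come from the abstract; the intermediate steps are assumed.
CONTEXT_BUDGETS = [2_000 * 2**i for i in range(7)]
```

In a sweep of this kind, each budget would yield a different set of retrieved chunks for the same question, so answer quality can be compared across context lengths while the retriever ranking stays fixed.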

Paper Structure

This paper contains 28 sections, 6 figures, 13 tables.

Figures (6)

  • Figure 1: Long context RAG performance of o1, GPT-4, Claude 3/3.5, Gemini 1.5 (gemini-1.5-pro-001 and gemini-1.5-flash-001), Llama 3/3.1, Qwen 2, Mistral and DBRX models on 3 curated RAG datasets (Databricks DocsQA, FinanceBench, and Natural Questions). All values can be found in the results table in the appendix. Model versions are listed in the LLM models table in the appendix.
  • Figure 2: Long context RAG performance on FinanceBench.
  • Figure 3: Failure analysis on the Natural Questions (NQ) dataset for Gemini 1.5 Pro, Claude 3 Sonnet, Mixtral 8x7B, and Llama 3.1 405B. Gemini 1.5 Pro (gemini-1.5-pro-001) increasingly failed tasks at long context lengths due to overly sensitive safety filters, while Claude 3 Sonnet frequently refused to answer due to perceived copyright concerns.
  • Figure S1: Long context RAG performance on Databricks DocsQA.
  • Figure S2: Long context RAG performance on Natural Questions.
  • ...and 1 more figure