LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs -- No Silver Bullet for LC or RAG Routing

Kuan Li, Liwen Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Shuai Wang, Minhao Cheng

TL;DR

LaRA addresses the lack of clear guidance on when to use Retrieval-Augmented Generation (RAG) versus long-context (LC) LLMs for incorporating external knowledge. The authors construct a benchmark with 2326 test cases across novels, academic papers, and financial statements to compare RAG and LC under realistic long-context conditions. They show that the better approach depends on model size, context length, task type, and the characteristics of the retrieved chunks, with RAG helping weaker models and LC excelling for strong, long-context-capable models. The work provides actionable guidelines and is accompanied by code and data to support practical RAG/LC routing decisions.

Abstract

Effectively incorporating external knowledge into Large Language Models (LLMs) is crucial for enhancing their capabilities and addressing real-world needs. Retrieval-Augmented Generation (RAG) offers an effective method for achieving this by retrieving the most relevant fragments and supplying them to the LLM. However, advances in the context window size of LLMs offer an alternative approach, raising the question of whether RAG remains necessary for effectively handling external knowledge. Several existing studies provide inconclusive comparisons between RAG and long-context (LC) LLMs, largely due to limitations in their benchmark designs. In this paper, we present LaRA, a novel benchmark specifically designed to rigorously compare RAG and LC LLMs. LaRA encompasses 2326 test cases across four practical QA task categories and three types of naturally occurring long texts. Through systematic evaluation of seven open-source and four proprietary LLMs, we find that the optimal choice between RAG and LC depends on a complex interplay of factors, including the model's parameter size, long-text capabilities, context length, task type, and the characteristics of the retrieved chunks. Our findings provide actionable guidelines for practitioners to effectively leverage both RAG and LC approaches in developing and deploying LLM applications. Our code and dataset are available at https://github.com/Alibaba-NLP/LaRA.

Paper Structure

This paper contains 44 sections, 3 equations, 14 figures, and 4 tables.

Figures (14)

  • Figure 1: The average accuracy across different context types. The left figure (a) represents a context length of 32k, while the right figure (b) represents a context length of 128k.
  • Figure 2: The accuracy of Qwen-2.5-72B-Instruct and Qwen-2.5-7B-Instruct on LaRA with different numbers and sizes of retrieved chunks.
  • Figure 3: The prompt for extracting named entities from novel chunks. The chunk from which entities are to be extracted is highlighted in red text.
  • Figure 4: The prompt for replacing the names with fictitious ones.
  • Figure 5: The prompt for evaluation.
  • ...and 9 more figures