Evaluating the Retrieval Robustness of Large Language Models

Shuyang Cao, Karthik Radhakrishnan, David Rosenberg, Steven Lu, Pengxiang Cheng, Lu Wang, Shiyue Zhang

TL;DR

This work addresses the reliability of retrieval-augmented generation (RAG) by introducing three metrics (No-Degradation Rate, Retrieval Size Robustness, and Retrieval Order Robustness) that evaluate how LLMs perform when given retrieved context. It builds a 1,500-sample open-domain QA benchmark from Natural Questions, HotpotQA, and ASQA, using Wikipedia-based retrieval with BM25 and a dense embedding model, and evaluates 11 LLMs under three prompting strategies. The study finds that modern LLMs generally exhibit strong retrieval robustness (often >80%, with certain models surpassing 90%). Imperfect robustness nonetheless shows up as sample-level trade-offs that keep models from realizing the full gains of RAG. OwnKnow prompting can further improve robustness, potentially at the cost of maximum RAG performance. These findings provide a practical framework for evaluating and improving RAG reliability in knowledge-intensive tasks, informing deployment choices and future robustness research.

Abstract

Retrieval-augmented generation (RAG) generally enhances large language models' (LLMs) ability to solve knowledge-intensive tasks. But RAG may also lead to performance degradation due to imperfect retrieval and the model's limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order impacts results. To facilitate this study, we establish a benchmark of 1,500 open-domain questions, each with retrieved documents from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, varying degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.

Paper Structure

This paper contains 35 sections, 3 equations, 19 figures.

Figures (19)

  • Figure 1: Comparison of retrieval robustness and QA task performance across various LLMs. The y-axis represents robustness (geometric mean of the three robustness metrics), while the x-axis represents task performance (average across all $k$, $o$, retrievers, and datasets). OpenAI GPT-4o and o3-mini have very close robustness and performance.
  • Figure 2: Our retrieval robustness metrics, targeting three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; (3) whether different document orders lead to consistent results.
  • Figure 3: Performance of the retrievers, measured by the recall of gold answers within the concatenated retrieved documents. The gold answer is considered covered if any of its alternative forms exactly match a substring in the concatenated retrieved documents.
  • Figure 4: The three retrieval robustness metrics and task performance of the evaluated LLMs using vanilla prompting. Model families are indicated by icons, while the variants are indicated by model sizes or names (o3m: o3-mini; sonn: sonnet). The 12B and 123B Mistral models correspond to Mistral-Nemo and Mistral-Large, respectively. Task performance is the QA accuracy averaged across different retrieval sizes and orders. Models generally demonstrate strong retrieval robustness (with scores around 80% or higher). While larger model sizes lead to improved task performance, there is no consistent trend across the retrieval robustness metrics.
  • Figure 5: Task performance of models using vanilla prompting under setups with actual no-degradation rate (NDR) and perfect NDR. Enhancing retrieval robustness could lead to a 12% absolute performance gain for both models.
  • ...and 14 more figures
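
Two of the captions above describe computations concrete enough to sketch: the answer-coverage recall used to score retrievers (Figure 3) and the geometric mean that combines the three robustness metrics into one score (Figure 1). The following is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the newline separator used to concatenate documents, and the absence of any text normalization (casing, punctuation) are all assumptions; the captions specify only an exact substring match and a geometric mean.

```python
from __future__ import annotations

from math import prod
from typing import Sequence


def answer_covered(gold_forms: Sequence[str], documents: Sequence[str]) -> bool:
    """True if any alternative form of the gold answer is an exact substring
    of the concatenated retrieved documents (the Figure 3 criterion)."""
    # Assumption: documents are joined with a newline; the source does not
    # say which separator, if any, is used.
    concatenated = "\n".join(documents)
    # Empty strings are skipped because "" is a substring of everything.
    return any(form in concatenated for form in gold_forms if form)


def retriever_recall(
    samples: Sequence[tuple[Sequence[str], Sequence[str]]],
) -> float:
    """Fraction of (gold_forms, documents) samples whose answer is covered."""
    if not samples:
        raise ValueError("need at least one sample")
    hits = sum(answer_covered(gold, docs) for gold, docs in samples)
    return hits / len(samples)


def overall_robustness(ndr: float, size_rob: float, order_rob: float) -> float:
    """Geometric mean of the three robustness metrics (the Figure 1 y-axis).
    Each input is assumed to be a rate in [0, 1]."""
    return prod((ndr, size_rob, order_rob)) ** (1 / 3)
```

Because the geometric mean multiplies the three rates, a model that scores near zero on any one metric gets a low overall score even if the other two are high, which an arithmetic mean would mask.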