Evaluating the Retrieval Robustness of Large Language Models

Shuyang Cao; Karthik Radhakrishnan; David Rosenberg; Steven Lu; Pengxiang Cheng; Lu Wang; Shiyue Zhang

TL;DR

This work addresses the reliability of retrieval-augmented generation (RAG) by introducing three metrics (No-Degradation Rate, Retrieval Size Robustness, and Retrieval Order Robustness) that evaluate how LLMs perform when given retrieved context. It builds a 1,500-sample open-domain QA benchmark from Natural Questions, HotpotQA, and ASQA, retrieves documents from Wikipedia with BM25 and a dense embedding model, and evaluates 11 LLMs under three prompting strategies. The study finds that modern LLMs generally exhibit strong retrieval robustness (often >80%, with certain models surpassing 90%), yet imperfect robustness shows up as sample-level trade-offs that keep models from realizing the full gains of RAG; OwnKnow prompting can further improve robustness, potentially at the cost of maximum RAG performance. These findings provide a practical framework for evaluating and improving RAG reliability in knowledge-intensive tasks, informing deployment choices and future robustness research.

Abstract

Retrieval-augmented generation (RAG) generally enhances the ability of large language models (LLMs) to solve knowledge-intensive tasks. However, RAG may also degrade performance because of imperfect retrieval and the model's limited ability to leverage retrieved content. In this work, we evaluate the robustness of LLMs in practical RAG setups (henceforth, retrieval robustness). We focus on three research questions: (1) whether RAG is always better than non-RAG; (2) whether more retrieved documents always lead to better performance; and (3) whether document order affects results. To facilitate this study, we establish a benchmark of 1,500 open-domain questions, each with documents retrieved from Wikipedia. We introduce three robustness metrics, each corresponding to one research question. Our comprehensive experiments, involving 11 LLMs and 3 prompting strategies, reveal that all of these LLMs exhibit surprisingly high retrieval robustness; nonetheless, varying degrees of imperfect robustness hinder them from fully utilizing the benefits of RAG.
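As an illustration of the first research question, a sample-level check of whether RAG ever does worse than non-RAG might be sketched as follows. This is only a minimal sketch under an assumed definition (the fraction of samples whose RAG score is at least the non-RAG score); the abstract does not give the paper's exact formula, and the function name `no_degradation_rate` and its per-sample score inputs are hypothetical.

```python
from typing import Sequence


def no_degradation_rate(
    rag_scores: Sequence[float],
    non_rag_scores: Sequence[float],
) -> float:
    """Fraction of samples where the RAG score is >= the non-RAG score.

    ASSUMPTION: this definition is an illustrative reading of
    "whether RAG is always better than non-RAG"; it is not taken
    from the paper's formal definition.
    """
    if len(rag_scores) != len(non_rag_scores):
        raise ValueError("score lists must have the same length")
    if len(rag_scores) == 0:
        raise ValueError("at least one sample is required")

    # Count samples on which adding retrieved context did not hurt.
    not_degraded = sum(
        1 for rag, base in zip(rag_scores, non_rag_scores) if rag >= base
    )
    return not_degraded / len(rag_scores)


# Hypothetical per-sample correctness scores (1.0 = correct, 0.0 = wrong).
example_rate = no_degradation_rate([1.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 1.0])
```

Under this reading, a model can improve on average with RAG while still scoring below 1.0 here, which is the kind of sample-level trade-off the abstract describes.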
