ReliabilityRAG: Effective and Provably Robust Defense for RAG-based Web-Search

Zeyu Shen, Basileal Imana, Tong Wu, Chong Xiang, Prateek Mittal, Aleksandra Korolova

TL;DR

ReliabilityRAG is a framework for adversarial robustness in RAG that explicitly leverages reliability information about retrieved documents. It provides superior robustness against adversarial attacks compared to prior methods, maintains high benign accuracy, and excels in long-form generation tasks where prior robustness-focused methods struggled.

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models by grounding their outputs in external documents. These systems, however, remain vulnerable to attacks on the retrieval corpus, such as prompt injection. RAG-based search systems (e.g., Google's Search AI Overview) present an interesting setting for studying and protecting against such threats: defense algorithms can benefit from built-in reliability signals (such as document ranking), and the systems pose a non-LLM challenge for the adversary due to decades of work to thwart SEO. Motivated by, but not limited to, this scenario, this work introduces ReliabilityRAG, a framework for adversarial robustness that explicitly leverages reliability information of retrieved documents. Our first contribution adopts a graph-theoretic perspective to identify a "consistent majority" among retrieved documents and filter out malicious ones. We introduce a novel algorithm based on finding a Maximum Independent Set (MIS) on a document graph where edges encode contradiction. Our MIS variant explicitly prioritizes higher-reliability documents and provides provable robustness guarantees against bounded adversarial corruption under natural assumptions. Recognizing the computational cost of exact MIS for large retrieval sets, our second contribution is a scalable weighted sample-and-aggregate framework. It explicitly utilizes reliability information, preserving some robustness guarantees while efficiently handling many documents. We present empirical results showing that ReliabilityRAG provides superior robustness against adversarial attacks compared to prior methods, maintains high benign accuracy, and excels in long-form generation tasks where prior robustness-focused methods struggled. Our work is a significant step towards more effective, provably robust defenses against retrieved corpus corruption in RAG.
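To make the "weighted sample and aggregate" idea more concrete, the following is a minimal Python sketch, not the authors' algorithm. Everything specific in it is an assumption made for illustration: the geometric rank weights (`decay`), the sampling-without-replacement routine `sample_indices`, the caller-supplied `answer_fn` standing in for an LLM query, and the majority vote used as the aggregation step. The abstract only states that sampling is reliability-weighted and that results are aggregated.

```python
import random
from collections import Counter


def sample_indices(rng, weights, size):
    """Draw `size` distinct indices, favouring indices with larger weights."""
    pool = list(range(len(weights)))
    w = list(weights)
    chosen = []
    for _ in range(min(size, len(pool))):
        pos = rng.choices(range(len(pool)), weights=w, k=1)[0]
        chosen.append(pool.pop(pos))
        w.pop(pos)
    return sorted(chosen)  # keep the original rank order inside the sample


def weighted_sample_and_aggregate(docs, answer_fn, sample_size, rounds,
                                  decay=0.9, seed=0):
    """Hypothetical sketch: answer on several rank-weighted samples, then vote.

    `docs` is assumed to be ordered from most to least reliable.
    `answer_fn(context_docs)` is a placeholder for querying a model.
    """
    rng = random.Random(seed)
    # Assumed weighting scheme: reliability decays geometrically with rank.
    weights = [decay ** rank for rank in range(len(docs))]
    votes = Counter()
    for _ in range(rounds):
        idx = sample_indices(rng, weights, sample_size)
        votes[answer_fn([docs[i] for i in idx])] += 1
    # Assumed aggregation rule: plain majority vote over per-sample answers.
    return votes.most_common(1)[0][0] if votes else None
```

Under these assumptions, a low-ranked corrupted document appears in fewer samples than a high-ranked benign one, so it influences fewer votes. The actual weighting, per-sample processing, and aggregation used in the paper may differ.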

Paper Structure

This paper contains 57 sections, 4 theorems, 8 equations, 10 figures, 14 tables, 2 algorithms.

Key Result

Theorem 1

Suppose the adversary can corrupt at most $k' \leq \frac{1}{5}k$ documents. The NLI model has error probability of at most $\epsilon_1$ between benign documents and error probability of at most $\epsilon_2$ between benign documents and malicious documents. Let $m = k - k'$ be the number of benign do

Figures (10)

  • Figure 1: Example pipeline of ReliabilityRAG when two of five retrieved documents are corrupted. In the contradiction graph shown, there are two MIS: $\{1, 2, 3\}$ and $\{1, 2, 5\}$. Since $\{1, 2, 3\}$ has the smaller lexicographic order, documents $x_1, x_2, x_3$ are chosen for the final query.
  • Figure 2: Accuracy versus number of attacked documents on NQ
  • Figure 3: Estimated probability that any maximum independent set contains a malicious document as a function of the number of malicious documents $k'$.
  • Figure 4: Accuracy under prompt injection attack at different attack positions ($k=50$)
  • Figure 5: Attack success rate (ASR) under prompt injection attack at different attack positions ($k=50$)
  • ...and 5 more figures
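
The selection step described in the Figure 1 caption can be illustrated with a small Python sketch. It uses brute-force enumeration, which is only practical for a handful of documents, and is not the authors' implementation. The edge list is hypothetical: the caption does not give the contradiction graph, so the edges below were chosen only so that the maximum independent sets come out as $\{1, 2, 3\}$ and $\{1, 2, 5\}$. The sketch also assumes that document indices follow retrieval rank, so the lexicographically smallest set favours higher-ranked documents.

```python
from itertools import combinations


def max_independent_sets(n, edges):
    """Return every maximum independent set over nodes 1..n as sorted tuples."""
    conflict = {frozenset(e) for e in edges}
    for size in range(n, 0, -1):  # try the largest sets first
        found = [
            subset
            for subset in combinations(range(1, n + 1), size)
            if not any(frozenset(p) in conflict for p in combinations(subset, 2))
        ]
        if found:
            return found
    return []


def select_documents(n, edges):
    """Pick the lexicographically smallest maximum independent set."""
    candidates = max_independent_sets(n, edges)
    return min(candidates) if candidates else ()


# Hypothetical contradiction edges consistent with the Figure 1 caption.
example_edges = [(4, 1), (4, 2), (4, 3), (4, 5), (3, 5)]
print(max_independent_sets(5, example_edges))  # [(1, 2, 3), (1, 2, 5)]
print(select_documents(5, example_edges))      # (1, 2, 3)
```

Python compares tuples lexicographically, so `min` performs the tie-break directly. In a real pipeline the edges would come from a contradiction check between document pairs (the theorem above refers to an NLI model for this).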

Theorems & Definitions (7)

  • Theorem 1
  • Theorem: Theorem \ref{lem:mis_imperfect} restated
  • proof: Proof of Theorem \ref{lem:mis_imperfect}
  • Theorem 2
  • proof
  • Theorem 3
  • proof: Proof of Theorem \ref{lem:weighted-sampling-robustness}