SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model

Xun Liang, Simin Niu, Zhiyu Li, Sensen Zhang, Hanyu Wang, Feiyu Xiong, Jason Zhaoxin Fan, Bo Tang, Shichao Song, Mengwei Wang, Jiawei Yang

TL;DR

SafeRAG is a benchmark for evaluating security in Retrieval-Augmented Generation. It introduces four attack tasks (silver noise, inter-context conflict, soft ad, and white DoS) and a manually annotated SafeRAG dataset covering them. It formalizes a threat framework in which attack contexts can be injected at the retrieval, filter, or generation stage, and proposes retrieval and generation safety metrics, including RA, F1-based measures, and attack success rate (ASR). Experiments on 14 RAG components across multiple domains reveal substantial vulnerabilities: existing retrievers, filters, and generators can be bypassed, and generation quality degrades under attack. The work provides a practical Chinese-language security benchmark, insights into component robustness, and guidance for building more secure RAG systems, together with a discussion of limitations and ethical considerations.

Abstract

The indexing-retrieval-generation paradigm of retrieval-augmented generation (RAG) has been highly successful in solving knowledge-intensive tasks by integrating external knowledge into large language models (LLMs). However, the incorporation of external and unverified knowledge increases the vulnerability of LLMs because attackers can perform attack tasks by manipulating knowledge. In this paper, we introduce a benchmark named SafeRAG designed to evaluate RAG security. First, we classify attack tasks into silver noise, inter-context conflict, soft ad, and white Denial-of-Service. Next, we construct a RAG security evaluation dataset (i.e., the SafeRAG dataset), primarily by hand, for each task. We then use the SafeRAG dataset to simulate various attack scenarios that RAG may encounter. Experiments conducted on 14 representative RAG components demonstrate that RAG exhibits significant vulnerability to all attack tasks, and that even the most apparent attack task can easily bypass existing retrievers, filters, or advanced LLMs, degrading RAG service quality. Code is available at: https://github.com/IAAR-Shanghai/SafeRAG.
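
To make the evaluation setup more concrete, the following is a minimal Python sketch of injecting attack contexts into a context list at a chosen ratio and scoring an attack success rate (ASR) by keyword matching. It is an illustration under stated assumptions, not the authors' implementation: the function names `inject_attack_contexts` and `attack_success_rate`, the `ratio`/`k` parameters, and the keyword-matching rule are all hypothetical and may differ from the definitions used in the paper and its repository.

```python
import random
from typing import List, Sequence


def inject_attack_contexts(
    golden: Sequence[str],
    attack: Sequence[str],
    ratio: float,
    k: int,
    seed: int = 0,
) -> List[str]:
    """Build a list of up to k contexts in which roughly `ratio` are attack texts.

    Hypothetical helper: the mixing rule here is an assumption for illustration.
    """
    rng = random.Random(seed)
    n_attack = min(len(attack), round(ratio * k))
    n_golden = min(len(golden), k - n_attack)
    mixed = list(attack[:n_attack]) + list(golden[:n_golden])
    rng.shuffle(mixed)  # avoid giving attack texts a fixed position
    return mixed


def attack_success_rate(responses: Sequence[str], attack_keywords: Sequence[str]) -> float:
    """Fraction of responses containing at least one attack keyword.

    Assumed scoring rule: plain substring matching on the generated response.
    """
    if not responses:
        return 0.0
    hits = sum(any(kw in resp for kw in attack_keywords) for resp in responses)
    return hits / len(responses)


if __name__ == "__main__":
    contexts = inject_attack_contexts(
        golden=["g1", "g2", "g3"], attack=["a1", "a2"], ratio=0.5, k=4
    )
    print(contexts)
    print(attack_success_rate(["buy X now", "the answer is 42"], ["buy X"]))
```

The same mixed context list could be handed to a retriever, a filter, or a generator, depending on which stage of the pipeline is being probed.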

Paper Structure

This paper contains 35 sections, 10 equations, 30 figures, 4 tables.

Figures (30)

  • Figure 1: Motivation. The attack tasks considered by existing RAG benchmarks fail to bypass the RAG components, which hinders accurate RAG security evaluation. Our SafeRAG introduces enhanced attack tasks to evaluate the potential vulnerabilities of RAG.
  • Figure 2: The process of generating attack texts. To construct the SafeRAG dataset covering Noise, Conflict, Toxicity, and DoS, we first collected a batch of news articles and built a comprehensive question-contexts dataset as the base dataset. We then selected attack-targeted texts from the base dataset to generate the attack texts.
  • Figure 3: Cases of forming conflict contexts.
  • Figure 4: The construction rules of White DoS. Blue text represents the original question, designed to bypass the retriever. Green text is used to bypass the filter, and red text is intended to bypass the generator to achieve the goal of refusal to answer.
  • Figure 5: Experimental results of injecting different noise ratios into the text accessible within the RAG pipeline.
  • ...and 25 more figures