RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models

Bang An, Shiyue Zhang, Mark Dredze

TL;DR

This work demonstrates that Retrieval-Augmented Generation (RAG) can degrade safety, making some otherwise safe LLMs produce unsafe outputs when grounded on retrieved documents. It identifies three interacting factors (the LLM's inherent safety, the safety of the retrieved documents, and the model's RAG task capability) that together shape the safety of RAG-based systems, showing that a safe base model does not guarantee safe RAG behavior. Red-teaming methods designed for non-RAG models transfer only partially to RAG contexts, so dedicated RAG-specific approaches are needed to identify vulnerabilities robustly. The findings highlight the need for safety research and tooling tailored to RAG environments, including training, evaluation, and red-teaming workflows for dynamic corpora and retrieval pathways, to ensure safer deployment of RAG LLMs.

Abstract

Efforts to ensure the safety of large language models (LLMs) include safety fine-tuning, evaluation, and red teaming. However, despite the widespread use of the Retrieval-Augmented Generation (RAG) framework, AI safety work focuses on standard LLMs, which means we know little about how RAG use cases change a model's safety profile. We conduct a detailed comparative analysis of RAG and non-RAG frameworks with eleven LLMs. We find that RAG can make models less safe and change their safety profile. We explore the causes of this change and find that even combinations of safe models with safe documents can cause unsafe generations. In addition, we evaluate some existing red teaming methods for RAG settings and show that they are less effective than when used for non-RAG settings. Our work highlights the need for safety research and red-teaming methods specifically tailored for RAG LLMs.

Paper Structure

This paper contains 38 sections, 19 figures, 6 tables.

Figures (19)

  • Figure 1: RAG can make safe models unsafe, even if the retrieved documents are safe.
  • Figure 2: Safety of LLMs in non-RAG vs. RAG settings. Most LLMs in the RAG setting exhibit a significantly higher percentage of unsafe responses.
  • Figure 3: The change of risk profile from non-RAG to RAG is model-dependent.
  • Figure 4: Risk profile of Llama-3-8B. It is vulnerable in 7 categories in a non-RAG setting, but is vulnerable in all 16 categories in RAG, with an increase in risk across all categories.
  • Figure 5: RAG is unsafe at points where non-RAG is unsafe, and more.
  • ...and 14 more figures