RAG Makes Guardrails Unsafe? Investigating Robustness of Guardrails under RAG-style Contexts

Yining She, Daniel W. Peterson, Marianne Menglin Liu, Vikas Upadhyay, Mohammad Hossein Chaghazardi, Eunsuk Kang, Dan Roth

TL;DR

The paper investigates how external LLM-based guardrails fail under Retrieval-Augmented Generation (RAG) contexts. It introduces Flip Rate as a scalable metric for quantifying the robustness of input and output guardrails when queries are augmented with retrieved documents. Through a systematic study of three Llama Guard models and two GPT-oss models across 6,795 harmful queries and 54k+ responses, the authors show that RAG-style context changes guardrail judgments, with average flip rates of about 10.9% for input guardrails and 8.4% for output guardrails, and that the effect depends on document relevance, query safety, and the generating model. General LLM enhancements (reasoning, prompting) provide only incremental improvements, highlighting the need for RAG-specific guardrail design and evaluation to ensure safe real-world deployments.

Abstract

With the increasing adoption of large language models (LLMs), ensuring the safety of LLM systems has become a pressing concern. External LLM-based guardrail models have emerged as a popular solution to screen unsafe inputs and outputs, but they are themselves fine-tuned or prompt-engineered LLMs that are vulnerable to data distribution shifts. In this paper, taking Retrieval-Augmented Generation (RAG) as a case study, we investigated how robust LLM-based guardrails are against additional information embedded in the context. Through a systematic evaluation of three Llama Guard models and two GPT-oss models, we confirmed that inserting benign documents into the guardrail context alters the judgments of input and output guardrails in around 11% and 8% of cases, respectively, making them unreliable. We separately analyzed the effect of each component in the augmented context: retrieved documents, user query, and LLM-generated response. The two mitigation methods we tested bring only minor improvements. These results expose a context-robustness gap in current guardrails and motivate training and evaluation protocols that are robust to retrieval and query composition.
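The share of altered judgments reported above can be illustrated with a small computation. The sketch below assumes that Flip Rate is the fraction of items whose guardrail label differs between the plain query and the same query wrapped in RAG-style context. The text here does not give the formal definition, so this reading, the function name `flip_rate`, and the label strings are all hypothetical.

```python
from typing import Sequence


def flip_rate(baseline: Sequence[str], augmented: Sequence[str]) -> float:
    """Fraction of items whose guardrail label changes under RAG-style context.

    Assumed definition (not taken from the paper): a "flip" is any item whose
    label without retrieved documents differs from its label with them.
    """
    if len(baseline) != len(augmented):
        raise ValueError("label lists must be aligned item by item")
    if not baseline:
        raise ValueError("need at least one item")
    flips = sum(b != a for b, a in zip(baseline, augmented))
    return flips / len(baseline)


# Hypothetical labels for four queries; these are not data from the paper.
normal_labels = ["unsafe", "unsafe", "safe", "unsafe"]
rag_labels = ["unsafe", "safe", "safe", "unsafe"]

rate = flip_rate(normal_labels, rag_labels)
print(f"Flip rate: {rate:.2%}")
```

A measure of this kind needs no ground-truth safety labels, only paired guardrail outputs, which is one plausible reason it scales to large query sets.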

Paper Structure

This paper contains 32 sections, 6 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Illustration of guardrails giving different judgments to the same user query/response when receiving a RAG-style query.
  • Figure 2: Evaluation results of RQ1. 'Normal' means results on queries without RAG augmentation.
  • Figure 3: RQ2 results for the number of documents. In (a), $k=0$ shows the false negative rates (FNRs) of non-RAG queries.
  • Figure 4: Evaluation results of RQ2 regarding document relevance. Random-RAG bars show the mean and standard deviation over 5 Random-RAG contexts.
  • Figure 5: RQ2 results for safe queries. Panel (a) reports the False Positive Rate because the queries are safe.
  • ...and 5 more figures