Rescuing the Unpoisoned: Efficient Defense against Knowledge Corruption Attacks on RAG Systems

Minseok Kim, Hankook Lee, Hyungjoon Koo

TL;DR

Retrieval-Augmented Generation (RAG) systems enable up-to-date and grounded responses but are vulnerable to knowledge corruption attacks that inject adversarial passages into external knowledge sources. The authors propose RAGDefender, a post-retrieval, resource-efficient defense that filters adversarial content via a two-stage process: grouping retrieved passages to estimate the number of adversarial items, then identifying and removing them before generation. Across three QA datasets, multiple retrievers, and diverse generators, RAGDefender achieves substantially lower attack success rates (ASR) and competitive accuracy, while incurring minimal computational overhead compared with state-of-the-art defenses. This work offers a practical, scalable solution for securing real-world RAG deployments without additional model training or heavy inference, advancing the reliability of AI-assisted information retrieval and generation.
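
The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the Jaccard token similarity, the 0.6 threshold, the "tightest group" heuristic, and the names `estimate_adversarial_count` and `filter_passages` are all assumptions chosen to keep the example self-contained. It relies on the further assumption that adversarial passages written toward the same target answer resemble each other more than benign passages do.

```python
import re


def tokens(text):
    """Lowercased word set; a stand-in for a real embedding (assumption)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_adversarial_count(sim, threshold):
    """Stage 1 (hypothetical heuristic): link passages whose similarity
    reaches the threshold, then return the size of the tightest group."""
    n = len(sim)
    seen, best_size, best_tightness = set(), 0, -1.0
    for start in range(n):
        if start in seen:
            continue
        group, stack = [], [start]
        seen.add(start)
        while stack:  # depth-first walk over similarity links
            i = stack.pop()
            group.append(i)
            for j in range(n):
                if j not in seen and sim[i][j] >= threshold:
                    seen.add(j)
                    stack.append(j)
        if len(group) < 2:
            continue  # singletons are not treated as suspicious
        pairs = [sim[i][j] for i in group for j in group if i < j]
        tightness = sum(pairs) / len(pairs)
        if tightness > best_tightness:
            best_size, best_tightness = len(group), tightness
    return best_size


def filter_passages(passages, threshold=0.6):
    """Stage 2 (hypothetical heuristic): rank passages by how many close
    neighbours they have and drop the top-k, k coming from stage 1."""
    sets = [tokens(p) for p in passages]
    n = len(passages)
    sim = [[jaccard(sets[i], sets[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    k = estimate_adversarial_count(sim, threshold)

    def suspicion(i):
        degree = sum(1 for j in range(n) if j != i and sim[i][j] >= threshold)
        mean_sim = sum(sim[i]) / max(n - 1, 1)
        return (degree, mean_sim)

    ranked = sorted(range(n), key=suspicion, reverse=True)
    removed = set(ranked[:k])
    return [p for i, p in enumerate(passages) if i not in removed]


retrieved = [
    "Official records say the Eiffel Tower was finished in 1925.",
    "Paris hosts the Eiffel Tower, finished in 1889.",
    "Official records say the Eiffel Tower was finished in 1925 indeed.",
    "Gustave Eiffel's firm engineered the wrought-iron structure.",
    "Indeed, official records say the Eiffel Tower was finished in 1925.",
]
for passage in filter_passages(retrieved):
    print(passage)
```

In this toy input the three near-duplicate passages form the only tight group, so the estimate is three and those three are dropped before the remaining passages would be handed to a generator. No model inference is involved, which is the property the summary emphasizes; a realistic version would swap in proper embeddings and a tuned grouping rule.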

Abstract

Large language models (LLMs) are reshaping numerous facets of our daily lives, leading to their widespread adoption as web-based services. Despite their versatility, LLMs face notable challenges, such as generating hallucinated content and lacking access to up-to-date information. To address such limitations, Retrieval-Augmented Generation (RAG) has recently emerged as a promising direction by generating responses grounded in external knowledge sources. A typical RAG system consists of i) a retriever that probes a group of relevant passages from a knowledge base and ii) a generator that formulates a response based on the retrieved content. However, as with other AI systems, recent studies demonstrate that RAG is vulnerable to attacks such as knowledge corruption, which injects misleading information. In response, several defense strategies have been proposed, including having LLMs inspect the retrieved passages individually or fine-tuning robust retrievers. While effective, such approaches often come with substantial computational costs. In this work, we introduce RAGDefender, a resource-efficient defense mechanism against knowledge corruption (i.e., data poisoning) attacks in practical RAG deployments. RAGDefender operates during the post-retrieval phase, leveraging lightweight machine learning techniques to detect and filter out adversarial content without requiring additional model training or inference. Our empirical evaluations show that RAGDefender consistently outperforms existing state-of-the-art defenses across multiple models and adversarial scenarios: e.g., RAGDefender reduces the attack success rate (ASR) against the Gemini model from 0.89 to as low as 0.02, compared to 0.69 for RobustRAG and 0.24 for Discern-and-Answer when adversarial passages outnumber legitimate ones by a factor of four (4x).
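
To make the reported numbers concrete, the following Python sketch shows one common way an attack success rate could be computed and how the quoted figures compare. The substring-match criterion and the helper names `attack_success_rate`, `perturbation_ratio`, and `relative_reduction` are assumptions for illustration; the paper's exact evaluation protocol is not given in this excerpt. Only the four ASR values and the 4x ratio come from the abstract.

```python
def attack_success_rate(responses, target_answers):
    """Fraction of responses containing the attacker's target answer.
    Substring matching is an assumed criterion, not the paper's protocol."""
    if not responses:
        return 0.0
    hits = sum(
        target.lower() in response.lower()
        for response, target in zip(responses, target_answers)
    )
    return hits / len(responses)


def perturbation_ratio(num_adversarial, num_legitimate):
    """Adversarial passages per legitimate passage (4.0 means '4x')."""
    return num_adversarial / num_legitimate


def relative_reduction(asr_before, asr_after):
    """Share of the starting ASR that a defense removes."""
    return (asr_before - asr_after) / asr_before


# ASR values quoted in the abstract (Gemini generator, 4x setting).
starting_asr = 0.89
defended_asr = {
    "RobustRAG": 0.69,
    "Discern-and-Answer": 0.24,
    "RAGDefender": 0.02,
}

print("ratio:", perturbation_ratio(4, 1))
for name, asr in defended_asr.items():
    print(f"{name}: ASR {asr:.2f}, "
          f"relative reduction {relative_reduction(starting_asr, asr):.1%}")
```

Under this reading, going from 0.89 to 0.02 removes roughly 98% of the starting attack success, against roughly 22% for RobustRAG and 73% for Discern-and-Answer.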

Paper Structure

This paper contains 21 sections, 7 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Overview of a RAG system (gray area) and potential attack surfaces. With a user's query, a retriever probes documents from an external knowledge base, returning a set of relevant passages. Then, the retriever forms a prompt with those passages, and a generator generates a proper response based on the prompt. The dotted boxes indicate potential attack surfaces: data poisoning by compromising a database or external resource, retrieval poisoning by exploiting the retriever to trigger a backdoor, and prompt manipulation by altering input prompts to mislead the generator. This work focuses on knowledge corruption attacks (data poisoning), safeguarding a post-retrieval set.
  • Figure 2: Overview of RAGDefender. To defend against knowledge poisoning attacks, RAGDefender first classifies retrieved passages into benign and potentially adversarial groups, and then identifies adversarial passages. Finally, the generator creates a response with the filtered (i.e., legitimate) passages.
  • Figure 3: Example of inaccurate passage grouping at the first stage, where adversarial passage(s) are combined with golden passage(s). In this mis-partitioning case (left), the second stage (e.g., based on high semantic relationships) assists in separating adversarial passage(s) from benign one(s), yielding the desirable grouping (right).
  • Figure 4: Comparison of attack success rates (ASRs) and accuracy across a baseline (i.e., no defense) RAG, RobustRAG, Discern-and-Answer, and RAGDefender (ours) on different knowledge corruption attacks (PoisonedRAG, GARAG, Tan et al.) under $1 \times$, $2 \times$, $4 \times$, and $6 \times$ perturbation ratios. We employ GPT-4o as a generator, while Discern-and-Answer adopts FiD. Bars and lines represent ASR and accuracy, respectively, on the same scale. Lower ASR and higher accuracy imply a better defense. Note that RAGDefender outperforms RobustRAG and Discern-and-Answer by wide margins in every setting.
  • Figure 5: Comparison of attack success rates (ASRs) and accuracy across a baseline (i.e., no defense) RAG, RobustRAG, and RAGDefender (ours) on diverse generators (LLaMA, Gemini, GPT-4o, Vicuna) and different adversarial passage ratios on three datasets (NQ, HotpotQA, MS MARCO) under PoisonedRAG. Bars and lines represent ASR and accuracy, respectively, on the same scale. Lower ASR and higher accuracy indicate a more robust defense. Note that RAGDefender outperforms RobustRAG by wide margins in every setting.
  • ...and 2 more figures