Data Extraction Attacks in Retrieval-Augmented Generation via Backdoors

Yuefeng Peng, Junda Wang, Hong Yu, Amir Houmansadr

TL;DR

The paper exposes a privacy vulnerability in Retrieval-Augmented Generation (RAG): backdoor-based data extraction attacks implanted through poisoned fine-tuning data. It shows that existing prompt-injection attacks fail on LLMs that are less responsive to malicious instructions, whereas a carefully crafted backdoor, once triggered, leaks knowledge-base content at a consistently high rate, either verbatim or paraphrased, with minimal impact on normal task performance. Leakage rates remain strong across multiple LLMs, datasets, and knowledge bases, and the backdoor can even strengthen baseline prompt-injection methods. These findings underscore the need for defenses against hidden backdoors and data exfiltration across the model supply chain of RAG systems.

Abstract

Despite significant advancements, large language models (LLMs) still struggle with providing accurate answers when lacking domain-specific or up-to-date knowledge. Retrieval-Augmented Generation (RAG) addresses this limitation by incorporating external knowledge bases, but it also introduces new attack surfaces. In this paper, we investigate data extraction attacks targeting RAG's knowledge databases. We show that previous prompt injection-based extraction attacks largely rely on the instruction-following capabilities of LLMs. As a result, they fail on models that are less responsive to such malicious prompts; for example, our experiments show that state-of-the-art attacks achieve near-zero success on Gemma-2B-IT. Moreover, even for models that can follow these instructions, we found fine-tuning may significantly reduce attack performance. To further reveal the vulnerability, we propose to backdoor RAG, where a small portion of poisoned data is injected during the fine-tuning phase to create a backdoor within the LLM. When this compromised LLM is integrated into a RAG system, attackers can exploit specific triggers in prompts to manipulate the LLM to leak documents from the retrieval database. By carefully designing the poisoned data, we achieve both verbatim and paraphrased document extraction. For example, on Gemma-2B-IT, we show that with only 5% poisoned data, our method achieves an average success rate of 94.1% for verbatim extraction (ROUGE-L score: 82.1) and 63.6% for paraphrased extraction (average ROUGE score: 66.4) across four datasets. These results underscore the privacy risks associated with the supply chain when deploying RAG systems.
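
The abstract reports extraction quality as an attack success rate (ASR) together with a ROUGE-L score. As a rough illustration of what such numbers measure, the sketch below computes a ROUGE-L F1 from the longest common subsequence of whitespace tokens and a threshold-based success rate. It is not the authors' evaluation code: the tokenization, the helper names (lcs_length, rouge_l_f1, attack_success_rate), the 0-1 score scale, and the 0.5 success threshold are all assumptions made for this example, and the paper's own success criterion may differ.

```python
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (assumed tokenization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def attack_success_rate(
    pairs: List[Tuple[str, str]], threshold: float = 0.5
) -> float:
    """Fraction of (document, model output) pairs whose ROUGE-L F1 meets
    the threshold. The threshold is a hypothetical choice for this sketch."""
    if not pairs:
        return 0.0
    hits = sum(rouge_l_f1(doc, out) >= threshold for doc, out in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    document = "the patient was prescribed ten milligrams daily"
    leaked = "the patient was prescribed ten milligrams daily"
    refused = "sorry, I cannot share that"
    print(rouge_l_f1(document, leaked))
    print(attack_success_rate([(document, leaked), (document, refused)]))
```

A verbatim leak scores close to 1 because the output reproduces the retrieved document token for token, while a paraphrase shares fewer tokens in order and scores lower, which fits the abstract's separate reporting of verbatim and paraphrased extraction.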

Paper Structure

This paper contains 36 sections, 6 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Overview of our backdoor-based extraction attack on RAG systems.
  • Figure 2: ASR and ROUGE scores of paraphrased extraction across four datasets on Gemma-2B-IT.
  • Figure 3: Examples of poison data points used in verbatim extraction and paraphrased extraction respectively.
  • Figure 4: ASR and ROUGE scores of verbatim extraction attacks against RAG systems using LLMs jointly fine-tuned with documents.