Backdoored Retrievers for Prompt Injection Attacks on Retrieval Augmented Generation of Large Language Models

Cody Clop, Yannick Teglia

TL;DR

This paper investigates prompt injection attacks on RAG, focusing on malicious objectives beyond misinformation, such as inserting harmful links, promoting unauthorized services, and initiating denial-of-service behaviors.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in generating coherent text but remain limited by the static nature of their training data. Retrieval Augmented Generation (RAG) addresses this issue by combining LLMs with up-to-date information retrieval, but also expands the attack surface of the system. This paper investigates prompt injection attacks on RAG, focusing on malicious objectives beyond misinformation, such as inserting harmful links, promoting unauthorized services, and initiating denial-of-service behaviors. We build upon existing corpus poisoning techniques and propose a novel backdoor attack aimed at the fine-tuning process of the dense retriever component. Our experiments reveal that corpus poisoning can achieve significant attack success rates through the injection of a small number of compromised documents into the retriever corpus. In contrast, backdoor attacks demonstrate even higher success rates but necessitate a more complex setup, as the victim must fine-tune the retriever using the attacker's poisoned dataset.

Paper Structure

This paper contains 16 sections, 1 equation, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Hallucination of the LLM. The user's query is converted into a prompt (1) and processed by the LLM (2). When the LLM encounters gaps in its knowledge, it generates a response that seems convincing but is factually inaccurate or misleading (3). This incorrect information is then delivered to the user as the final answer (4).
  • Figure 2: Outdated knowledge of the LLM. The user's query is converted into a prompt (1) and processed by the LLM (2). The LLM generates an answer based on the data it was trained on, which has become outdated over time (3). Although the response would have been correct before 2022, an incorrect answer is delivered to the user (4).
  • Figure 3: Retrieval Augmented Generation. The user's query is first used by the retriever (1) to find relevant documents. These documents are then combined with the query to form an augmented prompt (2). This prompt is fed into the LLM (3), which generates an answer based on the retrieved content (4). Since the LLM's answer is grounded in recent and factual information, the final response provided to the user (5) is accurate and up-to-date.
  • Figure 4: Heatmaps of Attack Success Rate (ASR) across Llama-3, Vicuna, and Mistral, averaged over tasks and datasets. Each cell represents an ASR for a given injection position and directive strength, based on 100 queries per dataset. Detailed results by attack objective are available in the appendix.
  • Figure 5: Results of LLM vulnerability on the three evaluated objectives.
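
The retrieval and prompt-building steps described in the Figure 3 caption (steps 1 and 2) can be sketched as follows. This is a minimal illustration and not the paper's implementation: the paper targets a dense retriever, whereas this sketch substitutes a simple bag-of-words cosine similarity so that it runs with the standard library alone. The function names (`tokenize`, `cosine`, `retrieve`, `build_prompt`), the toy corpus, and the prompt template are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split into alphanumeric tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counter objects).
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, corpus, k=2):
    # Step (1): rank corpus documents by similarity to the query, keep the top k.
    # Bag-of-words stand-in for a dense retriever (an assumption of this sketch).
    q = Counter(tokenize(query))
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(q, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, documents):
    # Step (2): combine the retrieved documents with the query.
    # The template is hypothetical; real systems use their own format.
    context = "\n".join(f"- {doc}" for doc in documents)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Toy corpus, invented for illustration only.
corpus = [
    "The Eiffel Tower is located in Paris, France.",
    "Mount Everest is the highest mountain on Earth.",
    "The Pacific Ocean is the largest ocean.",
]

query = "Where is the Eiffel Tower located?"
top_docs = retrieve(query, corpus, k=1)
prompt = build_prompt(query, top_docs)
print(prompt)
```

Because every retrieved document is copied into the prompt verbatim, whatever text sits in the corpus reaches the LLM, which is why the abstract describes RAG as expanding the system's attack surface.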