RIPRAG: Hack a Black-box Retrieval-Augmented Generation Question-Answering System with Reinforcement Learning

Meng Xi, Sihan Lv, Yechen Jin, Guanjie Cheng, Naibo Wang, Ying Li, Jianwei Yin

TL;DR

RIPRAG addresses the vulnerability of Retrieval-Augmented Generation (RAG) systems to poisoning attacks in a realistic black-box setting. It introduces Reinforcement Learning from Black-box Feedback (RLBF) and Batch Relative Policy Optimization (BRPO) to optimize poisoned documents using only binary success signals from the target RAG system, coupled with a dense similarity reward. The framework demonstrates superior attack success rates across diverse datasets, LLMs, and retrieval configurations, including under defenses like RobustRAG, and remains effective at low poisoning rates. These results highlight critical security gaps in current RAG defenses and provide a rigorous methodology for evaluating and strengthening the resilience of complex RAG architectures.

Abstract

Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become a core technology for tasks such as question-answering (QA) and content generation. However, by injecting poisoned documents into the database of RAG systems, attackers can manipulate LLMs to generate text that aligns with their intended preferences. Existing research has primarily focused on white-box attacks against simplified RAG architectures. In this paper, we investigate a more complex and realistic scenario: the attacker lacks knowledge of the RAG system's internal composition and implementation details, and the RAG system comprises components beyond a mere retriever. Specifically, we propose the RIPRAG attack framework, an end-to-end attack pipeline that treats the target RAG system as a black box, where the only information accessible to the attacker is whether the poisoning succeeds. Our method leverages Reinforcement Learning (RL) to optimize the generation model for poisoned documents, ensuring that the generated poisoned document aligns with the target RAG system's preferences. Experimental results demonstrate that this method can effectively execute poisoning attacks against most complex RAG systems, achieving an attack success rate (ASR) improvement of up to 0.72 compared to baseline methods. This highlights prevalent deficiencies in current defensive methods and provides critical insights for LLM security research.

Paper Structure

This paper contains 23 sections, 4 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: An overview of the proposed RIPRAG framework. RIPRAG adopts an end-to-end RL strategy. First, the poisoning small language model (SLM) generates a group of poisoned documents. Next, the attacker injects these documents into the database of the target RAG system. Finally, the attacker chats with the QA system to check whether the attack succeeded, and uses that result together with the BM25 score between the question and the poisoned documents to optimize the poisoning SLM with RLBF.
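
The reward signal described in the Figure 1 caption can be illustrated with a small sketch. It combines a binary success flag per poisoned document with a BM25 similarity to the question, then normalizes rewards within the group. This is not the paper's implementation: the function names, the `sim_weight` mixing factor, the max-normalization of BM25 scores, and the mean/standard-deviation normalization are all assumptions made for illustration, and the exact RLBF and BRPO objectives are not given in this section. In a real attack the success flags would come from querying the target system; here they are passed in directly.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (an assumption; any tokenizer could be used)."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document against the query.

    IDF statistics are computed over `docs` only, which is a simplification.
    """
    if not docs:
        return []
    q_terms = tokenize(query)
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in q_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def combined_rewards(question, docs, successes, sim_weight=0.5):
    """Sparse binary success reward plus a dense, max-normalized BM25 term.

    `sim_weight` is a hypothetical mixing factor, not a value from the paper.
    """
    sims = bm25_scores(question, docs)
    max_sim = max(sims, default=0.0) or 1.0  # avoid dividing by zero
    return [float(ok) + sim_weight * s / max_sim for ok, s in zip(successes, sims)]


def relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of generated documents."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]
```

The dense BM25 term gives failed documents a graded signal, so a group in which every attack fails still produces non-zero advantages, which is the motivation the TL;DR gives for pairing the binary signal with a similarity reward.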