RIPRAG: Hack a Black-box Retrieval-Augmented Generation Question-Answering System with Reinforcement Learning
Meng Xi, Sihan Lv, Yechen Jin, Guanjie Cheng, Naibo Wang, Ying Li, Jianwei Yin
TL;DR
RIPRAG addresses the vulnerability of Retrieval-Augmented Generation (RAG) systems to poisoning attacks in a realistic black-box setting. It introduces Reinforcement Learning from Black-box Feedback (RLBF) and Batch Relative Policy Optimization (BRPO) to optimize poisoned documents using only binary success signals from the target RAG system, coupled with a dense similarity reward. The framework demonstrates superior attack success rates across diverse datasets, LLMs, and retrieval configurations, including under defenses like RobustRAG, and remains effective at low poisoning rates. These results highlight critical security gaps in current RAG defenses and provide a rigorous methodology for evaluating and strengthening the resilience of complex RAG architectures.
Abstract
Retrieval-Augmented Generation (RAG) systems based on Large Language Models (LLMs) have become a core technology for tasks such as question-answering (QA) and content generation. However, by injecting poisoned documents into a RAG system's database, attackers can manipulate the LLM into generating text that aligns with their intended preferences. Existing research has primarily focused on white-box attacks against simplified RAG architectures. In this paper, we investigate a more complex and realistic scenario: the attacker has no knowledge of the RAG system's internal composition or implementation details, and the RAG system comprises components beyond a mere retriever. Specifically, we propose the RIPRAG attack framework, an end-to-end attack pipeline that treats the target RAG system as a black box, where the only information accessible to the attacker is whether the poisoning succeeds. Our method leverages Reinforcement Learning (RL) to optimize the model that generates poisoned documents, so that the documents it produces align with the target RAG system's preferences. Experimental results demonstrate that this method can effectively execute poisoning attacks against most complex RAG systems, achieving an attack success rate (ASR) improvement of up to 0.72 compared to baseline methods. This highlights prevalent deficiencies in current defensive methods and provides critical insights for LLM security research.
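To make the reward signal concrete, the following is a minimal sketch of how a binary black-box success signal and a dense similarity score might be combined and then normalized within a batch. It is an illustration only: the abstract does not give the BRPO formulation, so the additive reward, the `sim_weight` parameter, the mean/standard-deviation normalization, and both function names are assumptions made here, not the authors' definitions.

```python
from statistics import mean, pstdev


def combined_rewards(successes, similarities, sim_weight=0.5):
    """Combine a binary success flag with a dense similarity score.

    Assumption: the two signals are simply added, with the similarity
    term scaled by a hypothetical weight `sim_weight`.
    """
    return [float(s) + sim_weight * sim for s, sim in zip(successes, similarities)]


def batch_relative_advantages(rewards, eps=1e-8):
    """Score each candidate relative to the rest of its batch.

    Assumption: "batch relative" is read here as subtracting the batch
    mean and dividing by the batch standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical poisoned-document candidates for one target question.
successes = [1, 0, 0, 1]             # did the black-box RAG system output the target answer?
similarities = [0.8, 0.6, 0.2, 0.4]  # e.g. query-document similarity in [0, 1]

rewards = combined_rewards(successes, similarities)
advantages = batch_relative_advantages(rewards)
```

Under these assumptions, candidates that succeed receive above-average advantages, and the similarity term separates candidates that share the same binary outcome, which is one way a sparse success signal could be made denser for policy optimization.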
