CtrlRAG: Black-box Document Poisoning Attacks for Retrieval-Augmented Generation of Large Language Models

Runqi Sui

TL;DR

The paper identifies a security vulnerability in Retrieval-Augmented Generation (RAG) that arises from displaying reference contexts, and proposes CtrlRAG, a two-stage black-box attack that injects malicious documents and iteratively optimizes them using reference-context feedback and MLM-based perturbations. Targeting two attack goals, Hallucination Amplification and Emotion Manipulation, the attack reaches attack success rates (ASR) of up to 90% across multiple datasets, retrievers, and commercial LLMs with only five injected documents per target question. To mitigate these risks, the authors introduce DPM-Conf Knowledge Expansion, a dynamic defense based on a confrontation between parametric and non-parametric memory, which blocks 78% of attacks while preserving 95.5% system accuracy. Together, the attack and the defense expose critical vulnerabilities in RAG systems and offer practical guidance for safer deployment of retrieval-augmented generation.

Abstract

Retrieval-Augmented Generation (RAG) systems enhance response credibility and traceability by displaying reference contexts, but this transparency simultaneously introduces a novel black-box attack vector. Existing document poisoning attacks, where adversaries inject malicious documents into the knowledge base to manipulate RAG outputs, rely primarily on unrealistic white-box or gray-box assumptions, limiting their practical applicability. To address this gap, we propose CtrlRAG, a two-stage black-box attack that (1) constructs malicious documents containing misinformation or emotion-inducing content and injects them into the knowledge base, and (2) iteratively optimizes them using a localization algorithm and a Masked Language Model (MLM) guided by reference-context feedback, ensuring their retrieval priority while preserving linguistic naturalness. With only five malicious documents per target question injected into the million-document MS MARCO dataset, CtrlRAG achieves attack success rates of up to 90% on commercial LLMs (e.g., GPT-4o), a 30% improvement over optimal baselines, in both *Emotion Manipulation* and *Hallucination Amplification* tasks. Furthermore, we show that existing defenses fail to balance security and performance. To mitigate this challenge, we introduce a dynamic *Knowledge Expansion* defense strategy based on *Parametric/Non-parametric Memory Confrontation*, blocking 78% of attacks while maintaining 95.5% system accuracy. Our findings reveal critical vulnerabilities in RAG systems and provide effective defense strategies.
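The "reference context feedback" mentioned in the abstract can be illustrated with a toy retriever. The sketch below is not the authors' implementation: it assumes a bag-of-words cosine similarity in place of a dense retriever such as Contriever, and the names `embed`, `cosine`, `reference_context`, and `rank_in_context` are hypothetical helpers introduced only for illustration. It shows the single signal a black-box observer has: whether, and at which rank, a given document appears in the displayed top-k context.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumed stand-in for a dense retriever)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def reference_context(question, corpus, k=2):
    """Return the top-k documents, i.e. the context a RAG system would display."""
    q = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def rank_in_context(doc, context):
    """1-based rank of doc in the displayed context, or None if it is not shown."""
    return context.index(doc) + 1 if doc in context else None


question = "what is the capital of france"
corpus = [
    "the capital of france is paris",
    "france is a country in western europe",
    "bananas are rich in potassium",
]

# Two phrasings of the same candidate document; only the observed rank differs.
candidate_v1 = "a city guide with travel tips"
candidate_v2 = "what is the capital of france a city guide"

rank_v1 = rank_in_context(candidate_v1, reference_context(question, corpus + [candidate_v1]))
rank_v2 = rank_in_context(candidate_v2, reference_context(question, corpus + [candidate_v2]))
print(rank_v1, rank_v2)
```

In this toy setup the first phrasing never reaches the displayed context, while the second one does. An observed rank change of this kind is the feedback that, per the abstract, guides the iterative optimization; a real system would use a learned retriever and a far larger corpus.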

Paper Structure

This paper contains 54 sections, 8 equations, 18 figures, 12 tables, 1 algorithm.

Figures (18)

  • Figure 1: Examples of CtrlRAG against two attack goals. Through optimization mechanisms, the similarity between the malicious document and the question, and the ranking of the malicious document in the reference context, both rise, thus ultimately affecting the system responses. This RAG system consists of Contriever and GPT-4o.
  • Figure 2: An illustration of the proposed CtrlRAG.
  • Figure 3: Illustration of Feedback-Driven Substitution Localization in the black-box settings. "Musk" and "as" are substitutable.
  • Figure 4: BERT-based contextual perturbations example.
  • Figure 5: Illustration of DPM-Conf knowledge expansion.
  • ...and 13 more figures