Semantic Chameleon: Corpus-Dependent Poisoning Attacks and Defenses in RAG Systems

Scott Thornton

Abstract

Retrieval-Augmented Generation (RAG) systems extend large language models (LLMs) with external knowledge sources but introduce new attack surfaces through the retrieval pipeline. In particular, adversaries can poison retrieval corpora so that malicious documents are preferentially retrieved at inference time, enabling targeted manipulation of model outputs. We study gradient-guided corpus poisoning attacks against modern RAG pipelines and evaluate retrieval-layer defenses that require no modification to the underlying LLM. We implement dual-document poisoning attacks consisting of a sleeper document and a trigger document optimized using Greedy Coordinate Gradient (GCG). In a large-scale evaluation on the Security Stack Exchange corpus (67,941 documents) with 50 attack attempts, gradient-guided poisoning achieves a 38.0 percent co-retrieval rate under pure vector retrieval. We show that a simple architectural modification, hybrid retrieval combining BM25 and vector similarity, substantially mitigates this attack. Across all 50 attacks, hybrid retrieval reduces gradient-guided attack success from 38 percent to 0 percent without modifying the model or retraining the retriever. When attackers jointly optimize payloads for both sparse and dense retrieval signals, they can partially circumvent hybrid retrieval, achieving 20-44 percent success; even so, hybrid retrieval significantly raises attack difficulty relative to vector-only retrieval. Evaluation across five LLM families (GPT-5.3, GPT-4o, Claude Sonnet 4.6, Llama 4, and GPT-4o-mini) shows attack success ranging from 46.7 percent to 93.3 percent. Cross-corpus evaluation on the FEVER Wikipedia dataset (25 attacks) yields 0 percent attack success across all retrieval configurations.
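
To make the hybrid retrieval idea concrete, the following is a minimal sketch of score fusion between BM25 and vector similarity. It is not the paper's implementation: the linear interpolation, the min-max normalization, the function names, and the toy documents and embeddings are all assumptions made for illustration. The only convention taken from the source is that $\alpha$ = 1.0 corresponds to pure vector retrieval.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(s)
    return scores

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def min_max(xs):
    """Rescale scores to [0, 1] so sparse and dense signals are comparable."""
    lo, hi = min(xs), max(xs)
    return [0.0 if hi == lo else (x - lo) / (hi - lo) for x in xs]

def hybrid_rank(query_tokens, query_vec, doc_tokens, doc_vecs, alpha):
    """Rank documents by alpha * dense + (1 - alpha) * sparse (assumed fusion rule)."""
    sparse = min_max(bm25_scores(query_tokens, doc_tokens))
    dense = min_max([cosine(query_vec, v) for v in doc_vecs])
    fused = [alpha * d + (1 - alpha) * s for d, s in zip(dense, sparse)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)

# Hypothetical toy data: document 1 stands in for a poisoned document whose
# embedding sits close to the query but which shares no query terms.
query_tokens = ["reset", "admin", "password"]
query_vec = [1.0, 0.0]
doc_tokens = [
    ["how", "to", "reset", "admin", "password", "safely"],
    ["zx", "qv", "describing", "kk", "token", "soup"],
    ["configure", "firewall", "rules", "for", "ssh"],
]
doc_vecs = [[0.8, 0.6], [0.99, 0.14], [0.0, 1.0]]

vector_only = hybrid_rank(query_tokens, query_vec, doc_tokens, doc_vecs, alpha=1.0)
hybrid = hybrid_rank(query_tokens, query_vec, doc_tokens, doc_vecs, alpha=0.5)
print(vector_only, hybrid)
```

In this toy setup the embedding-optimized document ranks first under vector-only scoring but falls behind the lexically matching document once the BM25 term contributes. This illustrates why a payload tuned only against the dense retriever can lose rank under hybrid scoring; it says nothing about the magnitude of the effect on a real corpus.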

Paper Structure (35 sections, 4 equations, 2 figures, 11 tables)


Figures (2)

  • Figure 1: Comprehensive attack-defense analysis across corpora. (A) Attack effectiveness: Security SE enables stealth (66.7%) but low co-retrieval (44.4%) yields only 11.1% overall success; FEVER achieves 100% co-retrieval but 0% stealth. (B) Detection F1 scores: QPD provides the best cross-corpus signal; keyword anomaly excels on FEVER but fails on Security SE. (C, D) ROC curves show near-perfect detection on FEVER vs. near-random on Security SE for keyword and semantic methods, confirming the corpus-dependent detection gap.
  • Figure 2: Hybrid retrieval defense effectiveness. Pure vector retrieval ($\alpha$ = 1.0) shows 38% co-retrieval success; all hybrid configurations ($\alpha$ = 0.3, 0.5, 0.7) achieve 0%. The drop is statistically significant ($\chi^2$ = 21.05, $p < 10^{-6}$, Cohen's $h$ = 1.33).
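
The $\chi^2$ and Cohen's $h$ values quoted for Figure 2 can be reproduced under one reading of the data. With 50 attacks, a 38 percent success rate is 19 successes for vector-only retrieval, compared here against 0 of 50 for a single hybrid configuration. The sketch below assumes a 2x2 chi-square test with Yates' continuity correction; the caption does not name the test, so both the correction and the 50-vs-50 comparison are assumptions, as are the function names.

```python
import math

def cohens_h(p1, p2):
    """Cohen's h: difference of arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def yates_chi2(a, b, c, d):
    """Chi-square statistic with Yates' continuity correction for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

n_attacks = 50
vector_hits = 19   # 38 percent of 50 attacks (from the caption)
hybrid_hits = 0    # 0 percent of 50 attacks (assumed: one hybrid configuration)

h = cohens_h(vector_hits / n_attacks, hybrid_hits / n_attacks)
chi2 = yates_chi2(vector_hits, n_attacks - vector_hits,
                  hybrid_hits, n_attacks - hybrid_hits)
print(f"Cohen's h = {h:.2f}, chi-square = {chi2:.2f}")
```

Under these assumptions the two statistics round to the values given in the caption, which suggests the comparison was made per configuration at 50 attacks each.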