Bias Injection Attacks on RAG Databases and Sanitization Defenses

Hao Wu, Prateek Saxena

TL;DR

This work identifies bias injection as a covert threat in retrieval-augmented generation, where factually correct passages with biased viewpoints can skew retrieved context and LLM outputs. It formalizes the attack using similarity and polarization metrics, and demonstrates an automated workflow to generate adversarial passages that evade fingerprint-based defenses. The authors propose BiasDef, a post-retrieval, KL-divergence-based filter that operates in a 2D SS-PS space to detect and remove adversarial content without modifying the LLM. Empirical results across multiple LLMs and datasets show that BiasDef substantially reduces answer bias (over 6x) and preserves benign content (62% more benign passages retrieved) while maintaining retrieval performance. The work highlights the importance of viewpoint-aware retrieval and provides a practical defense that can be integrated with existing RAG systems.

Abstract

This paper explores attacks and defenses on vector databases in retrieval-augmented generation (RAG) systems. Prior work on knowledge poisoning attacks primarily injects false or toxic content, which fact-checking or linguistic analysis easily detects. We reveal a new and subtle threat: bias injection attacks, which insert factually correct yet semantically biased passages into the knowledge base to covertly influence the ideological framing of answers generated by large language models (LLMs). We demonstrate that these adversarial passages, though linguistically coherent and truthful, can systematically crowd out opposing views from the retrieved context and steer LLM answers toward the attacker's intended perspective. We precisely characterize this class of attacks and then develop a post-retrieval filtering defense, BiasDef. We construct a comprehensive benchmark based on public question answering datasets to evaluate both the attack and the defense. Our results show that: (1) the proposed attack induces significant perspective shifts in LLM answers, effectively evading existing retrieval-based sanitization defenses; and (2) BiasDef outperforms existing methods by reducing adversarial passages retrieved by 15%, which mitigates perspective shift in answers by 6.2×, while enabling the retrieval of 62% more benign passages.
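The crowding-out effect described in the abstract can be illustrated with a minimal sketch of top-k retrieval by cosine similarity. This is not the paper's attack workflow: the 2-D embeddings, the passage labels, and the helper names `cosine` and `top_k` are all hypothetical values chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, corpus, k):
    """Return the k (label, embedding) entries most similar to the query."""
    ranked = sorted(corpus, key=lambda item: cosine(query, item[1]), reverse=True)
    return ranked[:k]

# Toy 2-D embeddings (hypothetical, not taken from the paper).
query = [1.0, 0.0]
benign = [
    ("benign_pro", [0.9, 0.3]),
    ("benign_con", [0.9, -0.3]),
    ("benign_neutral", [0.8, 0.1]),
]
# Assumed adversarial passages: embedded very close to the query, so they
# outrank every benign passage once they are added to the corpus.
adversarial = [
    ("adv_1", [1.0, 0.05]),
    ("adv_2", [1.0, 0.04]),
    ("adv_3", [1.0, 0.06]),
]

clean_labels = [label for label, _ in top_k(query, benign, k=3)]
poisoned_labels = [label for label, _ in top_k(query, benign + adversarial, k=3)]

print("clean context:   ", clean_labels)
print("poisoned context:", poisoned_labels)
```

In this toy setup the poisoned top-3 contains only the adversarial labels, so both benign viewpoints drop out of the context passed to the generator.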

Paper Structure

This paper contains 29 sections, 16 equations, 12 figures, 9 tables, 1 algorithm.

Figures (12)

  • Figure 1: Attack and defense in a typical RAG system. A corpus of passages is embedded and stored in a vector database. Given a query, the retriever returns the top-$k$ most relevant passages, which are then combined with the query and passed to the generator to produce the final output. The attacker injects adversarial passages into the corpus to corrupt the knowledge base, thereby influencing the contextual passages and the generator’s output. The defender aims to detect and filter out these adversarial passages.
  • Figure 1: Among the 4,520 adversarial passages generated by our workflow for 452 queries in $\text{WIKI-BALANCE}$DUO, more than 74% satisfy both Property 1 and Property 2.
  • Figure 2: An example from real LLM responses: When adversarial passages are included in the context, the answer can diverge significantly from the one produced using only benign passages. See the appendix for more examples.
  • Figure 3: Illustration of state-of-the-art retrieval strategies in the embedding space. (a) As indicated by the red arrows, MMR and SMART tend to favor diversity by selecting passages that are not necessarily the most similar to the query. (b) BRRA retrieves passages relevant to both the original query and its noise-perturbed variants, then re-ranks them. As a result, benign passages near the perturbed queries (shown as green circles) may be retrieved even if they are distant from the original query.
  • Figure 4: Average PS shift of the top-5 retrieved passages, expressed as a percentage of the unattacked Avg. $|\text{PS}|$.
  • ...and 7 more figures
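
The diversity-favoring behavior that the Figure 3(a) caption attributes to MMR can be sketched with the standard maximal marginal relevance rule: at each step, pick the candidate that maximizes λ·sim(query, d) − (1 − λ)·max sim(d, already selected). The sketch below is a generic textbook form, not the configuration evaluated in the paper; the embeddings, the `lam` values, and the function name `mmr_select` are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mmr_select(query, candidates, k, lam=0.5):
    """Greedy MMR: return indices of k candidates, trading relevance for diversity."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy is the highest similarity to anything already chosen
            # (0.0 while nothing has been selected yet).
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy 2-D embeddings (hypothetical): two near-duplicates close to the query
# and one passage that is less similar to the query but different in content.
query = [1.0, 0.0]
candidates = [[1.0, 0.05], [1.0, 0.04], [0.8, 0.6]]

relevance_only = mmr_select(query, candidates, k=2, lam=1.0)
diverse = mmr_select(query, candidates, k=2, lam=0.3)

print("lam=1.0:", relevance_only)
print("lam=0.3:", diverse)
```

With `lam=1.0` the rule reduces to plain similarity ranking and returns the two near-duplicates; with `lam=0.3` the second pick is the less similar but more distinct passage, which is the behavior the red arrows in the figure indicate.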

Theorems & Definitions (2)

  • proof
  • proof