The RAG Paradox: A Black-Box Attack Exploiting Unintentional Vulnerabilities in Retrieval-Augmented Generation Systems

Chanwoo Choi, Jinsoo Kim, Sukmin Cho, Soyeong Jeong, Buru Chang

TL;DR

The paper investigates a fundamental security tension in retrieval-augmented generation (RAG) systems, where transparent access to retrieved documents and their sources can be exploited by attackers. It introduces PARADOX, a black-box attack that infers a retriever's implicit preferences from publicly exposed content and auto-generates poison documents that are both highly retrievable and natural-looking. Offline experiments show this approach degrades RAG performance across dense and sparse retrievers and multiple datasets, while online experiments demonstrate feasibility against commercial systems. The work highlights a transparency-versus-security dilemma in deployed RAG systems and advocates defense strategies such as anomaly detection, retrieval filtering, and output auditing to mitigate such exploits.

Abstract

With the growing adoption of retrieval-augmented generation (RAG) systems, various attack methods have been proposed to degrade their performance. However, most existing approaches rely on unrealistic assumptions in which external attackers have access to internal components such as the retriever. To address this issue, we introduce a realistic black-box attack based on the RAG paradox, a structural vulnerability arising from the system's effort to enhance trust by revealing both the retrieved documents and their sources to users. This transparency enables attackers to observe which sources are used and how information is phrased, allowing them to craft poisoned documents that are more likely to be retrieved and upload them to the identified sources. Moreover, as RAG systems directly provide retrieved content to users, these documents must not only be retrievable but also appear natural and credible to maintain user confidence in the search results. Unlike prior work that focuses solely on improving document retrievability, our attack method explicitly considers both retrievability and user trust in the retrieved content. Both offline and online experiments demonstrate that our method significantly degrades system performance without internal access, while generating natural-looking poisoned documents.
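Retrieval filtering, one of the mitigations named in the TL;DR, can be illustrated with a small sketch. The example below is not taken from the paper: it assumes a hypothetical setting in which each retrieved passage carries a first-seen age, and it quarantines very recent passages, since a poisoned document uploaded to a known source is likely to be new. The names `Retrieved`, `filter_retrieved`, `age_days`, and `min_age_days` are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Retrieved:
    """A retrieved passage with hypothetical metadata (not from the paper)."""
    text: str
    source: str    # where the passage came from, e.g. a site name
    score: float   # retriever similarity score, higher is better
    age_days: int  # assumed: days since the passage was first indexed


def filter_retrieved(docs: List[Retrieved],
                     min_age_days: int = 30,
                     top_k: int = 3) -> List[Retrieved]:
    """Drop passages younger than min_age_days, then keep the top_k by score.

    This is an age-based quarantine heuristic, assumed for illustration only.
    It trades freshness for robustness and would not stop older poisoned pages.
    """
    kept = [d for d in docs if d.age_days >= min_age_days]
    return sorted(kept, key=lambda d: d.score, reverse=True)[:top_k]


if __name__ == "__main__":
    docs = [
        Retrieved("freshly uploaded profile", "wiki", 0.9, 2),
        Retrieved("long-standing article", "wiki", 0.8, 400),
        Retrieved("older blog post", "blog", 0.7, 90),
    ]
    for d in filter_retrieved(docs, min_age_days=30, top_k=2):
        print(d.source, d.score, d.text)
```

The highest-scoring passage is excluded here because it is only two days old, even though it comes from a source the system already uses. That is the case the abstract describes, where poisoned documents are uploaded to the identified sources.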

Paper Structure

This paper contains 30 sections, 11 figures, 22 tables.

Figures (11)

  • Figure 1: The RAG Paradox: RAG systems reveal retrieved documents and their sources (e.g., LinkedIn, Wikipedia) used in response generation to enhance output credibility. However, this transparency creates critical vulnerabilities. Our Pilot Study: To verify that exposing sources can serve as a vulnerability and entry point for attacks, we conduct a pilot study. We create a fake profile named Vyrelin Drosamir within the identified sources and observe that commercial RAG systems reference this profile in their generated responses. This finding demonstrates that the outputs of RAG systems can be manipulated without access to their internal components.
  • Figure 2: An overview of the new black-box RAG attack scenario based on the RAG Paradox. Our study exploits external resources disclosed by RAG systems to launch attacks without relying on insider information.
  • Figure 3: Documents generated by different attack methods in the medical domain.
  • Figure 4: Additional NES Evaluation
  • Figure 5: Prompts used for Retriever Preference Analysis and Document Generation.
  • ...and 6 more figures