FlippedRAG: Black-Box Opinion Manipulation Adversarial Attacks to Retrieval-Augmented Generation Models

Zhuo Chen, Yuyang Gong, Jiawei Liu, Miaokun Chen, Haotan Liu, Qikai Cheng, Fan Zhang, Wei Lu, Xiaozhong Liu

TL;DR

FlippedRAG introduces a transfer-based attack that reverse-engineers a black-box retriever by training a surrogate retriever via contrastive learning, then uses the surrogate to manipulate retrieval results on controversial topics. It crafts targeted adversarial triggers with PAT and injects them into a small number of documents, driving the RAG system to generate biased outputs and shift user opinions. Empirically, the attack improves the average attack success rate by 16.7% over baselines, shifts the opinion polarity of generated responses by 50% on average, and produces a 20% shift in user cognition, while existing defenses provide limited protection. The work highlights critical security gaps in retrieval pipelines and motivates the development of robust defenses focused on retrieval integrity and data provenance.

Abstract

Retrieval-Augmented Generation (RAG) enriches LLMs by dynamically retrieving external knowledge, reducing hallucinations and satisfying real-time information needs. While existing research mainly targets RAG's performance and efficiency, emerging studies highlight critical security concerns. Yet current adversarial approaches remain limited, mostly addressing white-box scenarios or heuristic black-box attacks without fully investigating vulnerabilities in the retrieval phase. Additionally, prior work mainly focuses on factoid Q&A tasks, where attacks lack complexity and can be easily corrected by advanced LLMs. In this paper, we investigate a more realistic and critical threat scenario: adversarial attacks intended for opinion manipulation against black-box RAG models, particularly on controversial topics. Specifically, we propose FlippedRAG, a transfer-based adversarial attack against black-box RAG systems. We first demonstrate that the underlying retriever of a black-box RAG system can be reverse-engineered, enabling us to train a surrogate retriever. Leveraging the surrogate retriever, we further craft targeted poisoning triggers, altering very few documents to effectively manipulate both retrieval and subsequent generation. Extensive empirical results show that FlippedRAG substantially outperforms baseline methods, improving the average attack success rate by 16.7%. FlippedRAG achieves on average a 50% directional shift in the opinion polarity of RAG-generated responses, ultimately causing a notable 20% shift in user cognition. Furthermore, we evaluate the performance of several potential defensive measures, concluding that existing mitigation strategies remain insufficient against such sophisticated manipulation attacks. These results highlight an urgent need for developing innovative defensive solutions to ensure the security and trustworthiness of RAG systems.
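
To make the surrogate-training idea concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss, one common way to train a retriever from ranking pairs. It is an illustration under stated assumptions, not the paper's exact objective: the function name `info_nce_loss`, the dot-product scoring, and the temperature value are all hypothetical choices. The "positive" stands for a document the black-box retriever was observed to rank highly for a query, and the "negatives" for documents it ranked lower.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot-product relevance score between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """InfoNCE-style loss: low when the positive outscores every negative.

    Computes -log( exp(s_pos / t) / sum_j exp(s_j / t) ), where the sum runs
    over the positive and all negatives.
    """
    scores = [dot(query, positive)] + [dot(query, n) for n in negatives]
    logits = [s / temperature for s in scores]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Toy 2-d embeddings: the surrogate agrees with the observed ranking
# when the highly ranked document scores above the lower-ranked one.
query = [1.0, 0.0]
agrees = info_nce_loss(query, positive=[1.0, 0.0], negatives=[[0.0, 1.0]])
disagrees = info_nce_loss(query, positive=[0.0, 1.0], negatives=[[1.0, 0.0]])
print(agrees < disagrees)
```

Minimizing a loss of this kind over many observed (query, higher-ranked, lower-ranked) triples pushes the surrogate's ranking toward the black-box retriever's, which is what lets triggers optimized against the surrogate transfer to the target system.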

Paper Structure

This paper contains 59 sections, 4 equations, 9 figures, 14 tables, and 1 algorithm.

Figures (9)

  • Figure 1: An example from our pilot study (left box) shows that user prompts can induce ChatGPT to reproduce its retrieved context segments in full in its responses. Motivated by this observation, we imitate the retriever of the black-box RAG model, making its behavior transparent, by enumerating critical queries and candidates. The right part illustrates the opinion manipulation attack against black-box RAG-like systems. When queries on controversial topics are submitted to a benign black-box RAG, it typically retrieves documents reflecting diverse viewpoints and generates neutral, objective responses. In contrast, once a minimal number of documents are subtly modified with triggers generated via a surrogate retrieval model that imitates the black-box retriever, the manipulated RAG exclusively retrieves documents endorsing a specific opinion. This constrains the LLM to generate biased responses and influences users' opinions.
  • Figure 2: Overview of FlippedRAG for manipulating the opinions of RAG-generated content under a black-box setting.
  • Figure 3: The framework for obtaining ranking imitation data from the retrieval model and constructing contrastive pairs in black-box RAG.
  • Figure 4: Overall opinion manipulation effect on Llama-3 with Contriever. We use color blocks to represent the distribution of different opinions and polarities corresponding to the generated content.
  • Figure 5: Distributions of log perplexity (PPL), calculated by GPT-2, on clean documents and on documents attacked by FlippedRAG and the baselines.
  • ...and 4 more figures
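
The log perplexity mentioned in the Figure 5 caption can be illustrated with the standard definition: the negative mean of the per-token log-probabilities a language model assigns to a document. The sketch below assumes those log-probabilities have already been obtained from a model such as GPT-2 (not shown here) and that natural logarithms are used; the function name `log_perplexity` is hypothetical.

```python
import math
from typing import Sequence


def log_perplexity(token_log_probs: Sequence[float]) -> float:
    """Return log PPL = -(1/N) * sum of per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_log_probs) / len(token_log_probs)


# Toy example: a model that gives every token probability 0.25
# has perplexity 4, so its log perplexity equals log(4).
uniform = [math.log(0.25)] * 4
print(math.isclose(log_perplexity(uniform), math.log(4)))
```

Text that a language model finds unnatural receives lower token probabilities and therefore a higher log perplexity, which is why comparing these distributions for clean and attacked documents indicates how detectable the inserted triggers are.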