The Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems

Hongru Song, Yu-an Liu, Ruqing Zhang, Jiafeng Guo, Jianming Lv, Maarten de Rijke, Xueqi Cheng

TL;DR

This paper proposes ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG system and continuously refines its attack strategy using relevance-generation-naturalness rewards. ReGENT significantly outperforms existing attack methods at misleading RAG systems with small, imperceptible text perturbations.

Abstract

We explore adversarial attacks against retrieval-augmented generation (RAG) systems to identify their vulnerabilities. We focus on generating human-imperceptible adversarial examples and introduce a novel imperceptible retrieve-to-generate attack against RAG. This task aims to find imperceptible perturbations that retrieve a target document, originally excluded from the initial top-$k$ candidate set, in order to influence the final answer generation. To address this task, we propose ReGENT, a reinforcement learning-based framework that tracks interactions between the attacker and the target RAG and continuously refines attack strategies based on relevance-generation-naturalness rewards. Experiments on newly constructed factual and non-factual question-answering benchmarks demonstrate that ReGENT significantly outperforms existing attack methods in misleading RAG systems with small imperceptible text perturbations.
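The retrieval half of the task described above can be illustrated with a minimal sketch: check whether a perturbed target document moves from outside the top-$k$ results to inside them. Everything here is an assumption made for illustration and is not taken from the paper: the bag-of-words cosine scorer stands in for a real retriever, the perturbation is assumed to be applied to the target document's text, the function names (`cosine`, `rank_of`, `enters_top_k`) are hypothetical, and neither naturalness nor the effect on the generated answer is modeled.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity (a stand-in for a real retriever)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_of(query: str, docs: list[str], idx: int) -> int:
    """1-based rank of docs[idx] when docs are sorted by similarity to query."""
    scores = [cosine(query, d) for d in docs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order.index(idx) + 1


def enters_top_k(query: str, docs: list[str], idx: int, perturbed: str, k: int) -> bool:
    """True if docs[idx] is outside the top-k before perturbation and inside after."""
    before = rank_of(query, docs, idx)
    perturbed_docs = docs[:idx] + [perturbed] + docs[idx + 1:]
    after = rank_of(query, perturbed_docs, idx)
    return before > k and after <= k


# Hypothetical toy corpus; the last document is the attacker's target.
query = "who invented the telephone"
docs = [
    "alexander graham bell invented the telephone",
    "telephone networks expanded across many cities during the early century",
    "a device for voice calls was created by someone else",
]
# Hypothetical word-substitution perturbation of the target document.
perturbed = "the telephone for voice calls was invented by someone else"

print(enters_top_k(query, docs, idx=2, perturbed=perturbed, k=2))
```

In a black-box setting the attacker would only observe rankings like the one `rank_of` returns, not the retriever's internals, which is why the check is written purely in terms of ranks before and after the perturbation.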

Paper Structure

This paper contains 25 sections, 11 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Overview of the IRG-Attack task.
  • Figure 2: The overall framework of ReGENT.
  • Figure 3: Performance comparison between ReGENT and other word substitution attacks in factual QA (left) and stance-based QA (right) scenarios.
  • Figure 4: ASR comparison before and after adding naturalness constraints. Left: factual QA; Right: stance-based QA. The three groups of bars from left to right in each subplot represent ReGENT, Naive Prompt Attack (NPA), and Prompt Hijacking Attack (PHA).
  • Figure 5: Overview of surrogate retrieval model training process.
  • ...and 3 more figures