Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models

Zhibo Hu, Chen Wang, Yanfeng Shu, Helen Paik, Liming Zhu

TL;DR

The paper investigates the robustness of Retrieval-Augmented Generation (RAG) based LLMs to prompt perturbations and shows that short prefixes can steer retrieval toward targeted incorrect passages. It introduces Gradient Guided Prompt Perturbation (GGPP), which crafts a prefix $a$ for a query $u$ so that the perturbed query embedding $\mathbf{e}_{u'} = \mathcal{M}(a || u)$ moves toward a target passage embedding $\mathbf{e'}$ and away from the embedding of the original passage. It also develops two detectors, SATe and ACT, that flag GGPP-induced perturbations by examining neuron activations, particularly in the last layer, and demonstrates their effectiveness on open-source LLMs. Experiments across multiple datasets and embedding models show high perturbation success rates for GGPP and robust detection performance, with ACT offering a favorable efficiency profile for deployment. The findings highlight important robustness considerations for RAG systems and provide practical guardrails against retrieval-level adversarial prompts.
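The retrieval-steering idea in this summary can be illustrated with a toy sketch. It is not the paper's loss or optimizer: the two-dimensional embeddings, the `steering_loss` function, its `weight` parameter, and the `top_k` helper are all hypothetical, chosen only to show how moving a query embedding toward a target passage and away from the original one changes the top-1 retrieval result.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def steering_loss(e_query, e_target, e_original, weight=1.0):
    """Hypothetical objective: small when the query embedding is close to the
    target passage and far from the original passage (cosine distance)."""
    dist_to_target = 1.0 - cosine(e_query, e_target)
    dist_to_original = 1.0 - cosine(e_query, e_original)
    return dist_to_target - weight * dist_to_original

def top_k(e_query, passages, k=1):
    """Return the names of the k passages most similar to the query."""
    ranked = sorted(passages, key=lambda name: cosine(e_query, passages[name]),
                    reverse=True)
    return ranked[:k]

# Made-up embeddings standing in for the output of an embedding model M.
passages = {"original": [1.0, 0.0], "target": [0.0, 1.0]}
e_u = [0.9, 0.1]        # clean query u
e_u_prime = [0.2, 0.8]  # prefixed query a || u after a successful perturbation

clean_top = top_k(e_u, passages)
perturbed_top = top_k(e_u_prime, passages)
clean_loss = steering_loss(e_u, passages["target"], passages["original"])
perturbed_loss = steering_loss(e_u_prime, passages["target"], passages["original"])
```

In this toy setting the clean query retrieves the original passage and the perturbed query retrieves the target passage, with a correspondingly lower loss. A gradient-guided search over prefix tokens would try to reduce a loss of this kind; the search itself is not sketched here.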

Abstract

The robustness of large language models (LLMs) becomes increasingly important as their use rapidly grows in a wide range of domains. Retrieval-Augmented Generation (RAG) is considered as a means to improve the trustworthiness of text generation from LLMs. However, how the outputs from RAG-based LLMs are affected by slightly different inputs is not well studied. In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers. We systematically evaluate the effect of such prefixes on RAG by introducing a novel optimization technique called Gradient Guided Prompt Perturbation (GGPP). GGPP achieves a high success rate in steering outputs of RAG-based LLMs to targeted wrong answers. It can also cope with instructions in the prompts requesting to ignore irrelevant context. We also exploit LLMs' neuron activation difference between prompts with and without GGPP perturbations to give a method that improves the robustness of RAG-based LLMs through a highly effective detector trained on neuron activation triggered by GGPP generated prompts. Our evaluation on open-sourced LLMs demonstrates the effectiveness of our methods.

Paper Structure (26 sections, 10 equations, 17 figures, 8 tables, 2 algorithms)

Figures (17)

  • Figure 1: Cases of robustness in LLMs (Mistral-7B-v0.1): the text in red font is the adversarial prefix, and the blue boxes contain the retrieved passages. (a) The LLM generates a wrong answer with the prefix; (b) The RAG-based LLM corrects the factual error; (c) The prefix generated by our method triggers a factual error in answers, even with RAG.
  • Figure 2: The GGPP workflow: the top shows how the prefix affects the top-$k$ retrieval result. The text and arrows in red indicate perturbation, altering the ranking of the original correct and targeted incorrect passages. The bottom shows the prefix optimization process.
  • Figure 3: Causal trace of GPT-J -- attentions only: left -- w/o GGPP; right -- w/ GGPP.
  • Figure 4: Causal trace of GPT-J -- MLP states only: left -- w/o GGPP; right -- w/ GGPP.
  • Figure 5: Causal trace of GPT-J -- all hidden states: left -- w/o GGPP; right -- w/ GGPP.
  • ...and 12 more figures