Evaluating and Safeguarding the Adversarial Robustness of Retrieval-Based In-Context Learning

Simon Yu, Jie He, Pasquale Minervini, Jeff Z. Pan

TL;DR

This study reveals that retrieval-augmented models can enhance robustness against test sample attacks, outperforming vanilla ICL with a 4.87% reduction in Attack Success Rate (ASR); however, they exhibit overconfidence in the demonstrations, leading to a 2% increase in ASR for demonstration attacks.

Abstract

With the emergence of large language models, such as LLaMA and OpenAI GPT-3, In-Context Learning (ICL) has gained significant attention due to its effectiveness and efficiency. However, ICL is very sensitive to the choice, order, and verbaliser used to encode the demonstrations in the prompt. Retrieval-Augmented ICL methods try to address this problem by leveraging retrievers to extract semantically related examples as demonstrations. While this approach yields more accurate results, its robustness against various types of adversarial attacks, including perturbations on test samples, demonstrations, and retrieved data, remains under-explored. Our study reveals that retrieval-augmented models can enhance robustness against test sample attacks, outperforming vanilla ICL with a 4.87% reduction in Attack Success Rate (ASR); however, they exhibit overconfidence in the demonstrations, leading to a 2% increase in ASR for demonstration attacks. Adversarial training can help improve the robustness of ICL methods to adversarial attacks; however, such a training scheme can be too costly in the context of LLMs. As an alternative, we introduce an effective training-free adversarial defence method, DARD, which enriches the example pool with attacked samples. We show that DARD yields improvements in performance and robustness, achieving a 15% reduction in ASR over the baselines. Code and data are released to encourage further research: https://github.com/simonucl/adv-retreival-icl
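
The abstract reports results as Attack Success Rate (ASR) but does not define the metric. The sketch below assumes one common definition: among test samples the model classifies correctly before the attack, the percentage whose prediction becomes wrong after the attack. The function name `attack_success_rate` and its arguments are hypothetical and chosen for illustration; the paper's own evaluation code may compute ASR differently.

```python
from typing import Sequence


def attack_success_rate(
    labels: Sequence[int],
    clean_preds: Sequence[int],
    attacked_preds: Sequence[int],
) -> float:
    """Return ASR as a percentage, under the assumed definition:
    of the samples predicted correctly on clean input, the share
    whose prediction is wrong on the attacked input."""
    if not (len(labels) == len(clean_preds) == len(attacked_preds)):
        raise ValueError("all inputs must have the same length")

    # Only samples the model got right before the attack can be "broken".
    correct = [i for i, (y, p) in enumerate(zip(labels, clean_preds)) if y == p]
    if not correct:
        return 0.0

    # An attack succeeds when a previously correct prediction becomes wrong.
    flipped = sum(1 for i in correct if attacked_preds[i] != labels[i])
    return 100.0 * flipped / len(correct)


# Toy example: 3 of 4 samples are correct before the attack; 1 of those 3 flips.
labels = [1, 0, 1, 1]
clean_preds = [1, 0, 0, 1]
attacked_preds = [0, 0, 0, 1]
asr = attack_success_rate(labels, clean_preds, attacked_preds)
```

Under this definition a lower ASR means a more robust model, which matches how the abstract reads a reduction in ASR as an improvement.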

Paper Structure (37 sections, 6 equations, 6 figures, 19 tables)


Figures (6)

  • Figure 1: Overview of the paper. We visualize our seven adversarial attacks in (a), (c) and (d) (only 3 shots are used in the plot for display purposes). Our adversarial defence method, DARD, is shown in the top-right panel (b).
  • Figure 2: Attack Success Rate (ASR) across shots among different ICL methods. We aggregated the results for ICL and $k$NN-ICL across 3 seeds and R-ICL results across 3 retrievers.
  • Figure 3: Attack Success Rate (%) for adversarial attacks across various models, based on experiments conducted on the RTE dataset with 8-shot demonstrations. The results are based on R-ICL and the mean attack success rate among the three retrievers we used: BM25, SBERT, and Instructor.
  • Figure 4: Analysis of attack transferability for Retrieval-ICL on larger model variants within the same family: (Left) the LLaMA family; (Mid) the Mistral family; and (Right) across models from different families. Models are ordered by parameter size and release date. Models in bold are the ones being attacked.
  • Figure 5: Attack Success Rate (%) for adversarial attacks across various models, based on experiments conducted on the RTE dataset with 8-shot demonstrations. The results are based on ICL and involve aggregating results from 3 shots.
  • ...and 1 more figures