Table of Contents
Fetching ...

In-Context Learning Can Re-learn Forbidden Tasks

Sophie Xhonneux, David Dobre, Jian Tang, Gauthier Gidel, Dhanya Sridhar

TL;DR

The paper investigates whether in-context learning (ICL) can re-learn forbidden tasks that safety-finetuned LLMs are trained to refuse. It evaluates sentiment classification and link-based hallucination tasks across Vicuna-7B, Starling-7B, and Llama2-7B, showing that ICL can undermine safety guardrails for some models, while Llama2-7B demonstrates stronger resilience. The authors introduce In-Chat ICL, a prompt-injection-like variant, which significantly enhances attack success on two models, and they provide qualitative and log-probability analyses to characterize the effect. The work highlights that safety defenses must account for ICL as a potential attack vector and suggests that model-specific training can influence vulnerability, guiding future safety research and deployment considerations.

Abstract

Despite significant investment into safety training, large language models (LLMs) deployed in the real world still suffer from numerous vulnerabilities. One perspective on LLM safety training is that it algorithmically forbids the model from answering toxic or harmful queries. To assess the effectiveness of safety training, in this work, we study forbidden tasks, i.e., tasks the model is designed to refuse to answer. Specifically, we investigate whether in-context learning (ICL) can be used to re-learn forbidden tasks despite the explicit fine-tuning of the model to refuse them. We first examine a toy example of refusing sentiment classification to demonstrate the problem. Then, we use ICL on a model fine-tuned to refuse to summarise made-up news articles. Finally, we investigate whether ICL can undo safety training, which could represent a major security risk. For the safety task, we look at Vicuna-7B, Starling-7B, and Llama2-7B. We show that the attack works out-of-the-box on Starling-7B and Vicuna-7B but fails on Llama2-7B. Finally, we propose an ICL attack that uses the chat template tokens like a prompt injection attack to achieve a better attack success rate on Vicuna-7B and Starling-7B. Trigger Warning: the appendix contains LLM-generated text with violence, suicide, and misinformation.

In-Context Learning Can Re-learn Forbidden Tasks

TL;DR

The paper investigates whether in-context learning (ICL) can re-learn forbidden tasks that safety-finetuned LLMs are trained to refuse. It evaluates sentiment classification and link-based hallucination tasks across Vicuna-7B, Starling-7B, and Llama2-7B, showing that ICL can undermine safety guardrails for some models, while Llama2-7B demonstrates stronger resilience. The authors introduce In-Chat ICL, a prompt-injection-like variant, which significantly enhances attack success on two models, and they provide qualitative and log-probability analyses to characterize the effect. The work highlights that safety defenses must account for ICL as a potential attack vector and suggests that model-specific training can influence vulnerability, guiding future safety research and deployment considerations.

Abstract

Despite significant investment into safety training, large language models (LLMs) deployed in the real world still suffer from numerous vulnerabilities. One perspective on LLM safety training is that it algorithmically forbids the model from answering toxic or harmful queries. To assess the effectiveness of safety training, in this work, we study forbidden tasks, i.e., tasks the model is designed to refuse to answer. Specifically, we investigate whether in-context learning (ICL) can be used to re-learn forbidden tasks despite the explicit fine-tuning of the model to refuse them. We first examine a toy example of refusing sentiment classification to demonstrate the problem. Then, we use ICL on a model fine-tuned to refuse to summarise made-up news articles. Finally, we investigate whether ICL can undo safety training, which could represent a major security risk. For the safety task, we look at Vicuna-7B, Starling-7B, and Llama2-7B. We show that the attack works out-of-the-box on Starling-7B and Vicuna-7B but fails on Llama2-7B. Finally, we propose an ICL attack that uses the chat template tokens like a prompt injection attack to achieve a better attack success rate on Vicuna-7B and Starling-7B. Trigger Warning: the appendix contains LLM-generated text with violence, suicide, and misinformation.
Paper Structure (38 sections, 1 equation, 7 figures, 4 tables)

This paper contains 38 sections, 1 equation, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Attack success rate with increasing number of tokens in the in-context examples. Labels at each marker show the number of examples used.
  • Figure 2: AdvBench examples
  • Figure 3: Maximum semantic similarity to in-context example. Success/fails are obtained by checking for the presence of negative keywords (such as I'm sorry) in the output generated by the Vicuna-7b v1.5 model.
  • Figure 4: Log probability of affirmative response during ICL vs no examples. Values above 0 mean that the affirmative response of the model to the harmful request became more likely. Below 0 it means the affirmative response became less likely.
  • Figure 5: Sentiment classification. Fine-tuned with DPO and $\beta=0.25$ In Subfigure (a) the model has, the model successfully refuses to answer sentiment classification queries, in (b) we show the distribution of model predictions after the ICL attack, in (c) we show whether the model prediction's were correct.
  • ...and 2 more figures