In-Context Learning Can Re-learn Forbidden Tasks
Sophie Xhonneux, David Dobre, Jian Tang, Gauthier Gidel, Dhanya Sridhar
TL;DR
The paper investigates whether in-context learning (ICL) can re-learn forbidden tasks, i.e., tasks that a fine-tuned LLM is trained to refuse. It first studies two fine-tuned refusal settings, a toy sentiment classification task and the summarisation of made-up news articles (a link-based hallucination task), and then tests whether ICL can undo safety training on Vicuna-7B, Starling-7B, and Llama2-7B. The attack works out-of-the-box on Vicuna-7B and Starling-7B but fails on Llama2-7B. The authors also introduce In-Chat ICL, a prompt-injection-like variant that uses chat template tokens and achieves a better attack success rate on Vicuna-7B and Starling-7B, and they provide qualitative and log-probability analyses to characterize the effect. The work highlights that safety defenses must account for ICL as a potential attack vector and suggests that vulnerability depends on how a specific model was trained.
Abstract
Despite significant investment in safety training, large language models (LLMs) deployed in the real world still suffer from numerous vulnerabilities. One perspective on LLM safety training is that it algorithmically forbids the model from answering toxic or harmful queries. To assess the effectiveness of safety training, in this work, we study forbidden tasks, i.e., tasks the model is designed to refuse to answer. Specifically, we investigate whether in-context learning (ICL) can be used to re-learn forbidden tasks despite the explicit fine-tuning of the model to refuse them. We first examine a toy example of refusing sentiment classification to demonstrate the problem. Then, we use ICL on a model fine-tuned to refuse to summarise made-up news articles. Next, we investigate whether ICL can undo safety training, which could represent a major security risk. For the safety task, we look at Vicuna-7B, Starling-7B, and Llama2-7B. We show that the attack works out-of-the-box on Starling-7B and Vicuna-7B but fails on Llama2-7B. Finally, we propose an ICL attack that uses the chat template tokens in the manner of a prompt injection attack to achieve a better attack success rate on Vicuna-7B and Starling-7B. Trigger Warning: the appendix contains LLM-generated text with violence, suicide, and misinformation.
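The difference between plain ICL and the chat-template variant can be illustrated with the benign sentiment classification toy task. The sketch below is only one reading of "uses the chat template tokens", not the authors' implementation: the role tags `USER:` and `ASSISTANT:`, the `Review:`/`Sentiment:` labels, and both function names are hypothetical placeholders, and a real model would require its own chat template.

```python
from typing import List, Tuple

# Hypothetical role tags; each real model defines its own chat template.
USER_TAG = "USER:"
ASSISTANT_TAG = "ASSISTANT:"

Example = Tuple[str, str]  # (review text, sentiment label)


def build_plain_icl(examples: List[Example], query: str) -> str:
    """Plain ICL: every demonstration sits inside a single user turn."""
    demos = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples)
    user_msg = f"{demos}\nReview: {query}\nSentiment:"
    return f"{USER_TAG} {user_msg}\n{ASSISTANT_TAG}"


def build_in_chat_icl(examples: List[Example], query: str) -> str:
    """Chat-template variant (assumed form): each demonstration is laid out
    as its own user/assistant exchange, so the labels look like earlier
    assistant replies rather than text typed by the user."""
    turns = [
        f"{USER_TAG} Review: {x}\n{ASSISTANT_TAG} Sentiment: {y}"
        for x, y in examples
    ]
    turns.append(f"{USER_TAG} Review: {query}\n{ASSISTANT_TAG}")
    return "\n".join(turns)


demos = [
    ("The film was a delight from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]
query = "A solid cast wasted on a dull script."

plain_prompt = build_plain_icl(demos, query)
in_chat_prompt = build_in_chat_icl(demos, query)
```

Both prompts carry the same demonstrations. They differ only in whether the role tags wrap the whole prompt once or wrap each demonstration separately, which is the property the abstract compares to prompt injection.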
