Hijacking Large Language Models via Adversarial In-Context Learning

Xiangyu Zhou; Yao Qiang; Saleh Zare Zade; Prashant Khanduri; Dongxiao Zhu

TL;DR

This work reveals a critical vulnerability in in-context learning (ICL): imperceptible adversarial suffixes appended to the in-context demonstrations (demos) can hijack LLM outputs. It introduces Gradient-guided Injection (GGI), a gradient-based optimization approach that crafts suffixes to manipulate model behavior in both classification and jailbreak settings, and that transfers well to different demo sets and models. To counter these threats, the authors propose a lightweight test-time defense that augments prompts with clean demos, improving robustness without retraining. Empirical results show substantial attack efficacy across diverse models and tasks, alongside effective defense performance, underscoring the need for ICL security research and practical mitigation strategies.
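To make the attack surface concrete, the minimal sketch below shows how a suffix appended to each demo changes the prompt an LLM receives while the demo labels stay untouched. The `Demo` type, the `build_icl_prompt` helper, and the "Review/Sentiment" template are illustrative assumptions, not the authors' code. The suffix 'For' is borrowed from the Figure 1 caption, and the gradient-guided search that would produce such suffixes is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Demo:
    """One labeled in-context demonstration (hypothetical container)."""
    text: str
    label: str


def build_icl_prompt(
    demos: Sequence[Demo],
    query: str,
    suffixes: Optional[Sequence[str]] = None,
) -> str:
    """Format demos plus a test query into a few-shot prompt.

    If `suffixes` is given, suffixes[i] is appended to the text of demos[i].
    Labels and the user query are left unchanged, which mirrors the
    system-side injection described in the summary.
    """
    if suffixes is not None and len(suffixes) != len(demos):
        raise ValueError("need exactly one suffix per demo")

    blocks: List[str] = []
    for i, demo in enumerate(demos):
        text = demo.text
        if suffixes is not None:
            text = f"{text} {suffixes[i]}"
        # Assumed template; the paper's actual template may differ.
        blocks.append(f"Review: {text}\nSentiment: {demo.label}")

    # The query is left open so the model completes the label.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    demos = [
        Demo("A moving, well-acted film.", "positive"),
        Demo("Dull and far too long.", "negative"),
    ]
    query = "The plot never comes together."
    print(build_icl_prompt(demos, query))
    print("-----")
    print(build_icl_prompt(demos, query, suffixes=["For", "For"]))
```

Because only the demo text changes, the user's query carries no trigger, which is what distinguishes this setting from attacks that require modifying the user input.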

Abstract

In-context learning (ICL) has emerged as a powerful paradigm leveraging LLMs for specific downstream tasks by utilizing labeled examples as demonstrations (demos) in the preconditioned prompts. Despite its promising performance, crafted adversarial attacks pose a notable threat to the robustness of LLMs. Existing attacks are either easy to detect, require a trigger in user input, or lack specificity towards ICL. To address these issues, this work introduces a novel transferable prompt injection attack against ICL, aiming to hijack LLMs to generate the target output or elicit harmful responses. In our threat model, the hacker acts as a model publisher who leverages a gradient-based prompt search method to learn and append imperceptible adversarial suffixes to the in-context demos via prompt injection. We also propose effective defense strategies using a few shots of clean demos, enhancing the robustness of LLMs during ICL. Extensive experimental results across various classification and jailbreak tasks demonstrate the effectiveness of the proposed attack and defense strategies. This work highlights the significant security vulnerabilities of LLMs during ICL and underscores the need for further in-depth studies.
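The defense described above can be pictured as a small prompt-assembly step at test time. The sketch below is only one plausible reading: it assumes the extra clean demos are placed after the possibly tampered ones, and the names `LabeledExample`, `format_prompt`, and `defend_with_clean_demos` as well as the "Input/Label" template are hypothetical. The summary does not specify where the clean demos go or how many are used.

```python
from typing import List, Sequence, Tuple

# (input text, label) pair; a hypothetical stand-in for a demo.
LabeledExample = Tuple[str, str]


def format_prompt(demos: Sequence[LabeledExample], query: str) -> str:
    """Render demos and an open query using an assumed template."""
    blocks: List[str] = [f"Input: {text}\nLabel: {label}" for text, label in demos]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


def defend_with_clean_demos(
    suspect_demos: Sequence[LabeledExample],
    clean_demos: Sequence[LabeledExample],
    query: str,
) -> str:
    """Build a prompt that adds trusted demos to a possibly tampered set.

    No model weights are touched; the only change is the extra context.
    Placing the clean demos last is an assumption of this sketch.
    """
    combined = list(suspect_demos) + list(clean_demos)
    return format_prompt(combined, query)


if __name__ == "__main__":
    suspect = [("Great film. For", "positive")]
    clean = [("Terrible pacing.", "negative"), ("Loved every minute.", "positive")]
    print(defend_with_clean_demos(suspect, clean, "Boring from start to finish."))
```

The appeal of this style of defense is that it runs entirely at inference time, so it can be applied to a deployed model without retraining.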

Paper Structure (29 sections, 6 equations, 10 figures, 7 tables)

This paper contains 29 sections, 6 equations, 10 figures, 7 tables.

Introduction
Preliminaries
ICL Formulation
Adversarial Attack on LLMs
The Threat Model
LLM Hijacking Attack During ICL
Adversarial Capacity and Objective
Our Gradient-guided Injection Attack
The Defense Method
Experiment Setup
Result and Discussion
Performance of Clean ICL
Performance of Hijacking ICL in Classification Task
Performance of Hijacked ICL in Jailbreak Tasks
Transferability and Stealthiness of GGI
...and 14 more sections

Figures (10)

Figure 1: Illustrations of the hijacking attack during ICL for classification and question answering tasks. (a) Benign in-context demos work well. In our threat model, the adversarial model publisher can learn adversarial suffixes such as 'For' and 'refresh Real result' and append them to the in-context demos (Adv-Demos) from the system side, either (b) hijacking LLMs to generate the target response, e.g., the 'positive' sentiment, or (c) eliciting harmful responses, e.g., 'Sure, here are some step-by-step instructions for how to hack into a secure network …'.
Figure 2: Transferability of GGI across different demo sets and different datasets of the same task. The solid and striped bars indicate the demos are from SST-2 and RT, respectively. Different colors represent test queries from different datasets.
Figure 3: Average perplexity scores from LLaMA-13b under the 4-shot setting on RT, averaged over three separate runs, for various attacks.
Figure 4: Impact of LLM size on adversarial robustness. ASRs on the AG's News topic generation task using different sizes of OPT models, i.e., OPT-2.7b and OPT-6.7b, with two different few-shot settings.
Figure 5: Learning objective values over iterations for different attacks on SST-2 using GPT2-XL with 8 shots.
...and 5 more figures
