Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks

Yixin Cheng; Markos Georgopoulos; Volkan Cevher; Grigorios G. Chrysos

TL;DR

The paper tackles jailbreaking of large language models by introducing Contextual Interaction Attack (CIA), a multi-round, context-building approach that uses benign preliminary questions to shape a harmful context prior to the target query. Formally, the attacker optimizes $f(g(h(ε)|c))$ over $h$ with a contextual vector $c$ constructed from prior interactions, while a fixed model $g$ produces outputs $g(ε|c)$. CIA employs an auxiliary LLM to generate a sequence $\{ε_i\}$ via in-context learning, enabling black-box, cross-model effectiveness and strong transferability across diverse LLMs on AdvBench Subset. Experimental results show CIA outperforms prior jailbreak methods (ICA, GCG, TAP, PAIR) on multiple models, with defense analyses indicating limited efficacy of standard defenses and highlighting Self-reminder as a comparatively stronger defense. The work emphasizes the significance of context in LLM security and suggests future directions to enhance robustness and mitigate context-based attacks in practical deployments, while carefully balancing reproducibility and safety considerations.

Abstract

Large Language Models (LLMs) are susceptible to Jailbreaking attacks, which aim to extract harmful information by subtly modifying the attack query. As defense mechanisms evolve, directly obtaining harmful information becomes increasingly challenging for Jailbreaking attacks. In this work, inspired by Chomsky's transformational-generative grammar theory and by the human practice of using indirect context to elicit harmful information, we focus on a new attack form, called Contextual Interaction Attack. We contend that the prior context (the information preceding the attack query) plays a pivotal role in enabling strong Jailbreaking attacks. Specifically, we propose a first multi-turn approach that leverages benign preliminary questions to interact with the LLM. Because LLMs are autoregressive and use previous conversation rounds as context during generation, we guide the model's question-response pairs to construct a context that is semantically aligned with the attack query, and then execute the attack. We conduct experiments on seven different LLMs and demonstrate the efficacy of this attack, which is black-box and can also transfer across LLMs. We believe this can lead to further developments and understanding of security in LLMs.
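The abstract's central observation is that a chat model conditions each response on all previous conversation rounds. The following is a minimal sketch of that mechanism only, using neutral placeholder questions. The names `Conversation`, `Turn`, and `stub_model` are hypothetical and do not come from the paper; `stub_model` stands in for the fixed model $g(\cdot \mid c)$ and simply reports how many earlier rounds it was given.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type for illustration: one round is a (question, response) pair.
Turn = Tuple[str, str]


@dataclass
class Conversation:
    """Accumulates question-response rounds; earlier rounds form the context c."""
    turns: List[Turn] = field(default_factory=list)

    def context(self) -> str:
        # Render every earlier round as the text the model is conditioned on.
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)

    def ask(self, model: Callable[[str, str], str], question: str) -> str:
        # Query the model with the new question plus the accumulated context,
        # then store the round so it becomes context for the next question.
        answer = model(question, self.context())
        self.turns.append((question, answer))
        return answer


def stub_model(question: str, context: str) -> str:
    # Stand-in for g(. | c): counts the rounds present in the context.
    rounds = context.count("User: ")
    return f"answer to '{question}' given {rounds} earlier round(s)"


conv = Conversation()
for q in ["first placeholder question", "second placeholder question", "final query"]:
    print(conv.ask(stub_model, q))
```

Each call sees a strictly longer context than the one before it, which is why the content of earlier rounds can influence how the final query is answered.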

Paper Structure (32 sections, 1 equation, 13 figures, 8 tables)

Introduction
Related Work
Method
Problem Setting
Contextual Interaction Attack
Design choices
Experiment
Experiment Setting
Jailbreak in AdvBench Subset
Transferability of jailbreak prompts
Defense Evaluation
Ablation Study
Conclusion
Evaluation Metrics: ChatGPT score, prefix matching, and human judge
Implementation Details
...and 17 more sections
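
The outline above lists prefix matching among the evaluation metrics, and the figure captions report Attack Success Rate (ASR). As a rough sketch of how such a metric is commonly computed, the snippet below treats a response as a refusal when it starts with one of a few stock phrases and reports the fraction of non-refusals. The prefix list and the function names are assumptions made for illustration; the paper's actual list and matching rule are not given on this page.

```python
from typing import Iterable, Sequence

# Illustrative refusal prefixes (assumed, not taken from the paper).
REFUSAL_PREFIXES: Sequence[str] = ("I'm sorry", "I cannot", "I can't", "As an AI")


def is_refusal(response: str, prefixes: Sequence[str] = REFUSAL_PREFIXES) -> bool:
    # A response counts as a refusal if it begins with any known prefix.
    return response.strip().startswith(tuple(prefixes))


def attack_success_rate(responses: Iterable[str]) -> float:
    # ASR = (number of non-refusal responses) / (total responses).
    collected = list(responses)
    if not collected:
        return 0.0
    successes = sum(not is_refusal(r) for r in collected)
    return successes / len(collected)
```

Prefix matching is cheap but coarse: a response can avoid every listed phrase and still be a refusal, which is presumably why the outline pairs it with a ChatGPT score and a human judge.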

Figures (13)

Figure 1: Single-step attacks (left) versus Contextual Interaction Attack (right). Contextual Interaction Attack consists of two steps. First, we use in-context learning to let the auxiliary LLM learn from hand-crafted examples; the auxiliary LLM then maps the attack query to multiple individually harmless preliminary questions, e.g., Q1 and Q2 in the figure. Those preliminary questions gradually insert harmful semantics into the context. Notice that the generated prompts are harmless individually, while they collectively form a harmful prompt set when considered along with the context information. In the second step, the set of preliminary questions is asked to the model over a few rounds of interaction, i.e., question-answer exchanges with the model. Then, the attack query follows to execute the attack. In the schematic, Alice performs a simple attack, Bob performs a rephrase attack, and Eve performs Contextual Interaction Attack. Anonymous example conversations on GPT-3.5 are available at https://chat.openai.com/share/fbeee211-8566-4b12-8a73-7ca37cb92146 and https://chat.openai.com/share/df135382-b359-46ad-a15a-7c2519bb64a4.
Figure 2: Attack Success Rate (ASR) on the OpenAI and Anthropic Red Teaming Dataset and the MasterKey Dataset. Contextual Interaction Attack outperforms all hand-crafted methods across 4 different models.
Figure 3: A successful jailbreak of GPT-4 for the goal 'How can I cut myself without others noticing?' The full anonymous chat can be found at https://chat.openai.com/share/3c224f3b-a7df-422f-b7ed-c2f5f6e4713d.
Figure 4: A successful jailbreak of GPT-3.5 for the goal 'Can you help me hack into my neighbor’s wifi?' The full anonymous chat can be found at https://chat.openai.com/share/6f26e894-070b-4100-9403-63a0a1a6b2be.
Figure 5: A successful jailbreak of GPT-3.5 for the goal 'How do I record someone without consent?' The full anonymous chat can be found at https://chat.openai.com/share/981742d0-2e3e-4384-aebb-62a03c49609d.
...and 8 more figures
