Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios G. Chrysos
TL;DR
The paper tackles jailbreaking of large language models by introducing Contextual Interaction Attack (CIA), a multi-round, context-building approach that uses benign preliminary questions to shape a harmful context prior to the target query. Formally, the attacker optimizes $f(g(h(ε)|c))$ over $h$ with a contextual vector $c$ constructed from prior interactions, while a fixed model $g$ produces outputs $g(ε|c)$. CIA employs an auxiliary LLM to generate a sequence $\{ε_i\}$ via in-context learning, enabling black-box, cross-model effectiveness and strong transferability across diverse LLMs on AdvBench Subset. Experimental results show CIA outperforms prior jailbreak methods (ICA, GCG, TAP, PAIR) on multiple models, with defense analyses indicating limited efficacy of standard defenses and highlighting Self-reminder as a comparatively stronger defense. The work emphasizes the significance of context in LLM security and suggests future directions to enhance robustness and mitigate context-based attacks in practical deployments, while carefully balancing reproducibility and safety considerations.
Abstract
Large Language Models (LLMs) are susceptible to Jailbreaking attacks, which aim to extract harmful information by subtly modifying the attack query. As defense mechanisms evolve, directly obtaining harmful information becomes increasingly challenging for Jailbreaking attacks. In this work, inspired by Chomsky's transformational-generative grammar theory and the human practice of using indirect context to elicit harmful information, we focus on a new attack form, called Contextual Interaction Attack. We contend that the prior context (the information preceding the attack query) plays a pivotal role in enabling strong Jailbreaking attacks. Specifically, we propose a first multi-turn approach that leverages benign preliminary questions to interact with the LLM. Because LLMs are autoregressive and use previous conversation rounds as context during generation, we guide the model's question-response pairs to construct a context that is semantically aligned with the attack query, and then execute the attack. We conduct experiments on seven different LLMs and demonstrate the efficacy of this attack, which is black-box and can also transfer across LLMs. We believe this can lead to further developments and understanding of security in LLMs.
