Universal Jailbreak Suffixes Are Strong Attention Hijackers

Matan Ben-Tov, Mor Geva, Mahmood Sharif

TL;DR

Suffix-based jailbreaks exploit a shallow information flow from the adversarial suffix to the final chat template tokens that precede generation, and a suffix's universality grows with how strongly it hijacks the contextualization of those tokens. By formalizing a dominance metric and demonstrating a causal, patchable pathway, the work links suffix universality to hijacking strength and shows how to both enhance and suppress this behavior with minimal utility impact. The paper introduces three practical techniques grounded in these mechanistic insights (HijEnh to boost universality, Hij.Suppr to mitigate attacks, and Hij.Detect for quick identification) and validates them across multiple models and datasets. These findings advance understanding of jailbreak mechanics and offer concrete, efficient defenses and detection strategies for safer LLM deployment.

Abstract

We study suffix-based jailbreaks, a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack, we observe that suffixes vary in efficacy: some are markedly more universal (generalizing to many unseen harmful instructions) than others. We first show that a shallow, critical mechanism drives GCG's effectiveness. This mechanism builds on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon, with more universal suffixes being stronger hijackers. Subsequently, we show that these insights have practical implications: GCG's universality can be efficiently enhanced (up to $5\times$ in some cases) at no additional computational cost, and can also be surgically mitigated, at least halving the attack's success with minimal utility loss. We release our code and data at http://github.com/matanbt/interp-jailbreak.
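
To make the idea of quantifying dominance more concrete, here is a minimal sketch for a single toy attention head. It splits the attention output at the last position into per-span contributions (instr, adv, chat) and scores each span by the projection of its contribution onto the total output. The normalization, the function name `span_dominance`, and the toy numbers are illustrative assumptions; the paper's own dominance score may be defined differently.

```python
def span_dominance(attn_row, values, spans):
    """Score each token span by its share of the last position's attention output.

    attn_row: attention weights of the last query over all T keys.
    values:   T value vectors (assumed already output-projected), each of length d.
    spans:    mapping from span name to (start, end) token indices, end exclusive.
    """
    d = len(values[0])
    # Per-token contribution to the attention output: a_j * v_j.
    contrib = [[a * v for v in vec] for a, vec in zip(attn_row, values)]
    total = [sum(c[k] for c in contrib) for k in range(d)]
    denom = sum(t * t for t in total)
    scores = {}
    for name, (start, end) in spans.items():
        span_vec = [sum(c[k] for c in contrib[start:end]) for k in range(d)]
        # Projection of the span's contribution onto the total output.
        scores[name] = sum(s * t for s, t in zip(span_vec, total)) / denom
    return scores


# Toy example: 6 tokens with 2-dimensional values (all numbers are made up).
attn_row = [0.05, 0.05, 0.40, 0.35, 0.05, 0.10]
values = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0], [0.2, 1.8], [1.0, 1.0], [0.3, 0.3]]
spans = {"instr": (0, 2), "adv": (2, 4), "chat": (4, 6)}
print(span_dominance(attn_row, values, spans))
```

When the spans partition the sequence, the scores sum to 1, so a score close to 1 for adv would indicate that the suffix supplies most of what the last position reads from its context.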

Paper Structure

This paper contains 28 sections, 7 equations, 19 figures, 4 tables.

Figures (19)

  • Figure 1: We explore suffix-based jailbreaks on safety-aligned LLMs, which (a) append an adversarial suffix (adv) to a harmful instruction (instr) and elicit an affirmative, unsafe response. We find that (b) the final chat template tokens (chat) play a crucial part in jailbreak behavior, specifically (c) common suffix-based jailbreaks effectively hijack the chat representation with irregular strength, and (d) the more universal the suffix, the stronger the hijacking; (e) our insights enable both enhancing and mitigating these attacks.
  • Figure 2: Universality of >1K GCG suffixes on Gemma2. (a) Suffixes often generalize beyond their target instruction; (b) suffixes also enhance prefilling attacks, exceeding their explicit optimization objective and outperforming random suffixes (dashed line).
  • Figure 3: Knockout effect of edges on GCG jailbreak suffixes (dots), measured by the proportion of failed jailbreaks (Jailbreak Flip Rate). Panels (a) and (b) highlight the critical role of adv$\to$chat in enabling jailbreaks, even when the response is prefilled with an affirmation.
  • Figure 4: Patching the attention output at position chat+i ($x$-axis) from successful attacks into failed ones turns the latter into successful attacks, reflecting the shallowness of GCG jailbreaks.
  • Figure 5: Quantifying the dominance of the contributors to chat[-1] (using the dominance score), on a harmful instruction (asking how to build a bomb), when adv is set to (a) a random suffix or (b) a GCG suffix.
  • ...and 14 more figures
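
The knockout experiment described in the Figure 3 caption can be sketched on a toy attention matrix. The sketch below cuts selected query-to-key edges (for example, chat queries attending to adv keys) before the softmax, and computes a flip rate over a batch of attacks. The function names, the masking-by-negative-infinity approach, and the choice of denominator (attacks that succeeded before the knockout) are assumptions made for illustration, not the paper's implementation.

```python
import math


def knockout_attention(scores, query_pos, key_pos):
    """Row-wise softmax of pre-softmax scores after cutting query->key edges.

    scores: T x T list of lists, assumed already causally masked with -inf.
    """
    blocked = {(q, k) for q in query_pos for k in key_pos}
    weights = []
    for q, row in enumerate(scores):
        masked = [-math.inf if (q, k) in blocked else s for k, s in enumerate(row)]
        top = max(masked)  # finite as long as one edge (e.g. to itself) remains
        exps = [math.exp(s - top) for s in masked]
        norm = sum(exps)
        weights.append([e / norm for e in exps])
    return weights


def jailbreak_flip_rate(success_before, success_after):
    """Proportion of initially successful jailbreaks that fail after the knockout."""
    flipped = sum(1 for b, a in zip(success_before, success_after) if b and not a)
    total = sum(1 for b in success_before if b)
    return flipped / total if total else 0.0
```

In a real model the knockout would be applied inside every attention layer and the success labels would come from judging the generated responses; both steps are outside the scope of this sketch.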