Universal Jailbreak Suffixes Are Strong Attention Hijackers

Matan Ben-Tov; Mor Geva; Mahmood Sharif

TL;DR

Suffix-based jailbreaks exploit a shallow information flow from the adversarial suffix to the final chat-template tokens before generation, and universality arises when a suffix hijacks the contextualization of those tokens more strongly. By formalizing a dominance metric and demonstrating a causal, patchable pathway, the work links suffix universality to hijacking strength and shows how to both enhance and suppress this behavior with minimal utility impact. The paper introduces three techniques grounded in these mechanistic insights (HijEnh to boost universality, Hij.Suppr to mitigate attacks, and Hij.Detect for quick identification) and validates them across multiple models and datasets. These findings advance the understanding of jailbreak mechanics and offer concrete, efficient defense and detection strategies for safer LLM deployment.

Abstract

We study suffix-based jailbreaks, a powerful family of attacks against large language models (LLMs) that optimize adversarial suffixes to circumvent safety alignment. Focusing on the widely used foundational GCG attack, we observe that suffixes vary in efficacy: some are markedly more universal (generalizing to many unseen harmful instructions) than others. We first show that a shallow, critical mechanism drives GCG's effectiveness. This mechanism builds on the information flow from the adversarial suffix to the final chat template tokens before generation. Quantifying the dominance of this mechanism during generation, we find GCG irregularly and aggressively hijacks the contextualization process. Crucially, we tie hijacking to the universality phenomenon, with more universal suffixes being stronger hijackers. Subsequently, we show that these insights have practical implications: GCG's universality can be efficiently enhanced (up to $5\times$ in some cases) at no additional computational cost, and can also be surgically mitigated, at least halving the attack's success with minimal utility loss. We release our code and data at http://github.com/matanbt/interp-jailbreak.
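
To make the idea of information flow from the suffix to the final chat template tokens more concrete, the following is a minimal sketch of a simplified proxy. It is not the dominance metric used in the paper, which is not defined in this abstract. It assumes a single attention map from one head of one layer, given as a row-stochastic matrix, and measures what fraction of the attention from the chat-template positions lands on the suffix positions. The function name `suffix_attention_share`, its parameters, and the toy matrix are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def suffix_attention_share(
    attn: Sequence[Sequence[float]],
    suffix_positions: Sequence[int],
    template_positions: Sequence[int],
) -> float:
    """Average fraction of attention that template tokens place on suffix tokens.

    attn[q][k] is the attention weight from query position q to key position k
    (one head, one layer). This is an illustrative proxy, NOT the paper's metric.
    """
    if not template_positions:
        raise ValueError("template_positions must be non-empty")
    suffix = set(suffix_positions)
    shares = []
    for q in template_positions:
        row = attn[q]
        total = sum(row)
        on_suffix = sum(w for k, w in enumerate(row) if k in suffix)
        # Normalize by the row total so unnormalized rows are also handled.
        shares.append(on_suffix / total if total > 0 else 0.0)
    return sum(shares) / len(shares)


# Toy causal attention map over 5 tokens (made-up numbers):
# positions 0-1 = instruction, 2-3 = adversarial suffix, 4 = chat template token.
toy_attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0],
    [0.2, 0.2, 0.6, 0.0, 0.0],
    [0.1, 0.1, 0.4, 0.4, 0.0],
    [0.1, 0.1, 0.3, 0.4, 0.1],
]

share = suffix_attention_share(toy_attn, suffix_positions=[2, 3], template_positions=[4])
print(f"suffix attention share: {share:.2f}")
```

In this toy example the template token places most of its attention on the two suffix tokens, which is the kind of pattern the abstract describes as hijacking. A faithful measurement would need the paper's own definition and the released code.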
