PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips

Zachary Coalson, Jeonghyun Woo, Chris S. Lin, Joyce Qu, Yu Sun, Shiyang Chen, Lishan Yang, Gururaj Saileshwar, Prashant Nair, Bo Fang, Sanghyun Hong

TL;DR

This work reveals a practical vulnerability in safety-aligned large language models: they can be jailbroken by targeted bit-flips in memory, induced with Rowhammer. It introduces PrisonBreak, an offline-online attack that combines a progressive bit search with a proxy-harmful-response objective to identify a small set (5–25) of critical bits, achieving 80–98% ASR across 10 open-source LLMs with minimal utility loss. The authors also demonstrate end-to-end exploitation on a GDDR6 GPU, achieving 69–91% ASR with only two physical bit locations, and provide structural and behavioral analyses that pinpoint vulnerable components and show how the attack's jailbreak dynamics differ from those of existing methods. They evaluate countermeasures and find that many defenses offer limited resilience against adaptive, hardware-aware attackers. The findings point to a need for integrated protections spanning memory hardware, weight representations, and alignment safeguards to mitigate such runtime, parameter-level attacks in real-world MLaaS deployments.

Abstract

We study a new vulnerability in commercial-scale safety-aligned large language models (LLMs): their refusal to generate harmful responses can be broken by flipping only a few bits in model parameters. Our attack jailbreaks billion-parameter language models with just 5 to 25 bit-flips, requiring up to 40$\times$ fewer bit flips than prior attacks on much smaller computer vision models. Unlike prompt-based jailbreaks, our method directly uncensors models in memory at runtime, enabling harmful outputs without requiring input-level modifications. Our key innovation is an efficient bit-selection algorithm that identifies critical bits for language model jailbreaks up to 20$\times$ faster than prior methods. We evaluate our attack on 10 open-source LLMs, achieving high attack success rates (ASRs) of 80-98% with minimal impact on model utility. We further demonstrate an end-to-end exploit via Rowhammer-based fault injection, reliably jailbreaking 5 models (69-91% ASR) on a GDDR6 GPU. Our analyses reveal that: (1) models with weaker post-training alignment require fewer bit-flips to jailbreak; (2) certain model components, e.g., value projection layers, are substantially more vulnerable; and (3) the attack is mechanistically different from existing jailbreak methods. We evaluate potential countermeasures and find that our attack remains effective against defenses at various stages of the LLM pipeline.
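To make concrete why a handful of bit-flips can matter, the following minimal sketch flips a single bit in one half-precision weight. It is only an illustration of IEEE 754 arithmetic, not the paper's bit-selection algorithm; the float16 storage format, the helper name `flip_bit_fp16`, and the example weight value are assumptions introduced here.

```python
import struct


def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit of `value` stored as IEEE 754 half precision (assumed format).

    Bit 0 is the least significant mantissa bit, bits 10-14 are the
    exponent, and bit 15 is the sign. Hypothetical helper for illustration.
    """
    if not 0 <= bit <= 15:
        raise ValueError("bit index must be in [0, 15]")
    # Reinterpret the half-precision float as a 16-bit unsigned integer.
    (raw,) = struct.unpack("<H", struct.pack("<e", value))
    # XOR with a one-hot mask flips exactly one bit.
    (flipped,) = struct.unpack("<e", struct.pack("<H", raw ^ (1 << bit)))
    return flipped


weight = 0.5  # illustrative weight value, not taken from any model
for bit in (0, 9, 14, 15):
    print(f"bit {bit:2d}: {weight} -> {flip_bit_fp16(weight, bit)}")
```

In this sketch, flipping a low mantissa bit barely changes the weight, while flipping the most significant exponent bit changes its magnitude by many orders. This is one reason an attacker would want to choose bit positions selectively rather than at random.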

Paper Structure

This paper contains 30 sections, 5 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: PrisonBreak workflow. An illustration of the steps in our offline and online procedures.
  • Figure 2: Impact of attack configurations on ASR and Acc. for Llama2-7B.
  • Figure 3: Structural analysis results. The proportion of target bits across bit locations, layer types, and layer positions of Llama2-7B. We consider the top 1% of target bits evaluated by our offline procedure across 25 successive iterations.
  • Figure 4: Behavioral analysis results. For each attack: the cosine similarity between HarmBench activations and the refusal direction across all layers of Vicuna-13b-v1.5 (left) and the UMAP visualization of activations from Alpaca (middle) and HarmBench (right) from the last layer of Qwen2-7B.
  • Figure 5: Full activation visualization results. The decomposed activations of Qwen2-7B on Alpaca and HarmBench, across various layers.
  • ...and 2 more figures

Theorems & Definitions (2)

  • Definition 5.1: Jailbreaking Score
  • Definition 5.2: Utility Score