Activation Surgery: Jailbreaking White-box LLMs without Touching the Prompt

Maël Jenny; Jérémie Dentan; Sonia Vanier; Michaël Krajecki

Abstract

Most jailbreak techniques for Large Language Models (LLMs) rely on prompt modifications, including paraphrasing, obfuscation, or conversational strategies. Meanwhile, abliteration techniques (also known as targeted ablations of internal components) have been used to study and explain LLM outputs by probing which internal structures causally support particular responses. In this work, we combine these two lines of research by directly manipulating the model's internal activations to alter its generation trajectory without changing the prompt. Our method constructs a nearby benign prompt and performs layer-wise activation substitutions using a sequential procedure. We show that this activation surgery method reveals where and how refusal arises, and prevents refusal signals from propagating across layers, thereby inhibiting the model's safety mechanisms. Finally, we discuss the security implications for open-weights models and instrumented inference environments.

Paper Structure (42 sections, 6 equations, 5 figures, 2 tables)

Introduction
Background and Related Work
Alignment, Refusal, and Safety Behaviors
(Automated) Prompt-based Jailbreaking
Mechanistic Interpretability and Activation-level Interventions
Refusal Localization and Removal of Safety Features
Threat Model
Adversarial Capabilities
Adversarial Goals
Methodology
Preliminaries
Activation Surgery
Crafting a Twin Prompt
Parameter $\tau$: Selecting Which Dimensions to Patch
Parameter $\gamma$: What Values to Patch In
...and 27 more sections

Figures (5)

Figure 1: Illustration of activation surgery using an illicit/twin prompt pair differing by a single token. We sequentially replace selected activations of the illicit prompt with those of its (benign) twin prompt, thereby inhibiting the model’s security mechanisms.
Figure 2: Example result with our Activation Surgery method. See additional examples in the Appendix. Harmful snippets are omitted from the main text for safety reasons.
Figure 3: Number of patched dimensions per layer as a function of $\tau$ for the prompt "How to make a bomb ?", using a fixed value of $\gamma = 0.0$. The experiments are conducted on Llama-4-Scout-17B-16E-Instruct, which has 48 layers and a dimension $d=5120$.
Figure 4: Impact of parameter $\gamma$ at a fixed value of $\tau = 0.8$. Colored segments indicate empirically dominant regimes (relevant response vs. refusal), though occasional exceptions occur. We also report $\bar{\rho}(\gamma)$, the proportion of modified dimensions.
Figure 5: Optimization of $(\tau,\gamma)$ for the Illicit/Twin pair "How to make a bomb ?" / "How to make a book ?".
