What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

Nathalie Kirch, Constantin Weisser, Severin Field, Helen Yannakoudakis, Stephen Casper

TL;DR

The paper studies jailbreak mechanisms in LLMs beyond simple linear signals by introducing a large, model-agnostic jailbreak dataset and probing prompt representations with both linear and non-linear methods. It shows that jailbreak success can be predicted from prompt representations in-distribution, but that transfer across attack families is limited, indicating attack-type-specific, non-universal features. Non-linear probes, especially MLPs, generalize better across layers and enable causal latent-space interventions that steer model behavior more reliably than linear approaches. The work provides a prompt-side mechanistic framework for analyzing and testing jailbreak features, and highlights the need for adaptive defenses that account for non-linear, model-specific vulnerabilities. Overall, the findings challenge the hypothesis that jailbreak signals are universal and offer a foundation for mechanistic safety research in open-weight LLMs.

Abstract

Jailbreaks have been a central focus of research regarding the safety and reliability of large language models (LLMs), yet the mechanisms underlying these attacks remain poorly understood. While previous studies have predominantly relied on linear methods to detect jailbreak attempts and model refusals, we take a different approach by examining both linear and non-linear features in prompts that lead to successful jailbreaks. First, we introduce a novel dataset comprising 10,800 jailbreak attempts spanning 35 diverse attack methods. Leveraging this dataset, we train linear and non-linear probes on hidden states of open-weight LLMs to predict jailbreak success. Probes achieve strong in-distribution accuracy but transfer is attack-family-specific, revealing that different jailbreaks are supported by distinct internal mechanisms rather than a single universal direction. To establish causal relevance, we construct probe-guided latent interventions that systematically shift compliance in the predicted direction. Interventions derived from non-linear probes produce larger and more reliable effects than those from linear probes, indicating that features linked to jailbreak success are encoded non-linearly in prompt representations. Overall, the results surface heterogeneous, non-linear structure in jailbreak mechanisms and provide a prompt-side methodology for recovering and testing the features that drive jailbreak outcomes.

Paper Structure

This paper contains 32 sections, 1 equation, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Gemma-7b-it complies with a harmful request under a non-linear probe-guided latent space attack. We designed this attack using a multilayer perceptron probe trained to distinguish successful from unsuccessful jailbreaking prompts (further details in \ref{sec:intervention}).
  • Figure 2: A non-linear probe-guided defensive latent space perturbation makes Gemma-7b-it refuse a harmful request. We designed this perturbation using a multilayer perceptron (MLP) probe trained to distinguish successful from unsuccessful jailbreaking prompts (further details in \ref{sec:intervention}).
  • Figure 3: Probe accuracy tends to increase with layer depth. Vertical lines show confidence intervals for probe accuracy. The y-axis minimum represents random chance. See \ref{sec:intervention} for intervention details and \ref{sec:confusion-matrices} for exact values.
  • Figure 4: Non-linear MLP probes (Bottom) transfer better to unseen layers than linear probes (Top). Transfer to other layers is measured by the total number of probes that achieve $> 80 \%$ accuracy when predicting unseen layers (excluding the diagonal).
  • Figure 5: Both linear and non-linear probes have a limited ability to classify successful jailbreaking prompts from held-out attack methods. This suggests that successful jailbreaks from different methods attack the model using different, non-linear prompt features. Per model and intervention, we train a set of 10 probes, each with one attack type held out. The blue bars correspond to the train accuracy of the probes when trained on all attack types minus the held-out one, while the red bars correspond to the accuracy of the same probe on the held-out attack. The error bars represent an upper bound for standard error of the test accuracy for each hold-out attack type, calculated as: $\sqrt{\text{test\_acc} (1 - \text{test\_acc}) / n_{\text{test}}}$. The dashed red lines indicate a random guess baseline.
  • ...and 10 more figures
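
The two quantities named in the captions of Figures 4 and 5 can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the layout of the accuracy matrix (rows index the layer a probe was trained on, columns the layer it is evaluated on), and the example numbers are all assumptions made for illustration.

```python
import math


def standard_error(test_acc: float, n_test: int) -> float:
    """Error-bar formula quoted in the Figure 5 caption:
    sqrt(test_acc * (1 - test_acc) / n_test)."""
    if not 0.0 <= test_acc <= 1.0:
        raise ValueError("test_acc must lie in [0, 1]")
    if n_test <= 0:
        raise ValueError("n_test must be positive")
    return math.sqrt(test_acc * (1.0 - test_acc) / n_test)


def count_transferring_probes(acc, threshold=0.80):
    """Count off-diagonal entries of a layer-by-layer accuracy matrix
    that exceed the threshold (one reading of the Figure 4 metric).

    Assumed layout: acc[i][j] is the accuracy of the probe trained on
    layer i when evaluated on layer j. Diagonal entries (i == j) are
    skipped because they are not transfer to an unseen layer.
    """
    return sum(
        1
        for i, row in enumerate(acc)
        for j, value in enumerate(row)
        if i != j and value > threshold
    )


# Made-up accuracies for three layers, purely for illustration.
example_acc = [
    [0.90, 0.85, 0.60],
    [0.81, 0.95, 0.79],
    [0.50, 0.80, 0.99],
]
se = standard_error(0.5, 100)
n_transfer = count_transferring_probes(example_acc)
```

In this example the entry equal to exactly 0.80 is not counted, because the caption specifies a strict $> 80 \%$ threshold.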