Bleeding Pathways: Vanishing Discriminability in LLM Hidden States Fuels Jailbreak Attacks

Yingjie Zhang, Tong Liu, Zhe Zhao, Guozhu Meng, Kai Chen

TL;DR

This work identifies a fundamental vulnerability in LLM safety: the discriminability between harmful and safe representations degrades as generation proceeds, undermining intent disambiguation and safety-utility trade-offs. It introduces DeepAlign, an endogenous defense that uses contrastive hidden-state steering at the generation midpoint, along with Detoxify and Retain losses, to maintain separability and enable continuous toxicity detection. Across diverse models and attacks, DeepAlign achieves near-zero jailbreak success while preserving utility and reducing over-refusal, substantially improving the safety-utility frontier. The results advocate for generation-time, hidden-state–level defenses as a robust path forward for intrinsic LLM safety.

Abstract

LLMs remain vulnerable to jailbreak attacks that exploit adversarial prompts to circumvent safety measures. Current safety fine-tuning approaches face two critical limitations. First, they often fail to strike a balance between security and utility, where stronger safety measures tend to over-reject harmless user requests. Second, they frequently miss malicious intent concealed within seemingly benign tasks, leaving models exposed to exploitation. Our work identifies a fundamental cause of these issues: during response generation, an LLM's capacity to differentiate harmful from safe outputs deteriorates. Experimental evidence confirms this, revealing that the separability between hidden states for safe and harmful responses diminishes as generation progresses. This weakening discrimination forces models to make compliance judgments earlier in the generation process, restricting their ability to recognize developing harmful intent and contributing to both aforementioned failures. To mitigate this vulnerability, we introduce DEEPALIGN, an inherent defense framework that enhances the safety of LLMs. By applying contrastive hidden-state steering at the midpoint of response generation, DEEPALIGN amplifies the separation between harmful and benign hidden states, enabling continuous intrinsic toxicity detection and intervention throughout the generation process. Across diverse LLMs spanning varying architectures and scales, it reduced the attack success rates of nine distinct jailbreak attacks to near-zero or minimal levels. Crucially, it preserved model capability while reducing over-refusal. Models equipped with DEEPALIGN exhibited up to 3.5% lower error rates in rejecting challenging benign queries and maintained standard task performance with less than 1% decline. This marks a substantial advance in the safety-utility Pareto frontier.

Paper Structure

This paper contains 30 sections, 4 equations, 7 figures, 11 tables, 1 algorithm.

Figures (7)

  • Figure 1: Conceptual comparison of jailbreak attempts on conventionally aligned LLMs (left) and those aligned with our technique (right). While a single attack method is shown here as an example, our method defends against existing attacks designed to elicit toxic content.
  • Figure 2: Test accuracy of linear classifiers of benign and harmful hidden states across response tokens for each layer.
  • Figure 3: Detoxify Loss Generation Process Example, with special tokens of the chat template denoted as "<sep>". These special tokens, employed by all LLMs we assessed in different forms, are crucial for distinguishing responses from queries. The original harmful query is denoted by "{{malicious query}}", and "{{prompt}}" denotes the prompt template for generating safe responses. For clarity, the figure illustrates the last two words ("Informative Tone") of the detoxified prompt. The reference model processes the detoxified prompt, the malicious query, and the words concatenated after them, and produces hidden states ($H_{safe}$ and $H_{bad}$) accordingly.
  • Figure 4: Test accuracies of linear probes of each token position and layer.
  • Figure 5: Test accuracies of linear probes of each layer for different token positions within the reasoning sequence.
  • ...and 2 more figures
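
The linear-probe measurements referred to in Figures 2, 4, and 5 can be illustrated with a minimal sketch. This is not the authors' code. It uses synthetic Gaussian vectors as a stand-in for hidden states, and the function names and the `gap` parameter (a hypothetical proxy for how far apart benign and harmful states sit at a given token position) are assumptions made for illustration only. The sketch fits a logistic-regression probe per "position" and reports held-out accuracy, which should fall as the gap shrinks.

```python
import numpy as np

def make_synthetic_states(n, dim, gap, rng):
    """Synthetic stand-in for hidden states (assumption, not real model data).

    Two Gaussian classes whose means differ by 2 * gap along axis 0.
    """
    labels = rng.integers(0, 2, size=n)
    states = rng.normal(size=(n, dim))
    states[:, 0] += gap * (2 * labels - 1)  # shift each class along axis 0
    return states, labels

def fit_linear_probe(x, y, steps=500, lr=0.1):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(label == 1)
        grad = p - y                             # gradient of the log loss w.r.t. logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(w, b, x, y):
    """Fraction of examples the probe classifies correctly."""
    preds = ((x @ w + b) > 0).astype(int)
    return float((preds == y).mean())

def separability_by_position(gaps, n=400, dim=16, seed=0):
    """Train and test one probe per hypothetical token position."""
    rng = np.random.default_rng(seed)
    accuracies = []
    for gap in gaps:
        x, y = make_synthetic_states(n, dim, gap, rng)
        half = n // 2  # first half trains the probe, second half tests it
        w, b = fit_linear_probe(x[:half], y[:half])
        accuracies.append(probe_accuracy(w, b, x[half:], y[half:]))
    return accuracies

# Shrinking gaps mimic class separation fading at later token positions.
gaps = [2.0, 1.0, 0.5, 0.1]
accuracies = separability_by_position(gaps)
for position, (gap, acc) in enumerate(zip(gaps, accuracies)):
    print(f"position {position}: gap={gap:.1f} probe accuracy={acc:.2f}")
```

With real models, the synthetic generator would be replaced by hidden states extracted at a chosen layer and response-token position, and the probe accuracy would be plotted per layer and position, as the figure captions describe.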