Matching Ranks Over Probability Yields Truly Deep Safety Alignment

Jason Vega, Gagandeep Singh

TL;DR

The paper reveals that rank-based exploitation (RAP) can undermine data-augmentation–driven deep safety alignment, exposing a gap where low-probability but high-ranked harmful tokens remain accessible. It reframes safety alignment as a rank-matching problem and introduces Push-Forward Alignment (PFA) and the attention-regularization method PRESTO to push forward refusals and suppress harmful prefill influence. Empirical results across multiple open-source LLMs show PRESTO substantially increases resistance to RAP with minimal utility loss, while analyses of attention patterns support the mechanistic interpretation. These findings offer a practical, scalable path toward Truly Deep Safety Alignment in instruction-tuned LLMs.

Abstract

A frustratingly easy technique known as the prefilling attack has been shown to effectively circumvent the safety alignment of frontier LLMs by simply prefilling the assistant response with an affirmative prefix before decoding. In response, recent work proposed a supervised fine-tuning (SFT) defense using data augmentation to achieve a "deep" safety alignment, allowing the model to generate natural language refusals immediately following harmful prefills. Unfortunately, we show in this work that the "deep" safety alignment produced by such an approach is in fact not very deep. A generalization of the prefilling attack, which we refer to as the Rank-Assisted Prefilling (RAP) attack, can effectively extract harmful content from models fine-tuned with the data augmentation defense by selecting low-probability "harmful" tokens from the top 20 predicted next tokens at each step (thus ignoring high-probability "refusal" tokens). We argue that this vulnerability is enabled by the "gaming" of the SFT objective when the target distribution entropies are low: low fine-tuning loss is achieved by shifting large probability mass to a small number of refusal tokens while neglecting the high ranks of harmful tokens. We then propose a new perspective on achieving deep safety alignment by matching the token ranks of the target distribution, rather than their probabilities. This perspective yields a surprisingly simple fix to the data augmentation defense based on regularizing the attention placed on harmful prefill tokens, an approach we call PRefill attEntion STOpping (PRESTO). Adding PRESTO yields up to a 4.7x improvement in the mean StrongREJECT score under RAP attacks across three popular open-source LLMs, with low impact on model utility.
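The gap between probability and rank described in the abstract can be illustrated with a toy next-token distribution. The sketch below is not the authors' code: the five-token vocabulary, the logit values, and the helper names `softmax` and `ranks` are all hypothetical, chosen only to show that a near-zero SFT loss on a refusal token is compatible with a non-refusal token sitting at rank 2.

```python
import math

def softmax(logits):
    """Convert a {token: logit} dict into a {token: probability} dict."""
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def ranks(probs):
    """Return {token: rank}, where rank 1 is the most probable token."""
    ordered = sorted(probs, key=probs.get, reverse=True)
    return {t: i + 1 for i, t in enumerate(ordered)}

# Hypothetical logits after a harmful prefill (illustrative values only).
# "I" stands in for a refusal-starting token, "the" for a token that
# continues the prefill.
toy_logits = {"I": 12.0, "the": 4.0, "Sorry": 3.5, "As": 3.0, "a": 2.0}

probs = softmax(toy_logits)
token_ranks = ranks(probs)

# Cross-entropy against a one-hot target on the refusal token "I".
sft_loss = -math.log(probs["I"])

print(f"p(I)   = {probs['I']:.5f}, rank {token_ranks['I']}")
print(f"p(the) = {probs['the']:.5f}, rank {token_ranks['the']}")
print(f"SFT loss on refusal target = {sft_loss:.5f}")
```

In this toy setting the loss on the refusal target is close to zero, so a probability-matching objective has little left to optimize, yet "the" remains the second-ranked token and would appear in any top-k list with k of at least 2. A rank-matching view instead asks which tokens occupy the top ranks, regardless of how little probability mass they carry.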

Paper Structure

This paper contains 37 sections, 3 equations, 11 figures, 8 tables, 1 algorithm.

Figures (11)

  • Figure 1: A demonstration of the Rank-Assisted Prefilling (RAP) attack against the Llama 2 7B Chat checkpoint fine-tuned for deep safety alignment from qi2025safety on a request for bomb-making instructions. In the first step (left), we show the top 10 tokens and their probabilities from the next token probability distribution following a harmful prefill (red). Nearly all of the probability mass is concentrated on the top-ranked token "I", yet selecting this token leads to many future decoding paths that refuse the request, such as "I cannot fulfill your request ..." Instead, the "the" token can be selected at this step despite its low probability, and then appended to the input to yield the input for the next step. Repeating this process extracts harmful content fulfilling the request that is not likely to be generated by traditional sampling-based decoding strategies.
  • Figure 2: An illustration of the Push-Forward Alignment (PFA) approach to deep safety alignment on a request for bomb-making instructions. On the far left, we show the top 10 tokens for the first decoding step from the original Llama 2 7B Chat model touvron2023llama when given the prompt without any prefill. The highest-ranked future decoding paths from these tokens tend to be refusals (e.g., "Sorry, but I cannot fulfill..."). When a harmful prefill is added, the top-ranked tokens from the first step are "pushed forward" to the current step, which helps to reduce the presence of harmful tokens that continue the prefill. Highly-ranked future decoding paths from the first step can also be pushed forward to help enable natural language refusal generation under other threat models (such as prefilling attacks under traditional sampling-based decoding strategies).
  • Figure 3: Mean StrongREJECT scores of RAP attacks for models fine-tuned with the data augmentation approach of qi2025safety, with (orange) and without (blue) PRESTO. Scores are on a scale of $[0, 1]$ with higher values indicating greater harmfulness. For the human RAP evaluation, we display the mean and standard deviation over three participants. "DA" denotes the data augmentation approach of qi2025safety.
  • Figure 4: Few-shot prompt template for prefilling attack generation. For brevity, we omit most of the few-shot examples. In total, 25 few-shot examples are used. <query prompt> is replaced with the actual prompt to generate the prefill for.
  • Figure 5: A screenshot of the terminal interface used during the human RAP evaluation. In this example, the user has already completed one prompt, taking a time of three minutes and eight seconds. The second prompt is currently being attacked, and the user has already taken three attack steps (which took a total of three seconds). The top 20 predicted next tokens are shown for the current prefill. The ">" symbol indicates where the user can enter their actions via the keyboard.
  • ...and 6 more figures