Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs

Jean-Charles Noirot Ferrand, Yohan Beugin, Eric Pauley, Ryan Sheatsley, Patrick McDaniel

TL;DR

Aligned LLMs remain vulnerable to jailbreaks despite being trained with safety objectives. The authors hypothesize, and support empirically, that alignment embeds a safety classifier in the model, and they present a practical method to extract a surrogate of that classifier from a small substructure of the model using linear probing. Across multiple open-weight models and two jailbreak-focused datasets, a surrogate using as little as 20% of the model agrees with the embedded safety classifier at F1 > 0.8, and adversarial inputs crafted on the surrogate transfer to the full LLM with high success at lower compute cost. Because the framework targets the core safety boundary rather than the entire model, it makes red-teaming cheaper and can guide defenses, and it may extend to other alignment failures such as hallucinations and bias.

Abstract

Alignment in large language models (LLMs) is used to enforce guidelines such as safety. Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs. In this paper, we introduce and evaluate a new technique for jailbreak attacks. We observe that alignment embeds a safety classifier in the LLM responsible for deciding between refusal and compliance, and seek to extract an approximation of this classifier: a surrogate classifier. To this end, we build candidate classifiers from subsets of the LLM. We first evaluate the degree to which candidate classifiers approximate the LLM's safety classifier in benign and adversarial settings. Then, we attack the candidates and measure how well the resulting adversarial inputs transfer to the LLM. Our evaluation shows that the best candidates achieve accurate agreement (an F1 score above 80%) using as little as 20% of the model architecture. Further, we find that attacks mounted on the surrogate classifiers can be transferred to the LLM with high success. For example, a surrogate using only 50% of the Llama 2 model achieved an attack success rate (ASR) of 70% with half the memory footprint and runtime -- a substantial improvement over attacking the LLM directly, where we only observed a 22% ASR. These results show that extracting surrogate classifiers is an effective and efficient means for modeling (and therein addressing) the vulnerability of aligned models to jailbreaking attacks. The code is available at https://github.com/jcnf0/targeting-alignment.
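The extraction step described in the abstract (take a subset of the LLM, attach a classification head, train it on the LLM's own refuse/comply decisions, then measure agreement with F1) can be illustrated with a minimal sketch. This is not the authors' implementation, which lives in the linked repository. The feature vectors below are synthetic Gaussian stand-ins for intermediate hidden states, the labels are stand-ins for the LLM's decisions, and every name (`make_synthetic_features`, `train_linear_probe`, `predict`, `f1_score`, `run_demo`) is hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_features(n_per_class, dim, rng):
    """ASSUMPTION: Gaussian blobs stand in for hidden states of a model subset.

    Label 1 = the LLM refused (input treated as unsafe), 0 = it complied.
    """
    features, labels = [], []
    for label, center in ((1, 1.5), (0, -1.5)):
        for _ in range(n_per_class):
            features.append([rng.gauss(center, 1.0) for _ in range(dim)])
            labels.append(label)
    return features, labels


def train_linear_probe(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression head with plain batch gradient descent."""
    dim, n = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, features):
    """Return 1 (refuse) where the probe's logit is positive, else 0."""
    return [
        1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0
        for x in features
    ]


def f1_score(y_true, y_pred):
    """F1 of the positive (refusal) class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def run_demo(seed=0):
    """Train the probe on one synthetic split and score agreement on another."""
    rng = random.Random(seed)
    train_x, train_y = make_synthetic_features(100, 4, rng)
    test_x, test_y = make_synthetic_features(50, 4, rng)
    weights, bias = train_linear_probe(train_x, train_y)
    return f1_score(test_y, predict(weights, bias, test_x))


print(f"probe/LLM agreement F1: {run_demo():.2f}")
```

In a real setting the synthetic features would be replaced by activations taken from the chosen portion of the model, and the labels by whether the full LLM refused or complied on each prompt. The F1 then measures how closely the candidate tracks the LLM's decision, not ground-truth safety.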

Paper Structure

This paper contains 29 sections, 6 equations, 15 figures, and 5 tables.

Figures (15)

  • Figure 1: In this work, we (A) hypothesize that alignment embeds a safety classifier in LLMs responsible for the classification of safe and unsafe inputs. Then, we design an approach that (B) builds candidate classifiers using the model structure. We evaluate these candidates in benign and adversarial settings to select the best, called the surrogate classifier. Finally, we (C) attack the surrogate classifier to generate adversarial inputs that transfer to the LLM.
  • Figure 2: Silhouette score (measure of separation) of unsafe and safe input embeddings for different LLMs.
  • Figure 3: Methodology overview. In the first step, we extract the safety classifier of an LLM by (A) selecting a structure within the model and (B) training a classification head on the predictions of the LLM to create a candidate classifier. We verify its performance in benign settings. Then, we measure the transferability between the LLM and the candidate in both directions: adversarial examples of the LLM transferred to the candidate (C) and adversarial examples of the candidate transferred to the LLM (D).
  • Figure 4: Test $F_1$ of the candidate classifiers in benign settings, depending on the normalized candidate size.
  • Figure 5: Test $F_1$ of the candidate classifiers in benign settings, on the dataset they were not trained on.
  • ...and 10 more figures
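
The silhouette score named in the Figure 2 caption can be made concrete with a short sketch. For each embedding, let a be its mean distance to the other embeddings with the same label and b its mean distance to the embeddings with the other label; its silhouette coefficient is (b - a) / max(a, b), and the score is the mean over all embeddings. Values lie in [-1, 1], and values near 1 mean the safe and unsafe groups are well separated. The code below assumes Euclidean distance and two non-empty groups. The function names are hypothetical, and the captions do not state which distance metric or library the authors used.

```python
import math


def mean_distance(point, others):
    """Average Euclidean distance from `point` to every vector in `others`."""
    return sum(math.dist(point, o) for o in others) / len(others)


def silhouette_score(safe, unsafe):
    """Mean silhouette coefficient over two labeled groups of embeddings.

    ASSUMPTION: Euclidean distance; both groups must be non-empty.
    """
    if not safe or not unsafe:
        raise ValueError("both groups must contain at least one embedding")
    scores = []
    for own, other in ((safe, unsafe), (unsafe, safe)):
        for i, point in enumerate(own):
            rest = own[:i] + own[i + 1:]
            if not rest:
                # Usual convention: a singleton group contributes 0.
                scores.append(0.0)
                continue
            a = mean_distance(point, rest)   # cohesion within the same label
            b = mean_distance(point, other)  # separation from the other label
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D "embeddings": the first pair of groups is far apart, the second overlaps.
separated = silhouette_score([(0, 0), (0, 1)], [(10, 0), (10, 1)])
overlapping = silhouette_score([(0, 0), (1, 1)], [(0, 0), (1, 1)])
print(f"separated: {separated:.3f}, overlapping: {overlapping:.3f}")
```

Computed on real embeddings of safe and unsafe prompts, a rising score would indicate that the two kinds of input are becoming easier to tell apart, which is the separation the figure reports for different LLMs.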