Latent Adversarial Training Improves the Representation of Refusal

Alexandra Abbas, Nora Petrova, Helios Ael Lyons, Natalia Perez-Campanero

TL;DR

This work investigates Latent Adversarial Training (LAT), a safety-training approach that perturbs internal representations rather than inputs to improve refusal robustness. Analyzing Llama-2-7B-chat, it shows that LAT reshapes the latent encoding of refusal, concentrating it in the first two SVD components, which together explain about $75\%$ of the variance in activation differences. This concentration makes refusal vectors extracted from the LAT model more effective and more transferable across models. However, LAT is also more vulnerable than SSFT and AT to ablation attacks that use its own self-generated vectors, which reveals a trade-off between robustness to transferred vectors and exposure to tailored latent-space manipulations. Overall, LAT is a promising direction for improving safety, with notable strengths against cross-model attacks, but it still needs mitigations for self-attacks and further validation of the latent-direction interpretation across architectures and datasets.

Abstract

Recent work has shown that language models' refusal behavior is primarily encoded in a single direction in their latent space, making it vulnerable to targeted attacks. Although Latent Adversarial Training (LAT) attempts to improve robustness by introducing noise during training, a key question remains: how does this noise-based training affect the underlying representation of refusal behavior? Understanding this encoding is crucial for evaluating LAT's effectiveness and limitations, just as the discovery of linear refusal directions revealed vulnerabilities in traditional supervised safety fine-tuning (SSFT). Through an analysis of Llama 2 7B, we examine how LAT reorganizes refusal behavior in the model's latent space compared to SSFT and embedding-space adversarial training (AT). By computing activation differences between harmful and harmless instruction pairs and applying Singular Value Decomposition (SVD), we find that LAT significantly alters the refusal representation, concentrating it in the first two SVD components, which explain approximately 75 percent of the variance in activation differences, significantly more than in the reference models. This concentrated representation leads to more effective and transferable refusal vectors for ablation attacks: LAT models show improved robustness when attacked with vectors from reference models but are more vulnerable to self-generated vectors than SSFT and AT models. Our findings suggest that LAT's training perturbations enable a more comprehensive representation of refusal behavior, highlighting both its potential strengths and its vulnerabilities for improving model safety.
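The variance analysis described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `explained_variance_ratio`, the synthetic activations, and the choice to run SVD on the uncentered difference matrix are all assumptions made here for illustration. Real use would replace the synthetic arrays with activations collected from a chosen layer of each model variant.

```python
import numpy as np

def explained_variance_ratio(harmful_acts, harmless_acts):
    """Share of total squared singular value carried by each SVD component
    of the pairwise activation differences (harmful minus harmless).

    Assumption: differences are not mean-centered before the SVD.
    """
    diffs = harmful_acts - harmless_acts              # shape (n_pairs, d_model)
    # compute_uv=False returns only singular values, sorted in descending order.
    singular_values = np.linalg.svd(diffs, compute_uv=False)
    energy = singular_values ** 2
    return energy / energy.sum()

# Synthetic stand-in data (hypothetical): differences live mostly in a
# two-dimensional subspace, mimicking a concentrated refusal encoding.
rng = np.random.default_rng(0)
n_pairs, d_model = 200, 64
basis = np.linalg.qr(rng.normal(size=(d_model, 2)))[0].T   # (2, d_model), orthonormal rows
coeffs = rng.normal(size=(n_pairs, 2)) * np.array([5.0, 3.0])
harmless = rng.normal(size=(n_pairs, d_model))
harmful = harmless + coeffs @ basis + 0.1 * rng.normal(size=(n_pairs, d_model))

ratios = explained_variance_ratio(harmful, harmless)
top_two_share = ratios[:2].sum()   # analogous to the "first two components" figure
print(f"first two components explain {top_two_share:.1%} of the variance")
```

Comparing `ratios[:2].sum()` across model variants is the kind of comparison the abstract summarizes, where the LAT variant's first two components account for roughly 75 percent.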

Paper Structure

This paper contains 22 sections, 2 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Comparison of refusal rates under different ablation attack vectors across Llama-2-7B-chat model variants. The baseline SSFT model is denoted simply as "Llama-2-7B-chat" in the figure. The red bars ("Self-generated refusal vector") represent each model's refusal rate when attacked using a refusal vector generated from its own activations. The gray bars ("Refusal vector generated from baseline") show the refusal rate when attacked using a vector from the baseline Llama-2-7B-chat model. The green bars ("Refusal vector generated from LAT") indicate the refusal rate when attacked using a vector generated from the LAT model. All statistics are computed from a test set of 520 examples; see the statistics appendix for statistical confidence measures.
  • Figure 2: Explained variance by SVD components across model variants. The plot shows the percentage of variance explained by the first six SVD components of activation differences between harmful and harmless instruction pairs for the base Llama-2-7B-chat model and its embedding-space AT and LAT variants. The first components of the baseline and AT variants explain 49.43% and 43.76% of the variance respectively, while their second components account for only about 5% each. In contrast, the LAT variant has both a strong first component (54%) and a substantial second component (20%), together roughly 74%, suggesting a more concentrated two-dimensional encoding of refusal.
  • Figure 3: Principal Component Analysis (PCA) visualization of harmful vs harmless instruction representations across different network layers and model variants. Each point represents the activation pattern for a single instruction, projected onto the first two principal components. Blue points indicate harmless instructions, while red points represent harmful instructions. The plots reveal how LAT affects the separability of these instruction types in the model's latent space.
  • Figure 4: Layer-wise analysis of refusal rates under self-generated refusal vector attacks. The plot shows how refusal rates vary across different layers of the model architecture for the base Llama-2-7B-chat model and its Embeddings AT and LAT variants when attacked using their own refusal vectors from the same layer.
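
The ablation attacks referenced in Figures 1 and 4 remove a refusal vector from a model's activations. The sketch below shows one common formulation, under stated assumptions: the refusal vector is taken as the normalized difference of mean activations between harmful and harmless instructions, and ablation projects that direction out of each activation. The captions do not specify how the vectors were computed, so both choices, along with the function names and synthetic data, are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean.

    Assumption: a difference-of-means direction; the source does not state
    the exact extraction procedure.
    """
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def ablate_direction(acts, unit_direction):
    """Remove each activation's component along unit_direction."""
    projections = acts @ unit_direction               # shape (n,)
    return acts - np.outer(projections, unit_direction)

# Synthetic stand-in activations (hypothetical): harmful prompts are shifted
# along one coordinate axis relative to harmless ones.
rng = np.random.default_rng(1)
n, d_model = 100, 32
shift = np.zeros(d_model)
shift[0] = 4.0
harmless = rng.normal(size=(n, d_model))
harmful = rng.normal(size=(n, d_model)) + shift

r_hat = refusal_direction(harmful, harmless)
ablated = ablate_direction(harmful, r_hat)

mean_projection_before = float((harmful @ r_hat).mean())
mean_projection_after = float((ablated @ r_hat).mean())
```

In a cross-model attack such as the gray and green bars in Figure 1, `r_hat` would be computed from one model's activations and applied to another's. In the layer-wise setting of Figure 4, the direction is extracted and ablated at the same layer.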